Spark Records is a data processing pattern with an associated lightweight, dependency-free framework for Apache Spark v2+ that enables:
- Bulletproof data processing with Spark
  Your jobs will never unpredictably fail midway due to data transformation bugs. Spark Records gives you predictable failure control through instant data quality checks performed on metrics automatically collected during job execution, without any additional querying.
- Automatic row-level structured logging
  Exceptions generated during job execution are automatically associated with the data that caused the exception, down to nested exception causes and full stack traces. If you need to reprocess data, you can trivially and efficiently choose to process only the failed inputs.
- Lightning-fast root cause analysis
  Get answers to any questions related to exceptions or warnings generated during job execution directly using SparkSQL or your favorite Spark DSL. Would you like to see the top 5 issues encountered during job execution with example source data and the line in your code that caused the problem? You can.
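To make the pattern concrete, here is a minimal sketch in plain Scala collections, with no Spark dependency. It is not the Spark Records API: `Issue`, `Envelope`, `build`, `errorRate` and `topIssues` are hypothetical names chosen for illustration. The sketch only shows the idea of keeping each exception next to the input that caused it, deriving a data quality metric from counts gathered while processing, and ranking the most frequent issues with an example input.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical types for illustration only; NOT the Spark Records API.
final case class Issue(message: String, stackTrace: Seq[String])

// Pairs every input with either its transformed output or the issue it caused.
final case class Envelope[+In, +Out](source: In, data: Option[Out], issue: Option[Issue])

object RecordPatternSketch {
  // Run the transformation on one input and capture any non-fatal exception
  // together with the input that triggered it.
  def build[In, Out](input: In)(transform: In => Out): Envelope[In, Out] =
    Try(transform(input)) match {
      case Success(out) => Envelope(input, Some(out), None)
      case Failure(e) =>
        val message = s"${e.getClass.getSimpleName}: ${e.getMessage}"
        val trace = e.getStackTrace.toSeq.map(_.toString)
        Envelope(input, None, Some(Issue(message, trace)))
    }

  // Data quality metric: fraction of records that carry an issue.
  def errorRate(records: Seq[Envelope[Any, Any]]): Double =
    if (records.isEmpty) 0.0
    else records.count(_.issue.isDefined).toDouble / records.size

  // The n most frequent issue messages, each with its count and one example input.
  def topIssues[In](records: Seq[Envelope[In, Any]], n: Int): Seq[(String, Int, In)] =
    records
      .collect { case Envelope(source, _, Some(issue)) => (issue.message, source) }
      .groupBy { case (message, _) => message }
      .toSeq
      .map { case (message, hits) => (message, hits.size, hits.head._2) }
      .sortBy { case (_, count, _) => -count }
      .take(n)

  def main(args: Array[String]): Unit = {
    val inputs = Seq("1", "2", "x", "4", "y")
    val records = inputs.map(in => build(in)(_.toInt))

    // Decide whether the job output is acceptable without re-querying the data.
    val maxErrorRate = 0.5 // hypothetical threshold
    val rate = errorRate(records)
    println(s"error rate = $rate, acceptable = ${rate <= maxErrorRate}")

    // Root cause analysis: most common issues with an example source input.
    topIssues(records, 5).foreach { case (message, count, example) =>
      println(s"$count x $message (example input: $example)")
    }

    // Reprocessing: select only the inputs that failed.
    val failedInputs = records.filter(_.issue.isDefined).map(_.source)
    println(s"inputs to reprocess: $failedInputs")
  }
}
```

In a real job the same shape would live in a Spark `Dataset`, so the filtering and grouping above become ordinary SparkSQL or DSL queries over the saved output.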
Spark Records has been tested with petabyte-scale data at Swoop. The library was extracted out of Swoop's production systems to share with the Spark community.
See the documentation for more information or watch the Spark Summit talk (slides).
Just add the following to your libraryDependencies in SBT:
resolvers += Resolver.bintrayRepo("swoop-inc", "maven")
libraryDependencies += "com.swoop" %% "spark-records" % "<version>"
You can find all released versions here.
Contributions and feedback of any kind are welcome.
Spark Records is maintained by Sim Simeonov and the team at Swoop.
Special thanks to Reynold Xin and Michael Armbrust for many interesting conversations about better ways to use Spark.
Build docs microsite
sbt "project docs" makeMicrosite
Run docs microsite locally (run under the target/site folder)
jekyll serve -b /spark-records
spark-records is Copyright © 2017 Simeon Simeonov and Swoop, Inc. It is free software, and may be redistributed under the terms of the LICENSE.