Parts of Spark jobs that perform pure-functional transformations on DataFrames or RDDs, independent of any I/O, are ideal candidates for unit testing.
I demonstrated this in a trivial project here. The project has two main classes: one takes a DataFrame and returns a new DataFrame with an additional column; the other tests that transformation. The tests use a library that provides a trait (SharedSparkContext) for setting up a default SparkContext for local testing, so Spark runs in local mode on a workstation without any Hadoop ecosystem dependencies.
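As a sketch of what such a pair of classes might look like (the names `AddColumn`, `withGreeting`, and the column contents are hypothetical, not taken from the project; the `SharedSparkContext` trait here is the one provided by the spark-testing-base library, and the ScalaTest base class name varies by ScalaTest version):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical transformation class: a pure function from DataFrame to
// DataFrame, with no I/O, which is what makes it easy to unit test.
object AddColumn {
  def withGreeting(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit("hello"))
}

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// SharedSparkContext supplies a local-mode SparkContext (`sc`) for the
// duration of the suite, so no cluster or Hadoop installation is needed.
class AddColumnTest extends AnyFunSuite with SharedSparkContext {
  test("withGreeting adds a greeting column") {
    val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
    import spark.implicits._

    val input  = Seq("a", "b").toDF("id")
    val result = AddColumn.withGreeting(input)

    assert(result.columns.contains("greeting"))
    assert(result.count() === 2)
  }
}
```

Because the transformation is a pure function, the assertions only need to compare input and output DataFrames; nothing external has to be mocked.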
The tests can be run through sbt or through Eclipse with the Scala IDE plugin.
To get up and running from a clean Eclipse environment (assuming you already have Eclipse and a JDK installed):
- Install Scala
- Install sbt
- Install Scala IDE for Eclipse (this should include ScalaTest support)
- Run `sbt eclipse` to generate the Eclipse project files. The project already includes a plugins.sbt file with a dependency on the sbteclipse plugin, so it is picked up automatically the first time.
- Run `sbt test` to run the tests, or run them through Eclipse with "Run As…" (Ctrl-F11)
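The build configuration behind those steps might look roughly like this (a sketch, not the project's actual files; the plugin and library versions shown are illustrative, and I am assuming the SharedSparkContext trait comes from spark-testing-base):

```scala
// project/plugins.sbt -- enables the `sbt eclipse` task that generates
// Eclipse project files
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")

// build.sbt -- Spark plus the test-only dependencies (versions illustrative)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % "2.4.8"        % Provided,
  "com.holdenkarau"  %% "spark-testing-base" % "2.4.8_0.14.0" % Test,
  "org.scalatest"    %% "scalatest"          % "3.2.9"        % Test
)
```

Marking Spark as `Provided` keeps it off the runtime classpath of any packaged artifact, while the `Test` scope keeps the testing libraries out of production builds entirely.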