Tuesday, 5 January 2016

Unit Testing Spark Job

During my time writing lots of mapreduce code and building distributed applications over hadoop framework, one thing i understood that it needs a new unit testing approach than standard java application testing one. As mapreduce code is executed in distributed hadoop framework, we can't do both functional and performance testing as part of unit testing. So, functional testing can be done by using MRUnit framework in which we can simulate working of mapper/reducer using mocked input data. But for performance testing of mapreduce job/workflow, we still have to run mapreduce job with different configuration and input data loads to get the best fit of all of them.

Sometime back when i was a developing distributed application using spark, i applied same testing approach that I described earlier for Hadoop applications. I am not sure if it is by intent by spark developers but unit testing a spark job is much more obvious and easier to think & implement than mapreduce job.

Spark job can be unit tested using following "local" deploy modes.

Deploy Mode
Description
local
Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]
Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]
Run Spark locally with as many worker threads as logical cores on your machine.

As first deploy mode, "local" run with one worker thread, there will be 1 executor only. It can be used to do functional unit testing.
After doing functional testing, we can run the code with second deploy mode, "local[K]". For example, if we run with local[2], meaning two worker threads - which represents “minimal” parallelism. It can help in detecting bugs that only exist when we run in a distributed context.

Another major usability factor is the ease of testing the code with these deploy modes. In contrast to MRUnit which is a third party dependency and needs writing lot of boilerplate code, spark deploy modes are sweet and simple.
In Junit testcase for that spark job, you can use "setMaster" method of sparkConf object to set deploy mode like below:

SparkConf sparkConf = new SparkConf().setAppName("Testcase-1");
sparkConf.setMaster("local[2]");


For performance testing, the process is similar to one adopted in Hadoop. I ran the spark job with different values of configuration properties and data loads to get the best fit.


1 comment: