Getting Started with Spark (part 4) - Unit Testing

#apachespark #python

Quite a while ago (we are already counting in years), I published a tutorial series to help people get started with Spark. Here is an outline of the previous posts:

Part 1 Getting Started - covers the basics of Spark's distributed architecture, along with its data structures (including the good old RDD collections (!), whose use has largely been superseded by DataFrames)
Part 2 intro to DataFrames
Part 3 intro to UDFs and Window Functions

In the meantime, Spark has not lost any of its popularity, so I thought I would continue updating the same series. In this post we cover an essential part of any ETL project, namely unit testing.

For that I created a sample repository, which is meant to serve as boilerplate code for any new Python Spark project.

Let us browse through the main job script. Please note that the repository might contain an updated version, which may differ in its details from the gist below.

The gist above reuses the example from the previous post on UDFs and Window Functions.

Here is an example of how we could test our "amount_spent_udf" function:

Now note the first line of the unit test script, which is the secret sauce for loading a Spark context in your unit tests. Below is the code that creates the "spark_session" object passed as an argument to the "test_amount_spent_udf" function.
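As a hedged sketch, a `conftest.py` along these lines could provide that fixture. The helper `build_spark_session`, the `local[2]` master and the config values are assumptions chosen for illustration, not necessarily what the repository uses.

```python
import pytest


def build_spark_session(app_name="unit-tests"):
    """Create a small local SparkSession (hypothetical helper).

    pyspark is imported lazily, so merely importing this module
    does not require a Spark installation.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[2]")                           # run in-process on 2 threads
        .appName(app_name)
        .config("spark.sql.shuffle.partitions", "1")  # tiny test data, few partitions
        .config("spark.ui.enabled", "false")          # no web UI needed in tests
        .getOrCreate()
    )


@pytest.fixture(scope="session")
def spark_session():
    """Session-scoped fixture: one SparkSession shared by all tests."""
    spark = build_spark_session()
    yield spark
    # Teardown: release Spark resources once the test run ends.
    spark.stop()
```

Scoping the fixture to the session means Spark starts only once per test run, which is usually the slowest part of a PySpark test suite.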

We strongly encourage you to have a look at the corresponding git repository, where we give detailed instructions on how to run it locally.

And that is it for today, hope it helped!