Today we'd like to share a small utility package for testing Dataframes on PySpark and Pandas.
If you are a fan of test-driven development and have had a chance to work on PySpark (or Pandas) projects, you've probably written tests similar to this one:
```python
from datetime import datetime
from pyspark_test import assert_pyspark_df_equal
from your_module import calculate_result


def test_event_aggregation(spark):
    schema = ["user_id", "event_type", "item_id", "event_time", "country", "dt"]
    expected_df = spark.createDataFrame(
        [
            (123456, 'page_view', None, datetime(2017, 12, 31, 23, 50, 50), "uk", "2017-12-31"),
            (123456, 'item_view', 68471513, datetime(2017, 12, 31, 23, 50, 55), "uk", "2017-12-31"),
        ],
        schema,
    )

    result_df = calculate_result()

    assert_pyspark_df_equal(expected_df, result_df)
```
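The test above takes a `spark` pytest fixture as an argument. A minimal sketch of such a fixture might look like the one below; the fixture name, session scope, and local master settings are assumptions for illustration, not part of any package shown here.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared by all tests in the session
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```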
This works fine for small applications, but as your project grows, the data gets more complicated and the number of tests increases, you may want a less tedious way to define test data.
Exacaster alumnus Vaidas Armonas came up with the idea of representing Spark DataFrames as markdown tables. This idea materialized into the testing package markdown-frames. With this package, the test shown above can be rewritten like this:
```python
from pyspark_test import assert_pyspark_df_equal
from markdown_frames.spark_dataframe import spark_df
from your_module import calculate_result


def test_event_aggregation(spark):
    input_data = """
        | user_id | event_type | item_id  | event_time          | country | dt         |
        | bigint  | string     | bigint   | timestamp           | string  | string     |
        | ------- | ---------- | -------- | ------------------- | ------- | ---------- |
        | 123456  | page_view  | None     | 2017-12-31 23:50:50 | uk      | 2017-12-31 |
        | 123456  | item_view  | 68471513 | 2017-12-31 23:50:55 | uk      | 2017-12-31 |
    """
    expected_df = spark_df(input_data, spark)

    result_df = calculate_result()

    assert_pyspark_df_equal(expected_df, result_df)
```
This makes tests more readable and self-explanatory: the first row of the markdown table declares the column names, and the second row declares their types.
Everything looks almost the same when you need to build a DataFrame for Pandas; you just use a different function:
```python
from markdown_frames.pandas_dataframe import pandas_df
```
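As a sketch, a Pandas variant of the same test might look like the following. This assumes that `pandas_df` takes only the markdown string (no Spark session) and that `calculate_result` returns a Pandas DataFrame; the comparison uses `pandas.testing.assert_frame_equal` instead of the PySpark helper.

```python
import pandas as pd
from markdown_frames.pandas_dataframe import pandas_df
from your_module import calculate_result


def test_event_aggregation_pandas():
    input_data = """
        | user_id | event_type | item_id  | event_time          | country | dt         |
        | bigint  | string     | bigint   | timestamp           | string  | string     |
        | ------- | ---------- | -------- | ------------------- | ------- | ---------- |
        | 123456  | page_view  | None     | 2017-12-31 23:50:50 | uk      | 2017-12-31 |
        | 123456  | item_view  | 68471513 | 2017-12-31 23:50:55 | uk      | 2017-12-31 |
    """
    # Assumes pandas_df accepts the markdown string and returns a pandas.DataFrame
    expected_df = pandas_df(input_data)

    result_df = calculate_result()

    pd.testing.assert_frame_equal(expected_df, result_df)
```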
Share in the comments if you know any other convenient tips & tricks for writing PySpark (and Pandas) unit tests.