Unit Testing In Databricks Notebook

One of the most crucial yet most ignored parts of software/application development is unit testing, particularly among Data Engineers who are responsible for building business-critical data pipelines. Unit testing improves the codebase, making it more maintainable and less error-prone. Having said that, across the interviews I have taken over the last couple of years, particularly for projects involving Databricks, I have rarely come across anyone who has ever created a unit testing notebook. In fact, many are not even aware of how to achieve it.

This article will walk you through an approach you can adopt to perform unit testing and improve the overall effectiveness of your data pipeline. We will use Python's unittest framework, although other popular testing frameworks exist as well.

The screenshot below shows a very straightforward transformation logic that ingests data using the Spark API and performs a basic groupBy to calculate the total number of students belonging to each age and city.

[Screenshot: Basic Group By Transformation]

The purpose of the first function, load_data(), is to read data from a file whose path is passed to it as the second argument. The first argument is supposed to be a SparkSession object, which is used to access Spark's native DataFrame APIs.

The second function, xform_data(), takes ownership of transforming the data: it first groups the data by the age and city columns and then counts the total number of students in each of the resulting groups.
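
Since the code itself lives in the screenshot, here is a minimal sketch of what the Transformation notebook could look like. The function names load_data() and xform_data() and the grouping columns come from the article; the CSV reader options and the output column name student_count are my assumptions.

    # Transformation notebook (sketch) -- function names follow the article,
    # reader options and the output column name are assumptions
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql.functions import count

    def load_data(spark: SparkSession, path: str) -> DataFrame:
        # Read the source CSV file into a DataFrame
        return (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path))

    def xform_data(df: DataFrame) -> DataFrame:
        # Group by age and city and count the students in each group
        return df.groupBy("age", "city").agg(count("*").alias("student_count"))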

The dataset that we'll use to see how the functions work is as below.

[Screenshot: Source Data]

Now, we move on to creating the main notebook, which will use the above Transformation notebook's functions to orchestrate the data pipeline.

[Screenshot: Main Notebook]

In the Main Notebook, the first cell uses a built-in Databricks magic command (%run), which executes a Python file or another notebook. Consider it analogous to Python's import keyword, which is used to include another module in your code. The command is followed by the path of the ./Transformation notebook.
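
For example, if the Transformation notebook sits in the same folder as the Main Notebook, the first cell would contain just the magic command and the relative path:

    # Cell 1: include the functions defined in the Transformation notebook
    %run ./Transformation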

In the second cell, we import some built-in PySpark functions along with SparkSession, which happens to be the entry point to Spark programming.

NOTE - In Databricks, as soon as a cluster is spun up, a SparkSession object is created automatically and made available as the variable spark.

The third cell is where we orchestrate the functions. The first function takes the path of the file as an argument and calls the Spark API to read the CSV file. The resulting DataFrame is then passed as the argument to the xform_data() function, which returns the transformed DataFrame.
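
Assuming the sketch of the Transformation notebook shown earlier, cells two and three could look roughly like this; the source file path and the imported function col are placeholders, and spark is the session Databricks provides out of the box:

    # Cell 2: imports -- in Databricks the SparkSession is already available as `spark`
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Cell 3: orchestrate the pipeline using the Transformation notebook's functions
    source_path = "/path/to/source_data.csv"    # placeholder path
    raw_df = load_data(spark, source_path)
    result_df = xform_data(raw_df)
    display(result_df)                          # Databricks helper to render the result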

Moving ahead, we now have the code ready and it's time to perform unit testing. The test dataset that we'll use here is shown below.

[Screenshot: Test Dataset]

We'll now create the Unit Test Notebook, which will use Python's unittest framework.

[Screenshots: Unit Test Notebook]

The first cell imports the Transformation notebook and its associated functions, as explained above. The second cell is where the real work happens, which I will elaborate on below.

We start the code by importing the unittest framework, the required built-in PySpark functions, and the SparkSession object. Next, we create a class named "TransformationTestCase" which inherits from unittest.TestCase. This is the base class for creating our own test cases and enables us to run multiple tests at once.

Next, the "setUpClass()" class method is a method which runs before all our test methods. Here, we use SparkSession to create an object which will be used to utilize built-in Spark DataFrame APIs. Please note always use the '@classmethod' decorator to define this.

The following two methods are where we do the actual tests. The "test_datafile_loading()" method takes no arguments apart from self. It reads the test data into a DataFrame and runs a basic count over all the records. The result is then passed to 'assertEqual()' and compared against the expected value, which happens to be 8 in our case. The outcome of this assertion determines whether the unit test passes or fails.

The second method, "test_city_count()", reads the test data into a DataFrame using the load_data() function and then applies the xform_data() function to fetch the age- and city-wise count of students. The resulting DataFrame is then collected onto the driver node as a list using the collect() function. We use collect() so that we can apply native Python list operations on data that would otherwise remain a DataFrame distributed across multiple worker nodes.

We then define an empty dictionary, count_dict, which will hold each city as a key and the count of students as its value. Next, we again use the "assertEqual()" function to compare the expected counts with the actual values; the test passes or fails depending on the result.
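
Putting the pieces together, the test cell could look roughly like the sketch below. The class and method names, the record count of 8, and the use of setUpClass() follow the article; the test file path, the student_count column name, and the expected city counts are placeholders you would replace with values from your own test dataset.

    # Unit Test notebook (sketch) -- assumes load_data() and xform_data() are
    # already available via %run ./Transformation in an earlier cell
    import unittest
    from pyspark.sql import SparkSession

    class TransformationTestCase(unittest.TestCase):

        @classmethod
        def setUpClass(cls):
            # Reuse the active session (in Databricks this is the same as `spark`)
            cls.spark = SparkSession.builder.getOrCreate()
            cls.test_path = "/path/to/test_data.csv"   # placeholder path

        def test_datafile_loading(self):
            df = load_data(self.spark, self.test_path)
            self.assertEqual(df.count(), 8)            # the test dataset has 8 records

        def test_city_count(self):
            df = xform_data(load_data(self.spark, self.test_path))
            rows = df.collect()                        # small result, safe on the driver

            # Build {city: total number of students} from the transformed rows
            count_dict = {}
            for row in rows:
                count_dict[row["city"]] = count_dict.get(row["city"], 0) + row["student_count"]

            # Placeholder expectation -- fill in the counts from your test dataset
            expected_counts = {"CityA": 3, "CityB": 5}
            self.assertEqual(count_dict, expected_counts)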

[Screenshot: Suite Function - Aggregates all test functions]

One of the advantages other IDEs have over Databricks is a dedicated Run button that triggers the '__main__' block and subsequently runs the whole test module. In Databricks, to achieve the same result we need to create a suite() function that aggregates all the test methods and is executed as shown above. Please refer to this link to further consolidate your understanding of this.
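
A minimal sketch of such a suite() function and its execution, assuming the two test methods above, could look like this:

    def suite():
        # Aggregate the individual test methods into a single test suite
        s = unittest.TestSuite()
        s.addTest(TransformationTestCase("test_datafile_loading"))
        s.addTest(TransformationTestCase("test_city_count"))
        return s

    # Run the suite; verbosity=2 prints one line per test with its status
    runner = unittest.TextTestRunner(verbosity=2)
    runner.run(suite())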

Now, when we execute the suite() function, we see the results of the two tests that we defined in the TransformationTestCase class. Had there been a failure, the output would have shown the reasons.

I hope this small demo has given you a good idea of how we can perform unit testing in Databricks.

Thanks for reading!! :) :)
