Unit Testing In Databricks Notebook

One of the most crucial yet most ignored parts of software/application development is unit testing, particularly among Data Engineers who are responsible for building business-critical data pipelines. Unit testing improves the codebase, making it more maintainable and less error-prone. Having said that, across the interviews I have taken over the last couple of years, particularly for projects involving Databricks, I have rarely come across anyone who has ever created a unit testing notebook. In fact, many are not even aware of how to achieve it.

This article will walk you through an approach you can adopt to perform unit testing and improve the overall effectiveness of your data pipeline. We will use Python's unittest framework, although other popular testing frameworks exist as well.

The screenshot below shows a very straightforward transformation logic that ingests data using the Spark API and performs a basic groupBy to calculate the total number of students belonging to each age and city.

[Screenshot: Basic Group By Transformation]

The purpose of the first function, load_data(), is to read data from a file whose path is passed to it as the second argument. The first argument is supposed to be a SparkSession object, which is used to access Spark's native DataFrame APIs.

The second function, xform_data(), takes ownership of transforming the data: it first groups the data by the age and city columns and then counts the total number of students in each of the resulting groups.
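
Since the code itself lives in the screenshot, here is a minimal sketch of what the Transformation notebook could look like. The function names load_data() and xform_data() and the grouping columns come from the article; the CSV reader options and the output column name student_count are my assumptions.

    # Transformation notebook (sketch) -- function names follow the article,
    # reader options and the output column name are assumptions
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql.functions import count

    def load_data(spark: SparkSession, path: str) -> DataFrame:
        # Read the source CSV file into a DataFrame
        return (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path))

    def xform_data(df: DataFrame) -> DataFrame:
        # Group by age and city and count the students in each group
        return df.groupBy("age", "city").agg(count("*").alias("student_count"))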

The dataset that we'll use to see how the functions work is as below.

[Screenshot: Source Data]

Now, we move on to creating the main notebook, which will use the above Transformation notebook's functions to orchestrate the data pipeline.

[Screenshot: Main Notebook]

In the Main Notebook, the first cell uses a built-in Databricks magic command (%run), which executes a Python file or another notebook. Consider it analogous to Python's import keyword, which is used to include another module in your code. The command is followed by the path of the ./Transformation notebook.
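
For example, if the Transformation notebook sits in the same folder as the Main Notebook, the first cell would contain just the magic command and the relative path:

    # Cell 1: include the functions defined in the Transformation notebook
    %run ./Transformation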

In the second cell, we import some built-in PySpark functions along with SparkSession, which happens to be the entry point to Spark programming.

NOTE - In Databricks, as soon as a cluster is spun up, a SparkSession object is created automatically and made available as the variable spark.

The third cell is where we orchestrate the functions. The first function takes the path of the file as an argument and calls the Spark API to read the CSV file. The resulting DataFrame is then passed as the argument to the xform_data() function, which returns the transformed DataFrame.
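
Assuming the sketch of the Transformation notebook shown earlier, cells two and three could look roughly like this; the source file path and the imported function col are placeholders, and spark is the session Databricks provides out of the box:

    # Cell 2: imports -- in Databricks the SparkSession is already available as `spark`
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Cell 3: orchestrate the pipeline using the Transformation notebook's functions
    source_path = "/path/to/source_data.csv"    # placeholder path
    raw_df = load_data(spark, source_path)
    result_df = xform_data(raw_df)
    display(result_df)                          # Databricks helper to render the result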

Moving ahead, we now have the code ready and it's time to perform unit testing. The test dataset that we'll use here is shown below.

[Screenshot: Test Dataset]

We'll now create the Unit Test Notebook, which will use Python's unittest framework.

[Screenshots: Unit Test Notebook]

The first cell imports the Transformation notebook and its associated functions, as explained above. The second cell is where the real work happens, which I will elaborate on below.

We start the code by importing the unittest framework, the required built-in PySpark functions, and the SparkSession object. Next, we create a class named "TransformationTestCase" which inherits from unittest.TestCase. This is the base class for creating our own test cases and enables us to run multiple tests at once.

Next, the "setUpClass()" class method is a method which runs before all our test methods. Here, we use SparkSession to create an object which will be used to utilize built-in Spark DataFrame APIs. Please note always use the '@classmethod' decorator to define this.

The following two methods are where we do the actual tests. The "test_datafile_loading()" method takes no arguments apart from self. It reads the test data into a DataFrame and runs a basic count over all the records. The result is then passed to 'assertEqual()' and compared against the expected value, which happens to be 8 in our case. The outcome of this assertion determines whether the unit test passes or fails.

The second method, "test_city_count()", reads the test data into a DataFrame using the load_data() function and then applies the xform_data() function to fetch the age- and city-wise count of students. The resulting DataFrame is then collected onto the driver node as a list using the collect() function. We use collect() so that we can apply native Python list operations on data that would otherwise remain a DataFrame distributed across multiple worker nodes.

We then define an empty dictionary, count_dict, which will hold each city as a key and the count of students as its value. Next, we again use the "assertEqual()" function to compare the expected counts with the actual values; the test passes or fails depending on the result.
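
Putting the pieces together, the test cell could look roughly like the sketch below. The class and method names, the record count of 8, and the use of setUpClass() follow the article; the test file path, the student_count column name, and the expected city counts are placeholders you would replace with values from your own test dataset.

    # Unit Test notebook (sketch) -- assumes load_data() and xform_data() are
    # already available via %run ./Transformation in an earlier cell
    import unittest
    from pyspark.sql import SparkSession

    class TransformationTestCase(unittest.TestCase):

        @classmethod
        def setUpClass(cls):
            # Reuse the active session (in Databricks this is the same as `spark`)
            cls.spark = SparkSession.builder.getOrCreate()
            cls.test_path = "/path/to/test_data.csv"   # placeholder path

        def test_datafile_loading(self):
            df = load_data(self.spark, self.test_path)
            self.assertEqual(df.count(), 8)            # the test dataset has 8 records

        def test_city_count(self):
            df = xform_data(load_data(self.spark, self.test_path))
            rows = df.collect()                        # small result, safe on the driver

            # Build {city: total number of students} from the transformed rows
            count_dict = {}
            for row in rows:
                count_dict[row["city"]] = count_dict.get(row["city"], 0) + row["student_count"]

            # Placeholder expectation -- fill in the counts from your test dataset
            expected_counts = {"CityA": 3, "CityB": 5}
            self.assertEqual(count_dict, expected_counts)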

[Screenshot: Suite Function - Aggregates all test functions]

One of the advantages other IDEs have over Databricks is a dedicated Run button that triggers the '__main__' block and subsequently runs the whole test module. In Databricks, to achieve the same result we need to create a suite() function that aggregates all the test methods and is executed as shown above. Please refer to this link to further consolidate your understanding of this.
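
A minimal sketch of such a suite() function and its execution, assuming the two test methods above, could look like this:

    def suite():
        # Aggregate the individual test methods into a single test suite
        s = unittest.TestSuite()
        s.addTest(TransformationTestCase("test_datafile_loading"))
        s.addTest(TransformationTestCase("test_city_count"))
        return s

    # Run the suite; verbosity=2 prints one line per test with its status
    runner = unittest.TextTestRunner(verbosity=2)
    runner.run(suite())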

Now, when we execute the suite() function, we see the results of the two tests that we defined in the TransformationTestCase class. Had there been a failure, the output would have shown the reasons.

I hope this small demo has given you a good idea of how we can perform unit testing in Databricks.

Thanks for reading!! :) :)
