Azure Synapse Spark Notebook – Unit Testing
Author(s): Arun Sethia, Program Manager on the Azure Synapse Customer Success Engineering (CSE) team.
Introduction
In this blog post, we will cover how to create unit test cases for Spark jobs developed using a Synapse Notebook. This is an extension of my previous blog, Synapse - Choosing Between Spark Notebook vs Spark Job Definition. Unit testing is an automated approach developers use to test individual, self-contained units of code. By verifying code behavior early, it helps streamline coding practices for larger systems.
For a Spark job definition, developers usually write the code in their preferred IDE and deploy the compiled package binaries through the Spark job definition. In addition, developers can use their unit test framework of choice (ScalaTest, pytest, etc.) to create test cases as part of the project codebase.
This blog focuses on writing unit test cases for a Notebook so that you can test your code before rolling it out to higher environments. The common programming languages used by Synapse Notebooks are Python and Scala. Both languages follow functional and object-oriented programming paradigms.
I will refrain from diving deep into choosing the best programming paradigm for Spark programming; maybe some other day we will pick up this topic.
Code organization
Enterprise systems should be modular, maintainable, configurable, and easy to test, in addition to being scalable and performant. In this blog, our focus will be on creating unit test cases for the Synapse Notebook in a modular and maintainable way.
Azure Synapse lets us organize Notebook code in multiple ways using the various configurations it provides.
- External Library - Libraries provide reusability and modularity to your application and help you share business functions and enterprise code across multiple applications. Azure Synapse allows you to configure dependencies using library management, and a Notebook can leverage the installed packages within its jobs. We should avoid writing unit test cases for such an installed library inside the Notebook; plenty of test frameworks are available to create unit tests for those libraries within (or alongside) the library source code. The Notebook then leverages APIs from the installed libraries to orchestrate the business process.
- Functions and unit tests in different Notebooks – Azure Synapse allows you to run/load a Notebook from another Notebook. Given that, you can create reusable code in one Notebook and write test cases in another. Using continuous integration and source control, you can manage versions and releases.
- Functions and unit tests in the same Notebook – The difference from the earlier approach is that the functions and their test cases are part of the same Notebook. You can still use continuous integration and source control for versioning and releases (see the sketch after this list).
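To illustrate the third approach, here is a minimal sketch of a single Notebook that holds both a function and its assert-based test cases in separate cells (the function name is hypothetical, not taken from the sample project):

```scala
// Cell 1: a business function defined directly in the Notebook
def toUpperTrimmed(value: String): String =
  Option(value).map(_.trim.toUpperCase).getOrElse("")

// Cell 2: unit test cases in the same Notebook, using plain assertions
assert(toUpperTrimmed("  synapse ") == "SYNAPSE")
assert(toUpperTrimmed(null) == "")
println("All Notebook tests passed")
```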
Unit test examples
This blog covers example code for Approach #1 (external library) and Approach #2 (functions and unit tests in different Notebooks). The example code is written in Scala; in the future, we will also add code for PySpark.
Example project code is available on GitHub; you can clone the repository to your local computer.
- The businessfunctions folder has the code for the external library (package) approach. The source code of the business function APIs and the unit test cases for those functions are part of the same module.
- The notebook folder has the various Synapse Notebooks used for these examples.
Unit test with external library
As we described earlier in this blog, this approach does not require us to write any unit test cases inside the Notebook. Instead, the library source code and unit test cases coexist outside the Notebook.
You can execute the test cases and build the package library using the Maven command. The target folder will contain the business functions jar, and the command console will show the executed test cases (alternatively, you can download the pre-built jar).
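As a hedged sketch of what such a library and its test might look like (the package, object, and function names here are illustrative; the actual sample project may differ), a Maven project using ScalaTest could contain:

```scala
// src/main/scala/com/example/business/BusinessFunctions.scala (illustrative)
package com.example.business

object BusinessFunctions {
  // Example business rule: net price after a percentage discount
  def netPrice(gross: Double, discountPct: Double): Double =
    gross * (1.0 - discountPct / 100.0)
}

// src/test/scala/com/example/business/BusinessFunctionsSpec.scala
// (shown in the same listing for brevity)
import org.scalatest.funsuite.AnyFunSuite

class BusinessFunctionsSpec extends AnyFunSuite {
  test("netPrice applies the discount percentage") {
    // Tolerance-based comparison avoids floating-point equality pitfalls
    assert(math.abs(BusinessFunctions.netPrice(100.0, 10.0) - 90.0) < 1e-9)
  }
}
```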
Using your Synapse workspace studio, you can add the business functions library.
Within Azure Synapse, an Apache Spark pool can leverage custom libraries that are uploaded as Workspace Packages.
Using the Spark pool, the Notebook can call business library functions to build business processes. The source code of this Notebook is available on GitHub.
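Inside the Notebook, calling the installed library is then a plain import and function call; a minimal sketch, reusing the hypothetical package from the previous listing:

```scala
// The jar uploaded as a Workspace Package is on the Spark pool's classpath
import com.example.business.BusinessFunctions

// Orchestrate a business step with the library function
val net = BusinessFunctions.netPrice(gross = 250.0, discountPct = 15.0)
println(s"Net price after discount: $net")
```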
Unit test - functions and unit tests in separate Notebooks
As we described earlier in this blog, this approach requires a minimum of two Notebooks: one for the business functions and another for the unit test cases. The example code is available inside the notebook folder of the cloned repository.
The business functions are available inside the BusinessFunctionsLibrary Notebook, and the respective test cases are in the UnitTestBusinessFunctionsLibrary Notebook.
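In outline, the test Notebook first loads the functions Notebook with the %run magic and then exercises the loaded definitions with assertions; a hedged sketch (the function name is illustrative, not the actual content of the sample Notebooks):

```scala
// Cell 1 of UnitTestBusinessFunctionsLibrary (a magic command in its own cell):
//   %run BusinessFunctionsLibrary

// Cell 2: definitions from BusinessFunctionsLibrary are now in scope,
// so plain assertions serve as the unit test cases
assert(toUpperTrimmed("  demo ") == "DEMO")
assert(toUpperTrimmed(null) == "")
println("UnitTestBusinessFunctionsLibrary: all tests passed")
```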
Summary
Choosing between the multiple-Notebook and library approaches depends on your enterprise guidelines, individual preference, and timelines.
My next blog will explore code and data quality in Azure Synapse.