Azure Synapse Spark Notebook – Unit Testing

 

Author: Arun Sethia is a Program Manager on the Azure Synapse Customer Success Engineering (CSE) team.

 

Introduction

In this blog post, we will cover how to create and run unit test cases for Spark jobs developed using a Synapse Notebook. This is an extension of my previous blog, Synapse - Choosing Between Spark Notebook vs Spark Job Definition, where we discussed selecting between a Spark Notebook and a Spark Job Definition. Unit testing is an automated approach that developers use to test individual, self-contained units of code. By verifying code behavior early, it helps catch defects before they spread into larger systems.

 

For a Spark Job Definition, a developer usually writes the code in their preferred IDE and deploys the compiled package binaries using the Spark Job Definition. In addition, developers can use the unit test framework of their choice (ScalaTest, pytest, etc.) to create test cases as part of the project codebase.

 

This blog focuses on writing unit test cases for a Notebook so that you can test the Notebook code before you roll it out to higher environments. The programming languages most commonly used in Synapse Notebooks are Python and Scala. Both languages support the functional and object-oriented programming paradigms.

 

I will not go into which programming paradigm is best for Spark programming; we may pick up that topic another day.

 

Code organization

Enterprise systems should be modular, maintainable, configurable, and easy to test, in addition to being scalable and performant. In this blog, our focus is on creating unit test cases for a Synapse Notebook in a modular and maintainable way.

 

Azure Synapse provides several configuration options, which let us organize the Notebook code in multiple ways:

 

  • External Library - Libraries provide reusability and modularity to your application. They also help to share business functions and enterprise code across multiple applications. Azure Synapse allows you to configure dependencies using library management, and a Notebook can use the installed packages within its jobs. We should avoid writing unit test cases for such an installed library inside the Notebook. A fair number of test frameworks are available to create unit tests for those libraries within the library source code (or outside it). The Notebook uses the APIs from the installed libraries to orchestrate the business process.

Pros

 

  • It ensures that developers follow the enterprise's business guidelines (for example, how to compute the net amount of a retail order or how to validate data such as a phone number) and software engineering best practices (code coverage, styling, etc.).
  • It is easy to integrate a unit test framework, either as part of the library or otherwise outside the Notebook.
  • You can use the same library outside of the Notebook as well (for example, in a Spark Job Definition).
  • It is easy to integrate quality plugins such as code coverage, linters, and code style checks into the IDE and the build process.

Cons

  • This approach requires ongoing library version management and enterprise governance.
  • Additional build tools (maven/sbt/gradle/setuptools) are required.
  • A local development environment must be set up.

 

  • Functions and unit tests in different Notebooks – Azure Synapse allows you to run/load a Notebook from another Notebook. Given that, you can put the reusable code in one Notebook and write the test cases in another Notebook. Using continuous integration and source control, you can control versions and releases.

 

Pros

 

  • Easy to develop using Notebook without any additional build tools.
  • Quick and easy to integrate with other Notebooks.
  • You don’t need a desktop IDE (integrated development environment); Synapse Notebooks are integrated with the Monaco editor, which brings IDE-style IntelliSense to the cell editor.

Cons

  • It restricts code reuse to Notebooks; you can’t use code written in a Notebook outside of a Notebook (for example, in a Spark Job Definition).
  • It is difficult to maintain as the number of Notebooks grows because an additional Test case Notebook is needed for each business function.
  • You are restricted to testing via Notebook only.
  • No direct support for linter, code coverage, styling, etc.

 

  • Functions and unit tests in the same Notebook – The only difference from the earlier approach is that the functions and their test cases are part of the same Notebook. You can still use continuous integration and source control for versioning and releases.

 

Pros

 

  • Easy to develop using Notebook without any additional build tools.
  • Easier to maintain than the earlier approach (functions and unit tests in different Notebooks), because there are fewer Notebooks.
  • Easy to cross-reference test cases and business functions within a single Notebook.
  • You don’t need a desktop IDE (integrated development environment); Synapse Notebooks are integrated with the Monaco editor, which brings IDE-style IntelliSense to the cell editor.

Cons

  • It restricts code reuse to Notebooks; you can’t use code written in a Notebook outside of a Notebook (for example, in a Spark Job Definition).
  • You are restricted to testing via Notebook only.
  • Additional code is needed to skip the test cases in production (for example, commenting them out or using a custom annotation).
  • No direct support for linter, code coverage, styling, etc.
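
One possible way to handle the "skip test cases in production" point is a switch that decides whether the test cell does anything. The following is a minimal Python sketch, not code from the example repository (whose Notebooks are written in Scala). The flag name RUN_UNIT_TESTS, the function normalize_country, and the helper names are all hypothetical; in a Notebook, the flag could equally come from a parameter cell instead of an environment variable.

```python
import os
import unittest


def tests_enabled(env=os.environ) -> bool:
    """Read a hypothetical RUN_UNIT_TESTS flag; tests are off unless it is 'true'."""
    return env.get("RUN_UNIT_TESTS", "false").strip().lower() == "true"


# Hypothetical business function living in the same Notebook as its tests.
def normalize_country(code: str) -> str:
    """Trim whitespace and upper-case a country code."""
    return code.strip().upper()


class NormalizeCountryTest(unittest.TestCase):
    def test_trims_and_uppercases(self):
        self.assertEqual(normalize_country(" us "), "US")

    def test_already_normalized_is_unchanged(self):
        self.assertEqual(normalize_country("IN"), "IN")


def run_tests(enabled: bool):
    """Run the test cases only when enabled; return None when they are skipped."""
    if not enabled:
        return None
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeCountryTest)
    # TextTestRunner reports results without exiting the interpreter,
    # so later Notebook cells can still run.
    return unittest.TextTestRunner(verbosity=2).run(suite)


# Last cell of the Notebook: a no-op in environments where the flag is not set.
result = run_tests(tests_enabled())
```

With this layout, the production run leaves the flag unset and the test cell returns immediately, while a development or CI run sets the flag and gets a normal test report.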

 

Unit test examples

This blog covers example code for the first two approaches (an external library, and functions and unit tests in different Notebooks). The example code is written in Scala; in the future, we will also add code for PySpark.

 

The example project code is available on GitHub; you can clone the repository to your local computer.

  • The businessfunctions folder has the code for the external library (package) approach. The source code of the business function APIs and the unit test cases for these functions are part of the same module.
  • The Notebook folder has the Synapse Notebooks used for these examples.

 

Unit test with external library

As described earlier in this blog, this approach does not require us to write any unit test cases as part of the Notebook. Instead, the library source code and its unit test cases coexist outside the Notebook.
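
To make the idea concrete, here is a minimal Python sketch of a library module whose tests live next to it rather than in a Notebook. It is not taken from the example repository (which uses Scala and maven); the function names net_amount and is_valid_phone, and the rules they implement, are assumptions chosen to mirror the "net amount" and "phone number" examples mentioned earlier.

```python
import re
import unittest
from decimal import Decimal


# Hypothetical business functions that a shared library might expose.
def net_amount(quantity: int, unit_price: Decimal, discount_rate: Decimal) -> Decimal:
    """Return quantity * unit_price reduced by discount_rate (a fraction from 0 to 1)."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    if not (Decimal("0") <= discount_rate <= Decimal("1")):
        raise ValueError("discount_rate must be between 0 and 1")
    gross = unit_price * quantity
    return gross * (Decimal("1") - discount_rate)


def is_valid_phone(number: str) -> bool:
    """Assumed rule: exactly ten digits once spaces, dashes, and parentheses are removed."""
    digits = re.sub(r"[\s\-()]", "", number)
    return re.fullmatch(r"\d{10}", digits) is not None


# Test cases kept in the library project, outside any Notebook.
class BusinessFunctionsTest(unittest.TestCase):
    def test_net_amount_applies_discount(self):
        self.assertEqual(
            net_amount(2, Decimal("10.00"), Decimal("0.25")), Decimal("15.00")
        )

    def test_net_amount_rejects_negative_quantity(self):
        with self.assertRaises(ValueError):
            net_amount(-1, Decimal("10.00"), Decimal("0"))

    def test_phone_with_punctuation_is_valid(self):
        self.assertTrue(is_valid_phone("(425) 555-0100"))

    def test_short_phone_is_invalid(self):
        self.assertFalse(is_valid_phone("555-0100"))
```

In a Python project, tests like these would typically run during the package build (for example, with python -m unittest or pytest), so the Notebook only ever sees a library that has already passed its tests.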

 

You can execute the test cases and build the library package using the maven command. The target folder will contain the business function JAR, and the command console will show the executed test cases. Alternatively, you can download the pre-built JAR.

 

 

Using Synapse Studio for your workspace, you can add the business function library.

 

 

Within Azure Synapse, an Apache Spark pool can leverage custom libraries that are uploaded as Workspace Packages.

 

 

Once the library is attached to the Spark pool, a Notebook running on that pool can use the business library functions to build business processes. The source code of this Notebook is available on GitHub.

 

 

 

Unit test - functions and unit test Notebook

As described earlier in this blog, this approach requires a minimum of two Notebooks: one for the business functions and one for the unit test cases. The example code is available inside the notebook folder of the cloned repository.

 

The business functions are in the BusinessFunctionsLibrary Notebook, and the corresponding test cases are in the UnitTestBusinessFunctionsLibrary Notebook.
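
The two Notebooks in the repository are written in Scala. As a rough illustration of the same pattern, here is a Python sketch of what a test Notebook could look like, laid out as cells. The function order_total and the helper run_notebook_tests are hypothetical, and the %run line is shown only as a comment, assuming that the functions Notebook is loaded that way; a stand-in definition is inlined so that the sketch runs on its own.

```python
import unittest

# --- Cell 1: load the functions Notebook ------------------------------------
# In Synapse, this cell would load the functions Notebook, for example with
# the %run magic (assumed form):
#   %run BusinessFunctionsLibrary
# A stand-in for a function defined in that Notebook is inlined here.
def order_total(line_amounts):
    """Hypothetical business function: sum the line amounts of an order."""
    return sum(line_amounts)


# --- Cell 2: test cases -------------------------------------------------------
class OrderTotalTest(unittest.TestCase):
    def test_sums_all_lines(self):
        self.assertEqual(order_total([10, 20, 5]), 35)

    def test_empty_order_is_zero(self):
        self.assertEqual(order_total([]), 0)


# --- Cell 3: run the tests without exiting the interpreter --------------------
def run_notebook_tests(test_case):
    """Run one TestCase class and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_notebook_tests(OrderTotalTest)
if not result.wasSuccessful():
    # Raising makes the Notebook run fail, which a pipeline can detect.
    raise AssertionError("unit tests failed")
```

Raising on failure is a design choice in this sketch: it lets a scheduled or pipeline-triggered run of the test Notebook surface as a failed run instead of a silent report.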

Summary

Whether you use the multiple-Notebook approach or the library approach depends on your enterprise guidelines, individual preference, and timelines.

 

My next blog will explore code and data quality in Azure Synapse in more depth.

 

 
