Designing and Implementing Modern Data Architecture on Azure Cloud.


I just completed work on the digital transformation, design, development, and delivery of a cloud-native data solution for one of the biggest professional sports organizations in North America.

 

In this post, I want to share some thoughts on the selected architecture and why we settled on it.

 

This architecture was chosen to meet the following requirements, gathered through a very involved and collaborative process with the customer:

 

  1. It should be open and flexible, allowing solutions to be built and delivered using the best services and cloud-native products available.
  2. The architecture should enable secure integration between products/tools/services to allow for efficient running, scalability, and support for a variety of data formats.
  3. It should enable insights by simplifying the ability to build analytics dashboards and operational reports.
  4. It should unify data, analytics, and ML workloads.

 

The architecture is shown in the following screenshot:

 

[Figure: Azure-Based Modern Data Architecture]

This solution meets these requirements by integrating:

 

Azure Databricks:

Built on open-source Apache Spark and Delta Lake, Databricks efficiently handles both the batch and near real-time data workloads required in this project.

 

A combination of Spark Structured Streaming with the trigger-once option and Databricks Auto Loader enabled us to develop for near real-time processing scenarios while reusing the same codebase to meet batch processing requirements when and where necessary. In addition, Auto Loader enables incremental file processing without having to separately set up the corresponding storage queue and Event Grid infrastructure.
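
As a rough illustration, here is a minimal sketch of this pattern; the storage paths, file format, and output location are assumptions for the example, not the project's actual configuration:

```python
# Auto Loader ("cloudFiles") incrementally picks up new files in the RAW zone;
# trigger(once=True) lets the same streaming code run as a scheduled batch job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@internallake.dfs.core.windows.net/events/"                       # assumed
schema_path = "abfss://processed@internallake.dfs.core.windows.net/_schemas/events"      # assumed
checkpoint = "abfss://processed@internallake.dfs.core.windows.net/_checkpoints/events"   # assumed
output_path = "abfss://processed@internallake.dfs.core.windows.net/events"               # assumed

stream_df = (
    spark.readStream
    .format("cloudFiles")                     # Databricks Auto Loader
    .option("cloudFiles.format", "json")      # source file format (assumed)
    .option("cloudFiles.schemaLocation", schema_path)
    .load(raw_path)
)

(
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(once=True)                       # drain available files, then stop (batch-style run)
    .start(output_path)
)
```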

 

Delta Lake, which implements the Lakehouse usage pattern, enables a number of features that facilitated schema evolution, merges, updates, and inserts on data lake files. The Python implementation of these features allowed us to take advantage of existing skill sets and quickly meet business requirements without writing overly long and complex pipeline code.
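
For example, a merge (upsert) with schema evolution can be expressed in a few lines of Python; the table name, source path, join key, and config below are illustrative assumptions:

```python
# Minimal sketch of a Delta Lake merge with schema evolution enabled.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Allow new columns arriving in the source to be added to the target on merge.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forName(spark, "processed.players")                 # assumed target table
updates_df = spark.read.format("delta").load(
    "abfss://raw@internallake.dfs.core.windows.net/players_updates"     # assumed incoming data
)

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.player_id = s.player_id")          # assumed join key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```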

 

Using the same codebase helped us efficiently manage the ingestion and processing of both workloads. Data source files are organized into RAW, PROCESSED, and ANALYTICS zones. Databricks reads from the RAW zone, performs the data cleansing and transformation, then writes the resulting DataFrame to the PROCESSED zone.

 

Further enrichments are performed on the PROCESSED zone files and written to the ANALYTICS zone. This flow matches the medallion design of bronze, silver, and gold layers.
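
A minimal sketch of the PROCESSED-to-ANALYTICS (silver-to-gold) step looks like the following; the paths, grain, and aggregation are assumptions for illustration only:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

processed_df = spark.read.format("delta").load(
    "abfss://processed@internallake.dfs.core.windows.net/game_events"   # assumed path
)

# Aggregate to the grain the business reports on (assumed here).
analytics_df = (
    processed_df
    .groupBy("game_id", "team_id")
    .agg(
        F.count("*").alias("event_count"),
        F.max("event_ts").alias("last_event_ts"),
    )
)

(
    analytics_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://analytics@internallake.dfs.core.windows.net/team_game_summary")  # assumed path
)
```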

 

Azure Data Factory and Azure Data Lake Storage Gen2:

We provisioned Azure Data Factory within its managed VNET. It is also configured with managed private endpoints to enable secure, private integration with both instances of Azure Data Lake. Two data lakes were set up to isolate traffic and access: an external-facing lake for third-party access and an internal-facing lake. The ADF private endpoints ensure that traffic between these two instances is isolated.

 

IP address whitelisting was set up on the external-facing data lake's firewall to control third-party access. The internal data lake is accessible only via private endpoints and restricted VNETs.

 

In addition, we provisioned an Azure virtual machine to host the ADF self-hosted integration runtime. This was used for secure, non-public access to Azure Databricks, which does not support private endpoints at this time. Details of this setup will be addressed in a future post.

 

We provisioned data lake containers for the RAW, PROCESSED, and ANALYTICS zones, with appropriate RBAC role assignments and ACLs configured for clearly defined service principals where necessary for security isolation and access control.

 

In this data flow, ADF writes to the RAW zone in the internal data lake; Databricks/Apache Spark reads from this zone, performs the data cleansing and transformation, then writes the resulting DataFrames to the PROCESSED and ANALYTICS zones based on the aggregation requirements of the business.

 

Azure Synapse Analytics:

We primarily used the data warehousing sub-resource, the dedicated SQL pool, for structured data storage. Transformed, structured, and other profile data from Azure Databricks is written to the Synapse dedicated SQL pool, and to Azure Cosmos DB as needed.

 

The workspace was deployed within its managed VNET. Secure access through Synapse Studio is ensured via a Synapse private link hub and private endpoint. The other Synapse sub-resources (Dev, serverless SQL pool, and dedicated SQL pool) were configured with private endpoints terminating in the same restricted VNET as the Azure Databricks deployment, but in separate subnets, enabling secure DataFrame writes from Azure Databricks to the Synapse dedicated SQL pool.
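
Such a write might look like the following minimal sketch using the Azure Synapse connector available on Databricks; the JDBC URL, staging path, and table name are assumptions for illustration:

```python
# Write a DataFrame (analytics_df, assumed to exist, e.g. from the earlier sketch)
# to a Synapse dedicated SQL pool via the "com.databricks.spark.sqldw" connector,
# staging the data through ADLS Gen2.
(
    analytics_df.write
    .format("com.databricks.spark.sqldw")
    .option("url",
            "jdbc:sqlserver://mysynapse.sql.azuresynapse.net:1433;database=sportsdw")  # assumed
    .option("tempDir",
            "abfss://staging@internallake.dfs.core.windows.net/synapse-tmp")           # assumed staging dir
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.team_game_summary")                                        # assumed table
    .mode("overwrite")
    .save()
)
```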

 

Azure Monitor (Log Analytics):

I developed custom PySpark and native Python functions to capture Spark Structured Streaming metrics and publish them to Azure Log Analytics. This enables Kusto queries, dashboards, and alerts for monitoring pipeline thresholds.

 

The solution consists of two Python functions that extract Spark Structured Streaming metrics from the streaming query and write them to Azure Monitor Log Analytics via a REST API endpoint. I packaged these functions into a Python wheel file to make them easily reusable and deployable to a Databricks cluster.
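
The sketch below illustrates the general pattern (not the packaged functions themselves): one function reads selected fields from a StreamingQuery's lastProgress, and another posts them to the Log Analytics HTTP Data Collector API. The workspace ID, shared key, and custom log type are placeholder assumptions:

```python
import base64, hashlib, hmac, json
from datetime import datetime, timezone
import requests

WORKSPACE_ID = "<log-analytics-workspace-id>"   # assumed placeholder
SHARED_KEY = "<log-analytics-shared-key>"       # assumed placeholder
LOG_TYPE = "SparkStreamingMetrics"              # custom log table name (assumed)


def extract_streaming_metrics(query):
    """Return selected fields from a StreamingQuery's lastProgress dict."""
    progress = query.lastProgress or {}
    return {
        "queryName": progress.get("name"),
        "batchId": progress.get("batchId"),
        "numInputRows": progress.get("numInputRows"),
        "inputRowsPerSecond": progress.get("inputRowsPerSecond"),
        "processedRowsPerSecond": progress.get("processedRowsPerSecond"),
    }


def post_to_log_analytics(record):
    """Send one JSON record to the Log Analytics Data Collector REST endpoint."""
    body = json.dumps([record])
    rfc1123_date = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
    string_to_sign = f"POST\n{len(body)}\napplication/json\nx-ms-date:{rfc1123_date}\n/api/logs"
    signature = base64.b64encode(
        hmac.new(base64.b64decode(SHARED_KEY),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha256).digest()
    ).decode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"SharedKey {WORKSPACE_ID}:{signature}",
        "Log-Type": LOG_TYPE,
        "x-ms-date": rfc1123_date,
    }
    url = f"https://{WORKSPACE_ID}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01"
    requests.post(url, data=body, headers=headers).raise_for_status()
```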

 

Power BI Workspace:

Power BI is connected to the data lake, Azure Synapse Analytics, and Azure Databricks (DBFS) via VNET integration and private endpoints.

 

The network design was based on a hub-and-spoke model with VNET peering, and the services in this solution use Azure Active Directory (Azure AD) to authenticate users.

 

In future posts, as bandwidth permits, I hope to share code samples to demonstrate the automated deployment and configuration of some of the services included in this architecture.
