Azure Data Factory: How to split a file into multiple output files with Bicep
Introduction
In this article we will see how to split a CSV file located in an Azure Storage Account using an Azure Data Factory Data Flow.
We will do it with Azure Bicep in order to demonstrate the benefits of Infrastructure as Code (IaC), including:
- Reviewing the infrastructure that will be deployed, before applying it, through the what-if feature.
- Reproducible and testable infrastructures through templated deployments.
A complete procedure to deploy the following resources is available here: https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file
- Azure Storage Account
- Uploading a test file that will be split
- Azure Data Factory
- Azure Data Factory Linked Service to connect Data Factory to our Storage Account
- Azure Data Factory Data Flow that will split our file
- Azure Data Factory Pipeline to trigger our Data Flow
- Bonus: Azure Data Factory Pipeline to clean up the output container for your demos
Bicep code to create our linked service
The following Bicep code demonstrates how to create a Storage Account linked service.
Not everybody is affected by the following limitation, but to make this demo accessible to everyone we will create the Storage Account linked service with a connection string instead of a managed identity, which is what I usually recommend.
If your blob account enables soft delete, system-assigned/user-assigned managed identity authentication isn’t supported in Data Flow.
If you access the blob storage through private endpoint using Data Flow, note when system-assigned/user-assigned managed identity authentication is used Data Flow connects to the ADLS Gen2 endpoint instead of Blob endpoint. Make sure you create the corresponding private endpoint in ADF to enable access.
Source: Data Flow/User-assigned managed identity authentication
Let’s have a look at the Bicep code!
The only trick here is to grab an existing storage account and pass its connection string through Bicep without having any secret in your code.
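To illustrate, here is a minimal sketch, assuming an existing Data Factory and Storage Account whose names are passed as parameters; the linked service name below is hypothetical and the exact code is available in the repository linked above.

```bicep
// Minimal sketch (hypothetical names): reference an existing Data Factory and Storage Account,
// then build the connection string at deployment time so no secret is stored in the code.
param dataFactoryName string
param storageAccountName string

resource dataFactory 'Microsoft.DataFactory/factories@2018-06-01' existing = {
  name: dataFactoryName
}

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-09-01' existing = {
  name: storageAccountName
}

resource storageLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: 'ArmtemplateStorageLinkedService'
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      // listKeys() is evaluated at deployment time; the account key never appears in the Bicep file
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccount.listKeys().keys[0].value};EndpointSuffix=${environment().suffixes.storage}'
    }
  }
}
```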
Bicep code to create our Azure Data Factory Data Flow
Based on the “Microsoft.DataFactory factories/dataflows” reference, we will create the Azure Data Factory Data Flow that will split our file into multiple files.
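As an illustration, here is a minimal sketch of such a Data Flow resource, assuming hypothetical dataset and resource names (InputCsvDataset, OutputCsvDataset, ArmtemplateSampleSplitFileDataFlow); the sink line carrying the partitioning and file name pattern is detailed a bit further below.

```bicep
// Minimal sketch of a Mapping Data Flow resource (hypothetical dataset and resource names)
resource splitFileDataFlow 'Microsoft.DataFactory/factories/dataflows@2018-06-01' = {
  parent: dataFactory
  name: 'ArmtemplateSampleSplitFileDataFlow'
  properties: {
    type: 'MappingDataFlow'
    typeProperties: {
      sources: [
        {
          name: 'source1'
          dataset: { referenceName: 'InputCsvDataset', type: 'DatasetReference' }
        }
      ]
      sinks: [
        {
          name: 'sink1'
          dataset: { referenceName: 'OutputCsvDataset', type: 'DatasetReference' }
        }
      ]
      transformations: []
      // Data flow script: read the input file and write it to the sink.
      // The sink line with the partitioning and file name pattern is refined below.
      scriptLines: [
        'source(allowSchemaDrift: true,'
        '  validateSchema: false) ~> source1'
        'source1 sink(allowSchemaDrift: true,'
        '  validateSchema: false) ~> sink1'
      ]
    }
  }
}
```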
When using the az deployment what-if option we can preview the planned changes. This is really convenient to review the requested changes before applying them.
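For example, assuming a main.bicep template deployed at the resource-group scope (the resource group name below is a placeholder):

```bash
az deployment group what-if \
  --resource-group rg-data-factory-demo \
  --template-file main.bicep
```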
The Data Flow looks like the following screenshot, where we can see the number of partitions that will be created. In our context it corresponds to the number of CSV files that will be generated from our input CSV file.
The other trick here is to play with a file name pattern to manage the target file names.
The output file names in this sample are built from the input file name, the current date and the output file iteration.
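For illustration, the sink line of the data flow script could look like the following; the $fileName data flow parameter, the date format and the partition count of 5 are assumptions, not the exact values used in the repository.

```bicep
// Hypothetical sink script lines: round-robin partitioning drives the number of output files,
// and the file name pattern combines a file name parameter, the current date and the
// partition iteration token [n].
var sinkScriptLines = [
  'source1 sink(allowSchemaDrift: true,'
  '  validateSchema: false,'
  '  filePattern:(concat($fileName, \'_\', toString(currentDate(), \'yyyyMMdd\'), \'_[n].csv\')),'
  '  partitionBy(\'roundRobin\', 5)) ~> sink1'
]
```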
Split the file through the Pipeline
Through the procedure located here https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file we have created an Azure Data Factory pipeline named “ArmtemplateSampleSplitFilePipeline”; you can trigger it to launch the Data Flow that will split the file.
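As a sketch, assuming the Data Flow resource from the earlier snippet, the pipeline essentially wraps an Execute Data Flow activity; the activity name and the compute sizing below are assumptions.

```bicep
// Minimal sketch of the pipeline that runs the Data Flow
// (activity name and compute sizing are assumptions)
resource splitFilePipeline 'Microsoft.DataFactory/factories/pipelines@2018-06-01' = {
  parent: dataFactory
  name: 'ArmtemplateSampleSplitFilePipeline'
  properties: {
    activities: [
      {
        name: 'ExecuteSplitFileDataFlow'
        type: 'ExecuteDataFlow'
        typeProperties: {
          dataFlow: {
            referenceName: splitFileDataFlow.name
            type: 'DataFlowReference'
          }
          compute: {
            computeType: 'General'
            coreCount: 8
          }
        }
      }
    ]
  }
}
```

Once deployed, the pipeline can be started from Data Factory Studio with Trigger now, or from the command line (for example with the az datafactory extension).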
The following screenshot illustrates the split result produced by the Azure Data Factory Data Flow.
Conclusion
Adopting Bicep or any other Infrastructure as Code (IaC) tool brings efficiency and agility: it is a real ramp-up when designing infrastructures, and it makes them reproducible and testable.
See You in the Cloud
Jamesdld