Azure Data Factory: How to split a file into multiple output files with Bicep

Introduction

 

In this article we will see how to split a CSV file located in an Azure Storage Account using an Azure Data Factory Data Flow.

 

We will do it with Azure Bicep in order to demonstrate the benefits of Infrastructure as Code (IaC), including:

  • Reviewing the planned infrastructure before deployment through the what-if feature.
  • Reproducible and testable infrastructure through templated deployments.

 


A complete procedure to deploy the following resources is available here: https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file

  • Azure Storage Account (a minimal Bicep sketch of this part follows the list)
  • Uploading a test file that will be split
  • Azure Data Factory
  • Azure Data Factory Linked Service to connect Data Factory to our Storage Account
  • Azure Data Factory Data Flow that will split our file
  • Azure Data Factory Pipeline to trigger our Data Flow
  • Bonus: Azure Data Factory Pipeline to clean up the output container for your demos
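
To give an idea of how the repository provisions the storage part, here is a minimal Bicep sketch of the Storage Account and blob container. The container name, SKU and kind below are illustrative assumptions; the repository may use different values.

@description('Name of the Azure storage account that will contain the file we will split.')
param storageAccountName string = 'storage${uniqueString(resourceGroup().id)}'

@description('Name of the blob container holding the input and output folders (illustrative default).')
param blobContainerName string = 'data'

// Storage account hosting the file to split and the generated output files.
resource storageAccount 'Microsoft.Storage/storageAccounts@2021-08-01' = {
  name: storageAccountName
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_LRS'
  }
}

resource blobService 'Microsoft.Storage/storageAccounts/blobServices@2021-08-01' = {
  parent: storageAccount
  name: 'default'
}

// Blob container used by the Data Flow source and sink.
resource blobContainer 'Microsoft.Storage/storageAccounts/blobServices/containers@2021-08-01' = {
  parent: blobService
  name: blobContainerName
}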

 

Bicep code to create our linked service

The following Bicep code demonstrates how to create a Storage Account Linked Service.

 

Not everyone will be affected by the following limitation, but to keep this demo accessible to everyone we will create the Storage Account linked service with a connection string instead of a managed identity, even though a managed identity is definitely what I usually recommend.

 

If your blob account enables soft delete, system-assigned/user-assigned managed identity authentication isn’t supported in Data Flow.
If you access the blob storage through private endpoint using Data Flow, note when system-assigned/user-assigned managed identity authentication is used Data Flow connects to the ADLS Gen2 endpoint instead of Blob endpoint. Make sure you create the corresponding private endpoint in ADF to enable access.

Source: Data Flow/User-assigned managed identity authentication
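
For storage accounts that are not affected by this limitation, the managed identity approach could look roughly like the following sketch, which replaces the connection string with the blob service endpoint. It reuses the dataFactory and storageAccount symbols introduced in the snippet below, the linked service name is illustrative, and it assumes the Data Factory system-assigned identity has been granted a blob data role (e.g. Storage Blob Data Contributor) on the account. The rest of this article sticks to the connection string approach.

// Minimal sketch of the managed identity variant (not used in this demo).
// Assumes the Data Factory system-assigned identity has a blob data role
// such as Storage Blob Data Contributor on the storage account.
resource dataFactoryLinkedServiceMsi 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: 'LinkedServiceBlobStorageMsi' // illustrative name
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      serviceEndpoint: storageAccount.properties.primaryEndpoints.blob
      accountKind: 'StorageV2'
    }
  }
}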

Let’s have a look at the Bicep code!

The only trick here is to grab an existing storage account and pass its connection string through Bicep without having any secret in your code.

@description('Name of the Azure storage account that will contain the file we will split.')
param storageAccountName string = 'storage${uniqueString(resourceGroup().id)}'

resource storageAccount 'Microsoft.Storage/storageAccounts@2021-08-01' existing = {
  name: storageAccountName
}

resource dataFactoryLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryLinkedServiceName
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
  }
}
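
The snippet references a dataFactory symbol and a dataFactoryLinkedServiceName parameter that are declared elsewhere in the repository's template. A minimal sketch of what those declarations could look like, assuming the Data Factory has already been deployed (the default linked service name is illustrative):

@description('Name of the Azure Data Factory instance hosting the linked service.')
param dataFactoryName string

@description('Name of the linked service to create (illustrative default).')
param dataFactoryLinkedServiceName string = 'LinkedServiceBlobStorage'

// Reference the existing Data Factory so it can be used as the parent of the
// linked service above and of the Data Flow created in the next section.
resource dataFactory 'Microsoft.DataFactory/factories@2018-06-01' existing = {
  name: dataFactoryName
}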

Bicep code to create our Azure Data Factory Data Flow

 

Based on the following reference, “Microsoft.DataFactory factories/dataflows”, we will create the Azure Data Factory Data Flow that will split our file into multiple files.

@description('The blob name of the file that will be split.')
param blobNameToSplit string = 'file.csv'

@description('The blob folder path containing the file that will be split.')
param blobFolderToSplit string = 'input'

@description('The blob folder path that will receive the split files.')
param blobOutputFolder string = 'output'

resource dataFactoryLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryLinkedServiceName
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
  }
}

resource dataFactoryDataFlow 'Microsoft.DataFactory/factories/dataflows@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryDataFlowName
  properties: {
    type: 'MappingDataFlow'
    typeProperties: {
      sources: [
        {
          linkedService: {
            referenceName: dataFactoryLinkedService.name
            type: 'LinkedServiceReference'
          }
          name: 'source'
          description: 'File to split'
        }
      ]
      sinks: [
        {
          linkedService: {
            referenceName: dataFactoryLinkedService.name
            type: 'LinkedServiceReference'
          }
          name: 'sink'
          description: 'Split data'
        }
      ]
      transformations: []
      scriptLines: [
        'source(useSchema: false,'
        ' allowSchemaDrift: true,'
        ' validateSchema: false,'
        ' ignoreNoFilesFound: false,'
        ' format: \'delimited\','
        ' container: \'${blobContainerName}\','
        ' folderPath: \'${blobFolderToSplit}\','
        ' fileName: \'${blobNameToSplit}\','
        ' columnDelimiter: \',\','
        ' escapeChar: \'\\\\\','
        ' quoteChar: \'\\\'\','
        ' columnNamesAsHeader: true) ~> source'
        'source sink(allowSchemaDrift: true,'
        ' validateSchema: false,'
        ' format: \'delimited\','
        ' container: \'${blobContainerName}\','
        ' folderPath: \'${blobOutputFolder}\','
        ' columnDelimiter: \',\','
        ' escapeChar: \'\\\\\','
        ' quoteChar: \'\\\'\','
        ' columnNamesAsHeader: true,'
        ' filePattern:(concat(\'${blobNameToSplit}\', toString(currentTimestamp(),\'yyyyMMddHHmmss\'),\'-[n].csv\')),'
        ' skipDuplicateMapInputs: true,'
        ' skipDuplicateMapOutputs: true,'
        ' partitionBy(\'${partitionType}\', ${numberOfPartition})) ~> sink'
      ]
    }
  }
}

When using the az deployment what-if option we can see the following changes. This is really convenient for reviewing the planned changes before applying them.

numberOfSplittedFiles=3
blobFolderToSplit="input"
blobNameToSplit="file.csv"
blobOutputFolder="output"
resourceGroupName=myDataFactoryResourceGroup
dataFactoryName=myDataFactoryName
storageAccountName=myStorageAccountName
blobContainerName=myStorageAccountContainerName

az deployment group what-if \
  --resource-group $resourceGroupName \
  --template-file data-factory-data-flow-split-file.bicep \
  --parameters dataFactoryName=$dataFactoryName \
    storageAccountName=$storageAccountName \
    blobContainerName=$blobContainerName \
    numberOfPartition=$numberOfSplittedFiles \
    blobFolderToSplit=$blobFolderToSplit \
    blobNameToSplit=$blobNameToSplit \
    blobOutputFolder=$blobOutputFolder

[Screenshot: what-if output listing the resources that will be created or modified]

The Data Flow looks like the following screenshot, where we can see the number of partitions that will be created. In our context this corresponds to the number of csv files that will be generated from our input csv file.

[Screenshot: the Data Flow sink showing the configured number of partitions]
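
The partitionBy call in the scriptLines above relies on two parameters, partitionType and numberOfPartition, that the snippet does not declare. Here is a minimal sketch of how they could be declared; the roundRobin default and the minValue constraint are assumptions (roundRobin simply spreads rows evenly across the output files):

@description('Data Flow partition type used by the sink (illustrative default).')
param partitionType string = 'roundRobin'

@description('Number of partitions, i.e. the number of csv files generated from the input file.')
@minValue(1)
param numberOfPartition int = 3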

 

The other trick here is to play with the file name pattern to control the names of the target files.

 

[Screenshot: sink settings showing the file name pattern]

In this sample, the output file names are built from the input file name, the current timestamp and the output file iteration.

concat('file.csv', toString(currentTimestamp(),'yyyyMMddHHmmss'),'-[n].csv')

Split the file through the Pipeline

 

Through the procedure located here https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file we have created an Azure Data Factory pipeline named “ArmtemplateSampleSplitFilePipeline”. You can trigger it to launch the Data Flow that will split the file.
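
The repository creates this pipeline for you. As a rough illustration, a pipeline that runs the Data Flow can be declared in Bicep along the following lines; the activity name is illustrative, and dataFactory and dataFactoryDataFlow refer to the resources declared in the earlier snippets:

// Minimal sketch of a pipeline that executes the Data Flow created earlier.
resource dataFactoryPipeline 'Microsoft.DataFactory/factories/pipelines@2018-06-01' = {
  parent: dataFactory
  name: 'ArmtemplateSampleSplitFilePipeline'
  properties: {
    activities: [
      {
        name: 'SplitFile' // illustrative activity name
        type: 'ExecuteDataFlow'
        typeProperties: {
          dataFlow: {
            referenceName: dataFactoryDataFlow.name
            type: 'DataFlowReference'
          }
        }
      }
    ]
  }
}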

 

The following screenshot illustrates the result of the split performed by the Azure Data Factory Data Flow.


 

[Screenshot: output container with the generated split files]

 

Conclusion

 

Adopting Bicep, or any other Infrastructure as Code (IaC) tool, brings efficiency and agility: it is a real ramp-up when designing infrastructures, and it makes them reproducible and testable.

 

See You in the Cloud

Jamesdld
