
Azure Data Factory: How to split a file into multiple output files with Bicep


Introduction

 

In this article, we will see how to split a CSV file located in an Azure Storage Account using an Azure Data Factory Data Flow.

 

We will do it with Azure Bicep in order to demonstrate the benefits of Infrastructure as Code (IaC), including:

  • Reviewing the planned infrastructure before it is deployed, using the what-if feature.
  • Reproducible and testable infrastructure through templated deployments.

 


 

 

A complete procedure to deploy the following resources is available here: https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file

  • Azure Storage Account
  • Uploading a test file that will be split
  • Azure Data Factory
  • Azure Data Factory Linked Service to connect Data Factory to our Storage Account
  • Azure Data Factory Data Flow that will split our file
  • Azure Data Factory Pipeline to trigger our Data Flow
  • Bonus: Azure Data Factory Pipeline to cleanup the output container for your demos

 

Bicep code to create our linked service

The following Bicep code shows how to create a Storage Account Linked Service in Azure Data Factory.

 

Not everyone will be affected by the following limitation, but in order to keep this demo accessible to everyone we will authenticate the Storage Account linked service with a connection string instead of a managed identity, which is definitely what I usually recommend; a managed-identity variant is sketched after the quote below.

 

If your blob account enables soft delete, system-assigned/user-assigned managed identity authentication isn’t supported in Data Flow.
If you access the blob storage through private endpoint using Data Flow, note when system-assigned/user-assigned managed identity authentication is used Data Flow connects to the ADLS Gen2 endpoint instead of Blob endpoint. Make sure you create the corresponding private endpoint in ADF to enable access.

Source: Data Flow/User-assigned managed identity authentication
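If your environment is not affected by that limitation, a managed-identity variant of the linked service could look like the following minimal sketch (the "-msi" name suffix is illustrative, and the factory's managed identity would still need an appropriate Storage Blob Data role assignment on the account):

// Sketch only: the same linked service, authenticating with the Data Factory
// system-assigned managed identity instead of a connection string.
resource dataFactoryLinkedServiceMsi 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: '${dataFactoryLinkedServiceName}-msi' // illustrative name
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      // With managed identity authentication only the blob service endpoint is supplied.
      serviceEndpoint: 'https://${storageAccount.name}.blob.${environment().suffixes.storage}'
    }
  }
}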

 

 

Let’s have a look at the Bicep code!

The only trick here is to reference an existing storage account and build its connection string in Bicep without keeping any secret in your code.

@description('Name of the Azure storage account that will contain the file we will split.')
param storageAccountName string = 'storage${uniqueString(resourceGroup().id)}'

resource storageAccount 'Microsoft.Storage/storageAccounts@2021-08-01' existing = {
  name: storageAccountName
}

resource dataFactoryLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryLinkedServiceName
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
  }
}

 

 

Bicep code to create our Azure Data Factory Data Flow

 

Based on the reference “Microsoft.DataFactory factories/dataflows”, we will create the Azure Data Factory Data Flow that will split our file into multiple files.

@description('Name of the blob that will be split.')
param blobNameToSplit string = 'file.csv'

@description('Folder path of the blob that will be split.')
param blobFolderToSplit string = 'input'

@description('Output folder path for the split files.')
param blobOutputFolder string = 'output'

resource dataFactoryLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryLinkedServiceName
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
  }
}

resource dataFactoryDataFlow 'Microsoft.DataFactory/factories/dataflows@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryDataFlowName
  properties: {
    type: 'MappingDataFlow'
    typeProperties: {
      sources: [
        {
          linkedService: {
            referenceName: dataFactoryLinkedService.name
            type: 'LinkedServiceReference'
          }
          name: 'source'
          description: 'File to split'
        }
      ]
      sinks: [
        {
          linkedService: {
            referenceName: dataFactoryLinkedService.name
            type: 'LinkedServiceReference'
          }
          name: 'sink'
          description: 'Split data'
        }
      ]
      transformations: []
      scriptLines: [
        'source(useSchema: false,'
        ' allowSchemaDrift: true,'
        ' validateSchema: false,'
        ' ignoreNoFilesFound: false,'
        ' format: \'delimited\','
        ' container: \'${blobContainerName}\','
        ' folderPath: \'${blobFolderToSplit}\','
        ' fileName: \'${blobNameToSplit}\','
        ' columnDelimiter: \',\','
        ' escapeChar: \'\\\\\','
        ' quoteChar: \'\\\'\','
        ' columnNamesAsHeader: true) ~> source'
        'source sink(allowSchemaDrift: true,'
        ' validateSchema: false,'
        ' format: \'delimited\','
        ' container: \'${blobContainerName}\','
        ' folderPath: \'${blobOutputFolder}\','
        ' columnDelimiter: \',\','
        ' escapeChar: \'\\\\\','
        ' quoteChar: \'\\\'\','
        ' columnNamesAsHeader: true,'
        ' filePattern:(concat(\'${blobNameToSplit}\', toString(currentTimestamp(),\'yyyyMMddHHmmss\'),\'-[n].csv\')),'
        ' skipDuplicateMapInputs: true,'
        ' skipDuplicateMapOutputs: true,'
        ' partitionBy(\'${partitionType}\', ${numberOfPartition})) ~> sink'
      ]
    }
  }
}

 

 

 

When using the az deployment what-if option, we can see the following changes. This is really convenient for reviewing the requested changes before applying them.

numberOfSplittedFiles=3
blobFolderToSplit="input"
blobNameToSplit="file.csv"
blobOutputFolder="output"
resourceGroupName=myDataFactoryResourceGroup
dataFactoryName=myDataFactoryName
storageAccountName=myStorageAccountName
blobContainerName=myStorageAccountContainerName

az deployment group what-if \
  --resource-group $resourceGroupName \
  --template-file data-factory-data-flow-split-file.bicep \
  --parameters dataFactoryName=$dataFactoryName \
               storageAccountName=$storageAccountName \
               blobContainerName=$blobContainerName \
               numberOfPartition=$numberOfSplittedFiles \
               blobFolderToSplit=$blobFolderToSplit \
               blobNameToSplit=$blobNameToSplit \
               blobOutputFolder=$blobOutputFolder

 

 

 

[Screenshot: az deployment group what-if output]
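Once the what-if output matches expectations, the same template and parameters can be applied, for example with az deployment group create reusing the variables defined above:

# Apply the deployment previewed above (same parameters as the what-if call).
az deployment group create \
  --resource-group $resourceGroupName \
  --template-file data-factory-data-flow-split-file.bicep \
  --parameters dataFactoryName=$dataFactoryName \
               storageAccountName=$storageAccountName \
               blobContainerName=$blobContainerName \
               numberOfPartition=$numberOfSplittedFiles \
               blobFolderToSplit=$blobFolderToSplit \
               blobNameToSplit=$blobNameToSplit \
               blobOutputFolder=$blobOutputFolder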

 

 

The Data Flow looks like the following screenshot, where we can see the number of partitions that will be created. In our context it corresponds to the number of CSV files that will be generated from our input CSV file.

 

 

[Screenshot: Data Flow sink partitioning settings]

 

The other trick here is to use a file name pattern to control the names of the target files.

 

[Screenshot: sink file name pattern setting]

 

 

 

The output file names in this sample are built from the input file name, the current timestamp and the output file iteration.

concat('file.csv', toString(currentTimestamp(),'yyyyMMddHHmmss'),'-[n].csv')
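With numberOfPartition set to 3, for example, the sink would produce file names similar to the following (the timestamp is only illustrative; [n] is replaced by the partition number):

file.csv20240125173000-1.csv
file.csv20240125173000-2.csv
file.csv20240125173000-3.csv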

 

 

 

 

Split the file through the Pipeline

 

Through the procedure located here https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file we have created an Azure Data Factory pipeline named “ArmtemplateSampleSplitFilePipeline”. You can trigger it to launch the Data Flow that will split the file.
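If you prefer the command line to the portal, the pipeline can also be triggered with the Azure CLI datafactory extension, as in the following sketch (reusing the resource group and factory names from the what-if example):

# Requires the "datafactory" Azure CLI extension.
az datafactory pipeline create-run \
  --resource-group $resourceGroupName \
  --factory-name $dataFactoryName \
  --name ArmtemplateSampleSplitFilePipeline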

 

The following screenshot shows the result of the split performed by the Azure Data Factory Data Flow.


 

[Screenshot: output container with the split files]

 

Conclusion

 

Adopting Bicep or any other Infrastructure as Code (IaC) tool brings efficiency and agility: it is a real accelerator when designing infrastructures, and it makes them reproducible and testable.

 

See You in the Cloud

Jamesdld
