10x Apache Spark performance improvement in Microsoft Fabric

Boosting Apache Spark performance with small JSON files in Microsoft Fabric: learn how to achieve a 10x performance improvement when ingesting many small JSON files in Apache Spark hosted on Microsoft Fabric.

Ian Griffiths, Technical Fellow at endjin, shares insights and techniques for overcoming Spark's difficulties with large numbers of small files, including parallelizing file discovery and optimizing data loading. Follow along for detailed steps and tips to speed up your data processing workflows in Apache Spark on Microsoft Fabric.

  • 00:00 Introduction to Performance Improvement in Apache Spark
  • 00:20 Understanding the Problem with Small Files in Spark
  • 00:38 Our Scenario: Performance Telemetry Collection
  • 01:20 Initial Approach and Disappointment
  • 01:40 Exploring the Root Cause
  • 05:27 Parallelization: The Key to Performance Boost
  • 08:51 Implementing the Solution in Spark
  • 12:43 Conclusion: Balancing Complexity and Performance
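
The description above mentions parallelizing file discovery before loading. As a rough illustration of that idea only (the video's own implementation may well differ), the sketch below lists several folders concurrently with a thread pool and then hands Spark an explicit list of paths together with a schema. The function names are hypothetical, and it assumes the folders are reachable through the ordinary filesystem API and that the resulting paths are in a form your Spark session can read.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_json_files(folder):
    """Return the .json files directly inside one folder, sorted by path."""
    with os.scandir(folder) as entries:
        return sorted(
            e.path for e in entries if e.is_file() and e.name.endswith(".json")
        )


def discover_json_files(folders, max_workers=16):
    """List many folders concurrently instead of one after another.

    Listing is I/O-bound, so threads can overlap the waiting time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_folder = pool.map(list_json_files, folders)  # keeps folder order
        return [path for paths in per_folder for path in paths]


def load_json(spark, paths, schema):
    """Read an explicit list of files with a caller-supplied schema.

    Supplying the schema means Spark does not need a separate pass over
    the files to infer one. `spark` is an existing SparkSession.
    """
    return spark.read.schema(schema).json(paths)
```

In a notebook this would be used along the lines of `df = load_json(spark, discover_json_files(folders), schema)`, where `folders` and `schema` come from your own data layout. Whether a thread pool on the driver or distributing the listing across the cluster works better depends on the storage and the number of folders, so treat the worker count as something to measure rather than a recommendation.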
