10x Apache Spark performance improvement in Microsoft Fabric
Learn how to achieve a 10x performance improvement when ingesting small JSON files with Apache Spark hosted on Microsoft Fabric.
Ian Griffiths, Technical Fellow at endjin, shares insights and techniques for overcoming Spark's difficulties with large numbers of small files, including parallelizing file discovery and optimizing data loading. Follow along for detailed steps and tips to speed up your data processing workflows in Apache Spark on Microsoft Fabric.
- 00:00 Introduction to Performance Improvement in Apache Spark
- 00:20 Understanding the Problem with Small Files in Spark
- 00:38 Our Scenario: Performance Telemetry Collection
- 01:20 Initial Approach and Disappointment
- 01:40 Exploring the Root Cause
- 05:27 Parallelization: The Key to Performance Boost
- 08:51 Implementing the Solution in Spark
- 12:43 Conclusion: Balancing Complexity and Performance
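The general idea behind parallelizing file discovery and loading can be sketched in a few lines. This is a minimal, plain-Python illustration and not the code shown in the video, which works inside Spark. It assumes the small JSON files sit in several directories on a local or mounted file system, and the function names (`list_json_files`, `discover_files`, `load_all`) are hypothetical.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def list_json_files(directory):
    """Return the sorted paths of the .json files directly inside one directory."""
    with os.scandir(directory) as entries:
        return sorted(
            entry.path
            for entry in entries
            if entry.is_file() and entry.name.endswith(".json")
        )


def discover_files(directories, max_workers=8):
    """List many directories concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the input directories.
        per_directory = pool.map(list_json_files, directories)
        return [path for paths in per_directory for path in paths]


def load_json(path):
    """Parse a single small JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_all(paths, max_workers=8):
    """Read many small files concurrently; results follow the order of paths."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_json, paths))
```

Threads suit this sketch because listing and reading small files is dominated by I/O latency rather than CPU work. How much a similar change helps in Spark on Fabric depends on the storage layer and cluster configuration, so treat the 10x figure as the result reported for the scenario in the video rather than a general guarantee.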