10x Apache Spark performance improvement in Microsoft Fabric

Boosting Apache Spark Performance with Small JSON Files in Microsoft Fabric. Learn how to achieve a 10x performance improvement when ingesting small JSON files in Apache Spark hosted on Microsoft Fabric.

Ian Griffiths, Technical Fellow at endjin, shares insights and techniques for overcoming Spark's difficulties with large numbers of small files, including parallelizing file discovery and optimizing data loading. Follow along for detailed steps and tips to significantly speed up your data processing workflows with Apache Spark in Microsoft Fabric.

00:00 Introduction to Performance Improvement in Apache Spark
00:20 Understanding the Problem with Small Files in Spark
00:38 Our Scenario: Performance Telemetry Collection
01:20 Initial Approach and Disappointment
01:40 Exploring the Root Cause
05:27 Parallelization: The Key to Performance Boost
08:51 Implementing the Solution in Spark
12:43 Conclusion: Balancing Complexity and Performance
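
The description mentions parallelizing file discovery as one of the techniques. The listing on this page does not show how the video does it, so the following is only a minimal sketch of the general idea, assuming the JSON files sit in a folder tree reachable through an ordinary file system path. The function names `list_json_files` and `discover_json_files` and the `max_workers` parameter are hypothetical, chosen for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_json_files(folder):
    """Return every .json file under one folder (recursive, single-threaded)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            if name.endswith(".json"):
                found.append(os.path.join(dirpath, name))
    return found


def discover_json_files(root, max_workers=16):
    """List .json files under root, walking each top-level subfolder in parallel.

    Hypothetical helper: it assumes the work divides reasonably evenly
    across the top-level subfolders of root.
    """
    subfolders, top_level = [], []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir():
                subfolders.append(entry.path)
            elif entry.name.endswith(".json"):
                top_level.append(entry.path)

    # Directory listing is I/O-bound, so threads can overlap the waiting time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_folder = pool.map(list_json_files, subfolders)
        paths = top_level + [p for chunk in per_folder for p in chunk]

    return sorted(paths)
```

In a Spark notebook, a list built this way could then be handed to `spark.read.json(paths)`, which accepts a list of paths, so that Spark receives the explicit file list rather than discovering the files itself. Whether this helps, and by how much, depends on the storage layer and folder layout.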

Published on: June 11, 2024

endjin.com

We help small teams achieve big things.
