Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations

This tutorial covers data validation in Microsoft Fabric using Great Expectations. It shows how to establish a data contract in Microsoft Fabric that sets minimum standards for data quality in a pipeline, and how rows that fail the contract can be dropped. The demo uses the Great Expectations Python package to identify validation errors and Fabric's Teams pipeline activity to send messages to data stewards. The tutorial uses the popular Kaggle Titanic data set and takes a deep dive into Notebooks, Pipelines, and the Lakehouse in the Fabric engineering experience, adopting Medallion Architecture and DataOps practices. This video is the second in a series that will together create an end-to-end demo of Microsoft Fabric.
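To make the idea of a data contract concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API and is not the code from the video: the `Expectation` class, the `validate` function, the rule names, and the sample rows are all hypothetical, and the column names are borrowed from the Kaggle Titanic data set purely for illustration. It shows the general pattern of checking each row against a set of named rules, keeping the rows that pass, and recording why the others were rejected.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Expectation:
    """One rule in the contract: a name plus a per-row predicate."""
    name: str
    check: Callable[[Row], bool]


# Hypothetical contract: rule names and thresholds are illustrative only.
CONTRACT = [
    Expectation("passenger_id_not_null", lambda r: r.get("PassengerId") is not None),
    Expectation("pclass_in_set", lambda r: r.get("Pclass") in {1, 2, 3}),
    # A missing age is tolerated in this sketch; a present age must be plausible.
    Expectation(
        "age_between_0_and_120",
        lambda r: r.get("Age") is None or 0 <= r["Age"] <= 120,
    ),
]


def validate(rows: list[Row], contract: list[Expectation]):
    """Split rows into clean rows and a list of failure records."""
    clean, failures = [], []
    for index, row in enumerate(rows):
        failed = [e.name for e in contract if not e.check(row)]
        if failed:
            failures.append({"row": index, "failed": failed})
        else:
            clean.append(row)
    return clean, failures


# Made-up sample rows standing in for data loaded from the Bronze area.
bronze_rows = [
    {"PassengerId": 1, "Pclass": 3, "Age": 22.0},
    {"PassengerId": 2, "Pclass": 5, "Age": 38.0},     # invalid class
    {"PassengerId": None, "Pclass": 1, "Age": 150},   # missing id, implausible age
    {"PassengerId": 4, "Pclass": 1, "Age": None},     # missing age is allowed here
]

clean_rows, failures = validate(bronze_rows, CONTRACT)
print(len(clean_rows), "clean rows;", len(failures), "rejected")
```

Keeping the failure records alongside the clean rows means the bad rows can be dropped without losing the information needed to alert a data steward.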

Chapters:

  • 00:12 Overview of the architecture
  • 00:36 Focus on processing data to Silver
  • 00:55 Application of DataOps principles to data validation and alerting
  • 02:19 Tour of the artefacts in the Microsoft Fabric workspace
  • 02:56 Opening the "Validation Location" notebook and viewing the contents
  • 03:30 Inspect the reference data that is going to be validated by the notebook
  • 05:14 Overview of the key stages in the notebook
  • 05:39 Set up the notebook, using %run to establish utility functions
  • 06:21 Set up a "data contract" using the Great Expectations package
  • 07:45 Load the data from the Bronze area of the lake
  • 08:18 Validate the data by applying the "data contract" to it
  • 08:36 Remove any bad records to create a clean data set
  • 09:04 Write the clean data to the lakehouse in Delta format
  • 09:52 Exit the notebook using mssparkutils to pass back validation results
  • 10:53 Using lineage to discover the pipeline that triggers the notebook
  • 11:01 Exploring the "Process to Silver" pipeline
  • 11:35 Configuration of an "If Condition" to process the notebook exit value
  • 11:56 Setting up a Teams pipeline activity to notify users
  • 12:51 Populating the title and body of Teams message with dynamic information
  • 13:28 Information about the next episode
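The later chapters describe passing validation results back from the notebook, branching on them in the pipeline, and filling in a Teams message. The sketch below illustrates that flow in plain Python under stated assumptions: the JSON field names, the `build_exit_value` and `build_teams_message` helpers, the sample failure record, and the "location" data set name are all hypothetical and are not taken from the video. In a Fabric notebook the summary string would be handed back with `mssparkutils.notebook.exit`, and the branching would be done by the pipeline's "If Condition" activity rather than by Python.

```python
import json


def build_exit_value(total_rows: int, failures: list[dict]) -> str:
    """Summarise validation as a JSON string suitable for a notebook exit value."""
    failed_names = sorted({name for f in failures for name in f["failed"]})
    return json.dumps(
        {
            "success": len(failures) == 0,
            "rows_checked": total_rows,
            "rows_rejected": len(failures),
            "failed_expectations": failed_names,
        }
    )


def build_teams_message(dataset: str, exit_value: str):
    """Return (title, body) when an alert is needed, otherwise None.

    This stands in for the pipeline's "If Condition" plus the dynamic
    title and body of the Teams activity.
    """
    result = json.loads(exit_value)
    if result["success"]:
        return None
    title = f"Data validation failed: {dataset}"
    body = (
        f"{result['rows_rejected']} of {result['rows_checked']} rows were rejected. "
        f"Failed expectations: {', '.join(result['failed_expectations'])}."
    )
    return title, body


# Made-up failure record for illustration.
failures = [{"row": 1, "failed": ["pclass_in_set"]}]
exit_value = build_exit_value(4, failures)

# In a Fabric notebook this string would be returned to the pipeline with
# mssparkutils.notebook.exit(exit_value); here it is simply printed.
print(exit_value)
print(build_teams_message("location", exit_value))
```

Returning a single JSON string keeps the notebook's output easy to parse in the pipeline, since an exit value is passed back as text.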

Additional videos in this series:

This tutorial is the second video in the series From Descriptive to Predictive Analytics with Microsoft Fabric, which will together form an end-to-end demo of Microsoft Fabric. Find out more about the other videos in the series through the link below.

The post Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations originally appeared on Endjin.

Learn more
endjin.com