Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations
This tutorial explores data validation in Microsoft Fabric using Great Expectations. It demonstrates how a data contract can be established in Microsoft Fabric to set minimum standards for data quality in a pipeline, and shows how bad rows can be cleanly dropped. The demo highlights the use of Fabric's Teams pipeline activity and the Great Expectations Python package to identify validation errors and send messages to data stewards. The tutorial uses the popular Kaggle Titanic data set and includes a deep dive into Notebooks, Pipelines, and the Lakehouse in the Fabric data engineering experience, while adopting Medallion Architecture and DataOps practices. This video is the second in a series that together builds an end-to-end demo of Microsoft Fabric.
Chapters:
- 00:12 Overview of the architecture
- 00:36 Focus on processing data to Silver
- 00:55 Application of DataOps principles to data validation and alerting
- 02:19 Tour of the artefacts in the Microsoft Fabric workspace
- 02:56 Open the "Validation Location" notebook and view its contents
- 03:30 Inspect the reference data that is going to be validated by the notebook
- 05:14 Overview of the key stages in the notebook
- 05:39 Set up the notebook, using %run to establish utility functions
- 06:21 Set up a "data contract" using the Great Expectations package (see sketch 1 after this list)
- 07:45 Load the data from the Bronze area of the lake
- 08:18 Validate the data by applying the "data contract" to it
- 08:36 Remove any bad records to create a clean data set
- 09:04 Write the clean data to the lakehouse in Delta format (sketch 2 below illustrates the load, validate, clean, and write steps)
- 09:52 Exit the notebook using mssparkutils to pass back validation results (sketch 3 below)
- 10:53 Use lineage to discover the pipeline that triggers the notebook
- 11:01 Exploring the "Process to Silver" pipeline
- 11:35 Configuration of an "If Condition" to process the notebook exit value (sketch 4 below)
- 11:56 Setting up a Teams pipeline activity to notify users
- 12:51 Populating the title and body of the Teams message with dynamic information (sketch 5 below)
- 13:28 Information about the next episode
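Sketch 1. The video doesn't reproduce the notebook's code, but a data contract along these lines can be defined with the legacy SparkDFDataset API from the great_expectations package. This is a minimal sketch assuming the standard Kaggle Titanic schema; the actual expectations in the notebook may differ:

```python
# Minimal sketch of a "data contract" using the legacy SparkDFDataset
# wrapper from the great_expectations package. Column names assume the
# standard Kaggle Titanic schema.
from great_expectations.dataset import SparkDFDataset

def apply_data_contract(df):
    """Declare minimum quality standards and validate a Spark DataFrame."""
    gdf = SparkDFDataset(df)
    gdf.expect_column_values_to_not_be_null("PassengerId")
    gdf.expect_column_values_to_be_unique("PassengerId")
    gdf.expect_column_values_to_be_in_set("Survived", [0, 1])
    gdf.expect_column_values_to_be_in_set("Pclass", [1, 2, 3])
    gdf.expect_column_values_to_be_between("Age", min_value=0, max_value=120)
    # validate() evaluates every expectation declared above and returns a
    # result object summarising which expectations passed or failed.
    return gdf.validate()
```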
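Sketch 2. The load, validate, clean, and write steps (07:45 to 09:04) might then look like the following. The table paths are illustrative, and re-expressing the contract's rules as Spark filters is just one way to drop failing rows:

```python
# Sketch of the load -> validate -> clean -> write flow. The `spark`
# session is pre-defined in Fabric notebooks; table paths are illustrative.
from pyspark.sql import functions as F

# Load the raw data from the Bronze area of the lake.
bronze_df = spark.read.format("delta").load("Tables/bronze_titanic")

# Apply the contract sketched above; `success` is True only if every
# expectation passed.
results = apply_data_contract(bronze_df)
validation_passed = results.success

# Drop bad records by re-expressing the contract's rules as filters,
# keeping only rows that satisfy all of them. Null ages are kept,
# matching Great Expectations' default handling of nulls in
# expect_column_values_to_be_between.
clean_df = bronze_df.where(
    F.col("PassengerId").isNotNull()
    & F.col("Survived").isin(0, 1)
    & F.col("Pclass").isin(1, 2, 3)
    & (F.col("Age").isNull() | F.col("Age").between(0, 120))
)
rows_dropped = bronze_df.count() - clean_df.count()

# Write the clean data to the Silver area of the lakehouse in Delta format.
clean_df.write.format("delta").mode("overwrite").save("Tables/silver_titanic")
```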
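Sketch 3. To hand the outcome back to the orchestrating pipeline, the notebook can exit with a serialised summary via mssparkutils, which is pre-installed in Fabric notebooks. The payload shape here is an assumption:

```python
# Sketch of exiting the notebook with the validation results.
# The payload shape is an assumption; only the exit mechanism is fixed.
import json
from notebookutils import mssparkutils

exit_payload = {
    "validation_passed": validation_passed,
    "rows_dropped": rows_dropped,
}

# The calling pipeline reads this string from the notebook activity's
# exit value output and can branch on it.
mssparkutils.notebook.exit(json.dumps(exit_payload))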
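Sketch 4. On the pipeline side, the "If Condition" activity evaluates an expression over the notebook's exit value. A hedged sketch in pipeline expression language, assuming the notebook activity is named "Validate Location" and the JSON payload from sketch 3 (the exact path to the exit value can vary between pipeline versions):

```
@equals(json(activity('Validate Location').output.result.exitValue).validation_passed, true)
```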
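Sketch 5. Similarly, the Teams activity's title and body can be populated with dynamic content drawn from the same exit value. Again a sketch, with the same assumed activity name and payload:

```
@concat('Data validation failed in Process to Silver. Rows dropped: ',
  string(json(activity('Validate Location').output.result.exitValue).rows_dropped))
```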
Additional videos in this series:
This tutorial is the second in a series of videos, From Descriptive to Predictive Analytics with Microsoft Fabric, that together build an end-to-end demo of Microsoft Fabric. Find out more about the other videos in the series through the link below.
The post Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations originally appeared on Endjin.
Related posts
Demystifying Delta Lake Table Structure in Microsoft Fabric
If you're wondering about the structure of Delta Lake tables in OneLake for the Lakehouse, this article and video are here to demystify it for...
From Descriptive to Predictive Analytics with Microsoft Fabric | Part 1
This article provides a comprehensive overview of an end-to-end demo of Microsoft Fabric's predictive analytics capabilities using the Kaggle ...
The 4 Main Types of Data Analytics
It's no secret that data analytics is the backbone of any successful operation in today's data-rich world. That being said, did you know that ...
OneLake: Microsoft Fabric’s Ultimate Data Lake
Microsoft Fabric's OneLake is the ultimate solution to revolutionizing how your organization manages and analyzes data. Serving as your OneDri...
Data Modeling for Mere Mortals – Part 4: Medallion Architecture Essentials
If you're a mere mortal trying to grasp the nuances of data modeling, you've come to the right place. In this fourth and final part of the ser...
What is Microsoft Fabric? Full-Service Data Analytics
Microsoft Fabric is a revolutionary platform with the capability to analyze data and give meaningful insights; it is a one-stop-shop...
40 Days of Fabric: Day 4 – Direct Lake
Day 4 of the 40 Days of Fabric series explores Direct Lake and its benefits over traditional DirectQuery or Import storage modes in Microsoft ...
Data validation in Python: a look into Pandera and Great Expectations
Data validation is a vital step in any data-oriented workstream. This post investigates and compares two popular Python data validation packag...
Power Platform self-service analytics Data Export to Data Lake [Preview] | Power Platform Admin Center
Here's how you can extract self-service Power Platform Analytics Data to Azure Data Lake! [Preview]
Dealing with ParquetInvalidColumnName error in Azure Data Factory
Azure Data Factory and Integrated Pipelines within the Synapse Analytics suite are powerful tools for orchestrating data extraction. It is a c...