Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations
This tutorial explores data validation in Microsoft Fabric using Great Expectations. It demonstrates how a data contract can be established in Microsoft Fabric to set minimum standards for data quality in a pipeline, and shows how bad rows can be cleanly dropped. The demo highlights the use of Fabric's Teams pipeline activity and the Great Expectations Python package to identify validation errors and send messages to data stewards. The tutorial uses the popular Kaggle Titanic data set and includes a deep dive into Notebooks, Pipelines, and the Lakehouse in the Fabric data engineering experience, while adopting Medallion Architecture and DataOps practices. This video is the second in a series that together builds an end-to-end demo of Microsoft Fabric.
Chapters:
- 00:12 Overview of the architecture
- 00:36 Focus on processing data to Silver
- 00:55 Application of DataOps principles to data validation and alerting
- 02:19 Tour of the artefacts in the Microsoft Fabric workspace
- 02:56 Open the "Validation Location" notebook and view its contents
- 03:30 Inspect the reference data that is going to be validated by the notebook
- 05:14 Overview of the key stages in the notebook
- 05:39 Set up the notebook, using %run to establish utility functions
- 06:21 Set up a "data contract" using the Great Expectations package (see sketch 1 after this list)
- 07:45 Load the data from the Bronze area of the lake
- 08:18 Validate the data by applying the "data contract" to it
- 08:36 Remove any bad records to create a clean data set
- 09:04 Write the clean data to the lakehouse in Delta format (sketch 2 below illustrates the load, validate, clean, and write steps)
- 09:52 Exit the notebook using mssparkutils to pass back validation results (sketch 3 below)
- 10:53 Use lineage to discover the pipeline that triggers the notebook
- 11:01 Exploring the "Process to Silver" pipeline
- 11:35 Configuration of an "If Condition" to process the notebook exit value (sketch 4 below)
- 11:56 Setting up a Teams pipeline activity to notify users
- 12:51 Populating the title and body of the Teams message with dynamic information (sketch 5 below)
- 13:28 Information about the next episode
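Sketch 1. The video doesn't reproduce the notebook's code, but a data contract along these lines can be defined with the legacy SparkDFDataset API from the great_expectations package. This is a minimal sketch assuming the standard Kaggle Titanic schema; the actual expectations in the notebook may differ:

```python
# Minimal sketch of a "data contract" using the legacy SparkDFDataset
# wrapper from the great_expectations package. Column names assume the
# standard Kaggle Titanic schema.
from great_expectations.dataset import SparkDFDataset

def apply_data_contract(df):
    """Declare minimum quality standards and validate a Spark DataFrame."""
    gdf = SparkDFDataset(df)
    gdf.expect_column_values_to_not_be_null("PassengerId")
    gdf.expect_column_values_to_be_unique("PassengerId")
    gdf.expect_column_values_to_be_in_set("Survived", [0, 1])
    gdf.expect_column_values_to_be_in_set("Pclass", [1, 2, 3])
    gdf.expect_column_values_to_be_between("Age", min_value=0, max_value=120)
    # validate() evaluates every expectation declared above and returns a
    # result object summarising which expectations passed or failed.
    return gdf.validate()
```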
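Sketch 2. The load, validate, clean, and write steps (07:45 to 09:04) might then look like the following. The table paths are illustrative, and re-expressing the contract's rules as Spark filters is just one way to drop failing rows:

```python
# Sketch of the load -> validate -> clean -> write flow. The `spark`
# session is pre-defined in Fabric notebooks; table paths are illustrative.
from pyspark.sql import functions as F

# Load the raw data from the Bronze area of the lake.
bronze_df = spark.read.format("delta").load("Tables/bronze_titanic")

# Apply the contract sketched above; `success` is True only if every
# expectation passed.
results = apply_data_contract(bronze_df)
validation_passed = results.success

# Drop bad records by re-expressing the contract's rules as filters,
# keeping only rows that satisfy all of them. Null ages are kept,
# matching Great Expectations' default handling of nulls in
# expect_column_values_to_be_between.
clean_df = bronze_df.where(
    F.col("PassengerId").isNotNull()
    & F.col("Survived").isin(0, 1)
    & F.col("Pclass").isin(1, 2, 3)
    & (F.col("Age").isNull() | F.col("Age").between(0, 120))
)
rows_dropped = bronze_df.count() - clean_df.count()

# Write the clean data to the Silver area of the lakehouse in Delta format.
clean_df.write.format("delta").mode("overwrite").save("Tables/silver_titanic")
```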
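Sketch 3. To hand the outcome back to the orchestrating pipeline, the notebook can exit with a serialised summary via mssparkutils, which is pre-installed in Fabric notebooks. The payload shape here is an assumption:

```python
# Sketch of exiting the notebook with the validation results.
# The payload shape is an assumption; only the exit mechanism is fixed.
import json
from notebookutils import mssparkutils

exit_payload = {
    "validation_passed": validation_passed,
    "rows_dropped": rows_dropped,
}

# The calling pipeline reads this string from the notebook activity's
# exit value output and can branch on it.
mssparkutils.notebook.exit(json.dumps(exit_payload))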
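Sketch 4. On the pipeline side, the "If Condition" activity evaluates an expression over the notebook's exit value. A hedged sketch in pipeline expression language, assuming the notebook activity is named "Validate Location" and the JSON payload from sketch 3 (the exact path to the exit value can vary between pipeline versions):

```
@equals(json(activity('Validate Location').output.result.exitValue).validation_passed, true)
```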
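Sketch 5. Similarly, the Teams activity's title and body can be populated with dynamic content drawn from the same exit value. Again a sketch, with the same assumed activity name and payload:

```
@concat('Data validation failed in Process to Silver. Rows dropped: ',
  string(json(activity('Validate Location').output.result.exitValue).rows_dropped))
```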
Additional videos in this series:
This tutorial is the second in a series of videos, From Descriptive to Predictive Analytics with Microsoft Fabric, that together build an end-to-end demo of Microsoft Fabric. Find out more about the other videos in the series through the link below.
The post Microsoft Fabric Machine Learning Tutorial - Part 2 - Data Validation with Great Expectations originally appeared on Endjin.
Related posts
Demystifying Delta Lake Table Structure in Microsoft Fabric
If you're wondering about the structure of Delta Lake tables in OneLake for the Lakehouse, this article and video are here to demystify it for...
From Descriptive to Predictive Analytics with Microsoft Fabric | Part 1
This article provides a comprehensive overview of an end-to-end demo of Microsoft Fabric's predictive analytics capabilities using the Kaggle ...
The 4 Main Types of Data Analytics
It's no secret that data analytics is the backbone of any successful operation in today's data-rich world. That being said, did you know that ...
OneLake: Microsoft Fabric’s Ultimate Data Lake
Microsoft Fabric's OneLake is the ultimate solution to revolutionizing how your organization manages and analyzes data. Serving as your OneDri...
Data Modeling for Mere Mortals – Part 4: Medallion Architecture Essentials
If you're a mere mortal trying to grasp the nuances of data modeling, you've come to the right place. In this fourth and final part of the ser...
What is Microsoft Fabric? Full-Service Data Analytics
Microsoft Fabric is a revolutionary platform with the capability to analyze data and give meaningful insights; it is a one-stop-shop...
40 Days of Fabric: Day 4 – Direct Lake
Day 4 of the 40 Days of Fabric series explores Direct Lake and its benefits over traditional DirectQuery or Import storage modes in Microsoft ...
Data validation in Python: a look into Pandera and Great Expectations
Data validation is a vital step in any data-oriented workstream. This post investigates and compares two popular Python data validation packag...
Power Platform self-service analytics Data Export to Data Lake [Preview] | Power Platform Admin Center
Here's how you can extract self-service Power Platform Analytics Data to Azure Data Lake! [Preview]
Dealing with ParquetInvalidColumnName error in Azure Data Factory
Azure Data Factory and Integrated Pipelines within the Synapse Analytics suite are powerful tools for orchestrating data extraction. It is a c...