Evaluating RAG Applications with AzureML Model Evaluation

Why Do We Need RAG?

RAG stands for Retrieval-Augmented Generation, which is a technique for improving the accuracy and reliability of generative AI models with facts retrieved from external sources. RAG is useful for tasks that require generating natural language responses based on relevant information from a knowledge base, such as question answering, summarization, or dialogue. RAG combines the strengths of retrieval-based and generative models, allowing AI systems to access and incorporate external knowledge into their output without the need to retrain the model. 

In the rapidly changing domain of large language models (LLMs), RAG provides an easy way to customize LLMs to your needs and to the ever-growing body of available knowledge. It is an innovative and cost-effective approach to leveraging the power of large language models and augmenting them with domain-specific or organizational knowledge. RAG can improve the quality, diversity, and credibility of the generated text and provide traceability for the information sources.

 

What is the need for RAG Evaluation? 

A RAG pipeline is a system that uses external data to augment the input of a large language model (LLM) for tasks such as question answering, summarization, or dialogue. RAG evaluation is the process of measuring and improving the performance of this RAG pipeline. Some of the reasons why RAG evaluation is needed are: 

  1. To assess the quality, diversity, and credibility of the generated text, as well as the traceability of the sources of information. 
  2. To identify the strengths and weaknesses of the retrieval and generation components of the RAG pipeline and optimize them accordingly. 
  3. To compare different RAG techniques, models, and parameters, and select the best ones for a given use case. 
  4. To ensure that the RAG pipeline meets the requirements and expectations of the end users, and does not produce harmful or misleading outputs. 

Why AzureML Model Evaluation?

AzureML Model Evaluation serves as an all-encompassing hub, providing a unified and streamlined evaluation experience across a diverse spectrum of curated LLMs, tasks, and data modalities. The platform offers highly contextual, task-specific metrics, complemented by intuitive metric and chart visualizations, empowering users to assess the quality of their models and predictions.

AzureML Model Evaluation delivers a versatile experience, offering both an intuitive User Interface (UI) and a powerful Software Development Kit (SDK), a.k.a. the azureml-metrics SDK.

In this blog, we will be focusing on the SDK flow using the azureml-metrics package.

 

Architecture and Flow of AzureML Model Evaluation for a RAG scenario

 

Let's take a look at the GPT-graded metrics and explore what they measure:

GPT-graded metrics for RAG evaluation

 

In addition to these GPT-graded metrics, the azureml-metrics framework offers traditional supervised ML metrics like BLEU and ROUGE, as well as model-graded unsupervised metrics such as perplexity and BLEURT, which measure the quality of the generated text against reference text or a probabilistic model; a short sketch of computing these reference-based metrics follows the figure below.

Model-based metrics
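As a quick illustration, here is a minimal sketch of computing reference-based metrics for a question-answering task with the same compute_metrics entry point. The y_test/y_pred argument names follow the azureml-metrics task-based interface, but the sample data is purely illustrative, and the exact metric names available (for BLEU, ROUGE, and so on) should be confirmed with list_metrics for your installed version:

# A hedged sketch: computing traditional reference-based metrics (e.g., BLEU/ROUGE)
# for a question-answering task. Confirm the supported metric names with list_metrics.
from azureml.metrics import compute_metrics, list_metrics, constants

print(list_metrics(task_type=constants.Tasks.QUESTION_ANSWERING))

y_test = ["Paris is the capital of France."]   # reference answers (illustrative)
y_pred = ["The capital of France is Paris."]   # model-generated answers (illustrative)

qa_result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING,
                            y_test=y_test,
                            y_pred=y_pred)
print(qa_result)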

 

Here's a quick overview of the azureml-metrics SDK implementation for evaluating a multi-turn RAG scenario, offering a glimpse into its capabilities and functionality.

RAG Evaluation (azureml-metrics) for a multi-turn chat-completion task

 

Let's take a step-by-step look at how to use azureml-metrics for a RAG evaluation task (task-based evaluation).

Step 1: Install the azureml-metrics package using the following command:

 

pip install azureml-metrics[all]

 

 

Step 2: Import the required functions from the azureml-metrics package, such as compute_metrics, list_metrics, and list_prompts. For example:

 

from azureml.metrics import compute_metrics, list_metrics, list_prompts, constants

 

The azureml-metrics package offers a 'task-based' classification of metrics, highlighting its ability to compute metrics specific to given tasks. 
This SDK provides pre-defined mappings of relevant metrics for evaluating various tasks. This feature proves invaluable for citizen users who may be uncertain about which evaluation quality metrics are appropriate for effectively assessing their tasks. 

The compute_metrics function is the main scoring function that computes the scores for the given data and task. 

To see the list of metrics associated with a task, you can use the list_metrics function and pass in the task type RAG_EVALUATION, which covers both information retrieval and content generation for context-aware multi-turn conversations:

 

rag_metrics = list_metrics(task_type=constants.Tasks.RAG_EVALUATION)
print("RAG Evaluation based metrics:", rag_metrics)

 

Output of the list_metrics function for the RAG Evaluation task

 

To view the prompts for these GPT* metrics, you can use the list_prompts function: 

 

coherence_prompt = list_prompts(task_type=constants.Tasks.QUESTION_ANSWERING,
                                metric="gpt_coherence")
print(coherence_prompt)

 

Output of the list_prompts function for the Question Answering task

 

Step 3: Preprocess the data for the RAG task. The data should be a list of conversations, where each conversation is itself a list of multi-turn messages:

 

[
    [{"role": "user", "content": question},
     {"role": "assistant", "content": pred,
      "context": {
          "citations": citation
      }}],
    [{"role": "user", "content": question},
     {"role": "assistant", "content": pred,
      "context": {
          "citations": citation
      }}],
]

 

 

For example, you can create a list of processed predictions as follows: 

 

questions = list(data["question"])
y_pred = list(data["answer"])
contexts = list(data["search_intents"])
citations = list(data["context"])

processed_predictions = []
for question, pred, context, citation in zip(questions, y_pred, contexts, citations):
    current_prediction = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": pred,
         "context": {
             "citations": citation
         }}
    ]
    processed_predictions.append(current_prediction)
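The snippet above assumes that data is a tabular dataset (for example, a pandas DataFrame) with question, answer, search_intents, and context columns. A minimal, hypothetical loading step might look like this; the file name and column names are assumptions for illustration only:

# Hypothetical data-loading step; adapt the file name and columns to your own RAG traces.
import pandas as pd

data = pd.read_csv("rag_evaluation_data.csv")  # expected columns: question, answer, search_intents, context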

 

 

Step 4: Now we can compute the RAG metrics using the out-of-the-box, user-facing compute_metrics function.
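The metrics configuration in the next snippet references an openai_params dictionary holding the connection details of the Azure OpenAI deployment that powers the GPT-graded judgments. A minimal sketch is shown below; the exact keys expected may vary by package version, so treat the key names and values as illustrative assumptions and keep secrets out of source code:

# Illustrative (assumed) Azure OpenAI connection settings for the GPT-graded metrics.
# Verify the expected keys against the azureml-metrics documentation for your version.
openai_params = {
    "api_type": "azure",
    "api_base": "https://<your-resource-name>.openai.azure.com/",  # hypothetical endpoint
    "api_version": "2023-07-01-preview",                           # assumed API version
    "api_key": "<your-api-key>",                                   # prefer Key Vault or environment variables
    "deployment_id": "<your-gpt-deployment-name>",                 # hypothetical deployment name
}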

 

metrics_config = {
    "openai_params": openai_params,
    "score_version": "v1",
    "use_chat_completion_api": True,
    "metrics": ["gpt_relevance", "gpt_groundedness", "gpt_retrieval_score"],
}

result = compute_metrics(task_type=constants.Tasks.RAG_EVALUATION,
                         y_pred=processed_predictions,
                         **metrics_config)

 

 

The following output showcases the results generated by the compute_metrics function. 

Aggregated RAG Metrics

 

The azureml-metrics SDK also allows users to view the reasoning/explanation behind the specific score assigned to each turn.
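A hedged sketch of pulling these per-turn details out of the returned result is shown below. The assumption here is that compute_metrics returns a dictionary with aggregated scores and per-instance artifacts; the key names used ("metrics" and "artifacts") are assumptions, so print the raw result first to confirm the structure in your version:

# Assumed result layout: aggregated scores plus per-turn details with reasoning.
# Print the raw result first to confirm the key names for your installed version.
import json

print(json.dumps(result, indent=2, default=str))  # inspect the full structure

aggregated = result.get("metrics", {})   # assumed key for aggregated scores
per_turn = result.get("artifacts", {})   # assumed key for per-turn details/reasoning

print("Aggregated RAG metrics:", aggregated)
print("Per-turn details:", per_turn)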


 

How the azureml-metrics SDK can help users pick the best model for their ML scenario:

To demonstrate the usefulness and effectiveness of the azureml-metrics package, a short comparative analysis was done on different models using GPT-graded metrics from the package. The dataset used is a subset of the TruthfulQA dataset, which serves as a benchmark for evaluating the truthfulness of language models in generating responses to questions within a single-turn RAG chat conversation scenario. In this experiment, we can see how GPT-graded metrics can be used to identify the best model for a given user scenario.

Comparative analysis of model performance using RAG metrics on the TruthfulQA dataset

 

From this graph, we can observe that the gpt-4 model has the highest GPT Coherence, GPT Fluency, and GPT Similarity scores. However, the best model choice depends on the specific requirements of the user's task. For instance, if you need a model that is more coherent and fluent, gpt-4 would be a better choice; if you need an open-source model with performance close to gpt-4, llama-70b-chat would be the next best option.

Just as with the single-turn conversation data above, you can use these metrics to evaluate different models on multi-turn RAG conversations and identify the best model for your use case.
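As a hedged sketch of such a comparison, the loop below reuses the same compute_metrics call for several candidate models. The model names and per-model prediction lists (each already preprocessed into the conversation format from Step 3) are hypothetical placeholders:

# Hypothetical comparison loop: each entry maps a model name to its predictions,
# already converted to the conversation format shown in Step 3.
predictions_by_model = {
    "gpt-4": processed_predictions_gpt4,            # placeholder variables
    "gpt-35-turbo": processed_predictions_gpt35,
    "llama-70b-chat": processed_predictions_llama,
}

comparison = {}
for model_name, model_predictions in predictions_by_model.items():
    comparison[model_name] = compute_metrics(task_type=constants.Tasks.RAG_EVALUATION,
                                             y_pred=model_predictions,
                                             **metrics_config)

# Inspect the aggregated scores side by side to pick the best model for your scenario.
for model_name, model_result in comparison.items():
    print(model_name, model_result)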

 
