Custom RAG solution on podcast data

Large Language Models (LLMs) have become a hot topic in the tech world. For those of us in Belgium (or the Netherlands) with a passion for technology and science, one of the go-to resources is the monthly podcast "Nerdland," which delves into a variety of subjects from bioscience and space exploration to robotics and artificial intelligence.

Recognizing the wealth of knowledge contained in over 100 episodes of "Nerdland," A simple thought came to our minds: why not develop a chatbot tailored for "Nerdland" enthusiasts? This chatbot would leverage the content of the podcasts to engage and inform users. Want to know more of the project with Nerdland? Visit aka.ms/nerdland

This chatbot enables the Nerdland community to interact at another level with the Nerdland content. On top of that, it democratizes the wealth of information in all these podcasts. Since the podcast is in Ducth, the audience around the world is quite limited. Now we can expose this information in dozens of languages, cause these LLMs are capable of multi-language conversations out of the box.

In this blog post, I'll explore the technical details and architecture of this exciting project. I'll discuss the LLMs utilized, the essential components, the process of integrating podcast content into the chatbot, and the deployment strategy used to ensure the solution is scalable, secure, and meets enterprise standards on Azure.

RAG Principles

Upon delving into this article, it's likely you've experimented with chatGPT or other Generative AI (GenAI) models. You may have observed two notable aspects:

Large Language Models (LLMs) excel at crafting responses that are both articulate and persuasive.
However, the content provided by an LLM might be entirely fictional, a phenomenon known as "hallucinations."

LLMs are often employed for data retrieval, offering a user-friendly interface through natural language. To mitigate the issue of hallucinations, it's crucial to ground the LLM's responses in a specific dataset, such as one you privately own. This grounding ensures that the LLM's outputs are based on factual information.

To utilize LLMs with your proprietary data, a method called Retrieval Augmented Generation (RAG) is used. RAG combines the natural language prowess of LLMs with data they haven't been explicitly trained on.

Several components are essential for this process:

Indexing: Your data must be organized in a way that allows for easy retrieval. Indexing structures your data into "documents," making key data points readily searchable.
Depending on your data size, you may need to split it into smaller pieces before indexing. Reason being that LLMs typically only allow a certain amount of input tokens (for reference: GPT3.5 turbo allows 4096 tokens, the newest GPT4o allows up to 128’000 input tokens). Given that LLMs have a finite context window and the amount of tokens has a cost, chunking optimizes this token usage. Typically, data is chunked in increments (e.g., 1024 characters), although the optimal size may vary depending on your data.

Intent Recognition: The user's query is processed by an LLM to extract the "intent," a condensed version of the original question. Querying the index with this intent often produces more relevant results than using the full prompt. The index is then searched using the intent, yielding the top n documents.

Once the relevant documents are identified, they are fed into a Large Language Model. The LLM then crafts a response in natural language, drawing upon the information from these documents. It ensures that the answer is not only coherent but also traces back to the original data sources indexed, maintaining a connection to the factual basis of the information.

Keyword matching is a fundamental aspect of data retrieval, yet it has its limitations. For instance, a search for "car" might yield numerous references to "cars" but overlook related terms like "vehicles" due to the lack of direct word correlation. To enhance search capabilities, it's not just exact keyword matches that are important, but also the identification of words with similar meanings.

This is where vector space models come into play. By mapping words into a vector space, "car" and "vehicle" can be positioned in close proximity, indicating their semantic similarity, while "grass" would be positioned far from both. Such vector representations significantly refine the search results by considering semantically related terms.

Embedding models are the tools that facilitate the translation of words into their vector counterparts. These pretrained models, such as the ADA model, encode words into vectors.

Integrating vector search with the Retrieval Augmented Generation (RAG) model introduces additional steps:

Initially, our index is comprised solely of textual documents. To enable vector-based searching, our data must also be vectorized using the pretrained embedding models. These vectors are then indexed, transforming our textual index into a vector database.

Incoming queries are likewise converted into vectors through an embedding model. This conversion allows for a dual search approach within our vector database, leveraging both keyword and vector similarities.

Finally, the top 'n' documents from the index are processed by the LLM, which synthesizes the information to generate a coherent, natural language response.

RAG for Nerdland Assistant

Our Nerdland Assistant is a Retrieval Augmented Generation (RAG) chatbot, uniquely crafted to harness the rich content from the podcast archives. To achieve this, we've combined a suite of Azure components, each serving a distinct purpose in the chatbot's architecture:

Container Apps: These are utilized to host custom logic in the form of containers in a serverless way, ensuring both cost-effectiveness and scalability.
Logic Apps: These facilitate the configuration of workflows, streamlining the process with ease and efficiency.
Azure OpenAI: This serves as a versatile API endpoint, granting access to a range of OpenAI models including ChatGPT4, ADA, and others.
AI Search: At the core of our chatbot is an index/vector database, which enables the sophisticated retrieval capabilities necessary for the RAG model.
Storage Accounts: A robust storage solution is essential for housing our extensive podcast library, ensuring that the data remains accessible and secure.

The journey of the Nerdland Assistant episode begins when a new MP3 file is uploaded to an Azure storage account. This triggers a series of automated workflows.
A Logic App is triggered, instructing a Container App to convert the stereo MP3 to mono format, a requirement for the subsequent speech-to-text conversion.
Another Logic App initiates the OpenAI Whisper transcription batch API, which processes the MP3s with configurations such as language selection and punctuation preferences.
A third Logic App monitors the transcription progress and, upon completion, stores the results back in the storage account.
A fourth Logic App calls upon a Python Scrapy framework-based Container App to scrape additional references from the podcast's shownotes page.
The final Logic App sets off our custom indexer, hosted in a Container App, which segments the podcast transcripts into smaller chunks and uploads them to Azure AI.
- Each chunk is then crafted into an index document, enriched with details like the podcast title, episode sponsor, and scraped data.
- These documents are uploaded to Azure AI Search, which employs the ADA model to convert the text into vector embeddings, effectively transforming our index into a vector database.
- While Azure AI Search is our chosen platform, alternatives like Azure Cosmos DB, Azure PostgreSQL, or Elastic could also serve this purpose.
Azure AI Search harnesses the ADA model to automatically convert textual documents into vector embeddings, through configuration. For those seeking alternatives, options include Azure Cosmos DB, Azure PostgreSQL, and Elasticsearch, among others.
With our vector database now established atop our podcast episodes, we're ready to implement Retrieval Augmented Generation (RAG). There are several approaches to this, such as manual implementation, using libraries like LangChain or Semantic Kernel, or leveraging Azure OpenAI APIs and Microsoft OpenAI SDKs.

The choice of method depends on the complexity of your project. For more complex systems involving agents, plugins, or multimodal solutions, LangChain or Semantic Kernel might be more suitable. However, for straightforward applications like ours, the Azure OpenAI APIs are an excellent match.
We've crafted our own RAG backend using the Azure OpenAI APIs, which simplifies the process by handling all configurations in a single stateless request. This API abstracts much of the RAG's complexity and requires the following parameters:

LLM Selection: For the Nerdland Copilot, we're currently utilizing GPT-4o.
- Embedding Model: Such as ADA, to vectorize the input.
- Parameters: These include settings like temperature (to adjust the creativity of the LLM's responses) and strictness (to limit responses to the indexed data).
- Vector Database: In our case, this is the AI Search, which contains our indexed data.
- Document Retrieval: The number of documents the vector database should return in response to a query.
- System Prompt: This provides additional instructions to the LLM on the desired tone and behavior, such as "answer informally and humorously, act like a geeky chatbot."
- User Prompt: The original question posed by the user.
The backend we created in step 9, deployed on a Container App, is abstracted by an API Management layer which enhances security, controls the data flow, and offers potential enhancements like smart caching and load balancing for OpenAI.
To maintain a record of chat interactions, we've integrated a Redis cache that captures chat history via session states, effectively archiving the chat history of the Large Language Model (LLM). This server-side implementation ensures that the system prompts remain secure from any end-user modifications.
The final touch to our backend is its presentation through a React frontend hosted on an Azure Static Web App. This interface not only provides a seamless user experience but also offers the functionality for users to view and interact with the sources referenced in each LLM-generated response.

This entire setup is fully scripted as Infrastructure as Code. We utilize Bicep and the Azure Developer CLI to template the architecture, ensuring that our solution is both robust and easily replicable.

LLM Configuration

The quality of the answers by the LLM are significantly shaped by several factors: the system prompts, LLM parameters (such as temperature, max tokens, ..), chunk size, and the indexing method's robustness.

Every single of the above parameters influences the outcome by a lot. The only way to improve the outcome is to assess the results by influencing the parameters. To structure this process, you can make use of PromptFlow. PromptFlow allows you to repeat this LLM tweaking process, keeping track of the quality of the results per configuration.

Responsible AI

When deploying an application that is making use of generative AI; adhering to Responsible AI Principles is crucial. These Microsoft principles guide the ethical and safe use of AI technologies. https://learn.microsoft.com/en-us/legal/cognitive-services/openai/overview

A key advantage of utilizing Azure OpenAI endpoints is the built-in safety content filters provided by Microsoft. These filters function by processing both the input prompts and the generated outputs through a sophisticated array of classification models. The goal is to identify and mitigate the risk of producing any content that could be deemed harmful.

Future of the project and GenAI

Triggered by the above? Feel free to explore yourself on: github.com/azure/nerdland-copilot.

The journey of developing this custom Assistantwas a time-constrained endeavor that laid the groundwork for a basic yet functional system. The possibilities for expansion and enhancement are endless, with potential future enhancements including:

Integration of models like GPT-4o: Enabling speech-based interactions with the bot, offering a more dynamic and accessible user experience.
Data enrichment: Incorporating a broader spectrum of (external) data to enrich the chatbot's knowledge base and response accuracy.
Quality optimization: Embedding LLMOps (for example with: PromptFlow) into the application's core to fine-tune the LLM's performance, coupled with leveraging real-time user feedback for continuous improvement.
Incorporating graph libraries would enable the AI to present answers that are not only informative but also visually compelling, particularly for responses that involve statistical analysis. This would make data interpretation more intuitive for users.
Embracing the adage that "a picture is worth a thousand words," integrating the ability for the AI to communicate through images and videos could improve the way we interact with the technology.
The concept of creating a podcast from scratch by combining various topics is an exciting prospect. It would allow for the generation of unique and diverse content, tailored to the interests and preferences of the individual. A possible way of achieving this might be with the use of “agents”. An agent being a specific task for an LLM (one specific model/prompt/..) This requires a setup where multiple “agents” work together.