
Building RAG on Phi-3 locally using embeddings on VS Code AI Toolkit

In the previous tutorial we created embeddings and added them to the open-source vector database ChromaDB. This is one of the prerequisites for creating any retrieval-augmented generation (RAG) application. If you want to follow the steps for creating the embeddings, please take a look at that earlier part of the series.

 

Since the database was already created in the earlier tutorial, let us now connect it to Phi-3 using the AI Toolkit. The AI Toolkit exposes the model through a local endpoint, which makes API calls straightforward. We can utilize the model on our local machine, completely offline, using port forwarding, which was covered in an earlier blog in this series.
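
As an optional sanity check before wiring the endpoint into the application, you can verify that it is reachable. This is only a sketch: the port (5272) and the OpenAI-compatible /v1/models route are assumptions based on the default AI Toolkit setup, so adjust the URL to match your own endpoint.

import requests

# Optional sanity check that the local AI Toolkit endpoint is reachable.
# Port 5272 and the /v1/models route are assumptions from the default setup.
resp = requests.get("http://127.0.0.1:5272/v1/models")
print(resp.status_code)   # 200 means the endpoint is up
print(resp.text[:200])    # first part of the model listing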

 

Small language models (SLMs) are language models with a smaller computational and memory footprint and a smaller parameter count, and they typically have lower response latency than LLMs. They can be used for efficient on-device processing (for example on mobile phones and edge devices). They are easier to train and adapt to specialized domains, and they are also popular in cases where sensitive data needs to be handled and privacy/security are paramount concerns. Phi is a family of open SLMs developed by Microsoft. You can learn more about the Phi-3 models in detail using this excellent cookbook.

 

All the code for this tutorial is available in the Azure Samples repository.

 

We will develop a basic chat application that enables the Phi-3 SLM to communicate with the vector DB and answer the user's questions. This will be done in two steps.

  1. Create the basic application workflow
  2. Use Streamlit to convert it into a web app.

Basic Python knowledge is needed to understand the code flow. Let’s begin by importing the required libraries.

 

import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

 

  1. Streamlit:
    • Module: streamlit
    • Purpose: Streamlit is a Python library used to create interactive web applications. It is particularly popular for data science and machine learning applications.
    • Usage: The st alias is used to access Streamlit's functions.
  2. ChatOpenAI:
    • Module: langchain_openai
    • Purpose: ChatOpenAI is a class that provides an interface to interact with OpenAI's language models. It allows you to send queries to the model and receive responses.
    • Usage: Used to initialize and configure the OpenAI model for generating responses based on user input.
  3. Chroma:
    • Module: langchain_community.vectorstores
    • Purpose: Chroma is a vector store that allows you to store and retrieve high-dimensional vectors. It is used to store embeddings of documents or text and retrieve them based on similarity searches.
    • Usage: Typically used in applications that require efficient similarity searches, such as document retrieval or question-answering systems.
  4. SentenceTransformerEmbeddings:
    • Module: langchain_community.embeddings
    • Purpose: SentenceTransformerEmbeddings provides a way to generate embeddings using models from the Sentence Transformers library. These embeddings are numerical representations of text that capture semantic meaning.
    • Usage: Used to convert text into embeddings that can be stored in a vector store like Chroma for similarity searches.
  5. StrOutputParser:
    • Module: langchain_core.output_parsers
    • Purpose: StrOutputParser is a class used to parse the output from the language model into a string format.
    • Usage: Used to convert the raw output from the language model into a more usable string format for display or further processing.
  6. ChatPromptTemplate:
    • Module: langchain_core.prompts
    • Purpose: ChatPromptTemplate is a class used to create and manage prompt templates for interacting with the language model. It allows you to define the structure and content of the prompts sent to the model.
    • Usage: Used to create a consistent and structured prompt for querying the language model.
  7. RunnableParallel and RunnablePassthrough:
    • Module: langchain_core.runnables
    • Purpose:
      • RunnableParallel: A class used to run multiple tasks in parallel. It is useful for performing concurrent operations, such as retrieving multiple pieces of information simultaneously.
      • RunnablePassthrough: A class that simply passes the input through without any modification. It can be used as a placeholder or in situations where no processing is needed.
    • Usage: Used to manage and execute multiple tasks concurrently or to pass data through a pipeline without modification.

Once we have imported the libraries, it’s time to initialize the embedding model and the SLM.

 

Let’s first initialize the embedding model. This is necessary to convert text into numerical embeddings. As of now there are no embedding models in the AI Toolkit; once they become available, we will be able to use one directly. For now we can use Hugging Face embeddings or Sentence Transformer embeddings. Since we used Hugging Face embeddings in the previous blog, let’s now try Sentence Transformer embeddings.

 

embeddings = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')

 

  • Module: langchain_community.embeddings
  • Purpose: This line initializes an embedding model using the SentenceTransformerEmbeddings class. The model specified is all-MiniLM-L6-v2, which is a pre-trained model from the Sentence Transformers library.
  • Usage: The embeddings object will be used to convert text into numerical embeddings. These embeddings capture the semantic meaning of the text and can be used for various tasks such as similarity searches, clustering, or feeding into other machine learning models.
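
To confirm the embedding model loaded correctly, you can embed a test query and inspect the resulting vector. This quick check is not part of the final app; all-MiniLM-L6-v2 produces 384-dimensional vectors, so the length should print 384.

# Quick check (optional): embed a sample query and inspect the result.
# all-MiniLM-L6-v2 produces 384-dimensional vectors.
sample_vector = embeddings.embed_query("What is the AI Toolkit?")
print(len(sample_vector))   # expected: 384
print(sample_vector[:5])    # first few values of the embedding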

ChatOpenAI: Initializes a ChatOpenAI model with specific parameters, including a base URL for the API, an API key, a custom model name, and a temperature setting. This model is used to generate responses based on user queries.

 

model = ChatOpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="ai-toolkit",
    model="Phi-3-mini-128k-directml-int4-awq-block-128-onnx",
    temperature=0.7
)

 

Parameters:

  • base_url = "http://127.0.0.1:5272/v1/": Specifies the base URL for the OpenAI API. In this case, it points to a local server running on 127.0.0.1 (localhost) at port 5272.
  • api_key = "ai-toolkit": The API key used to authenticate requests to the OpenAI API. In case of AI Toolkit usage we don’t have to specify any API key.
  • model="Phi-3-mini-128k-directml-int4-awq-block-128-onnx": Specifies the model to be used.
  • temperature=0.7: Sets the temperature parameter for the model, which controls the randomness of the output. A higher temperature results in more random responses, while a lower temperature makes the output more deterministic.

A retriever is a component of generative AI systems that enhances the quality and accuracy of responses by retrieving relevant information from a vast knowledge base.

Benefits of using a RAG retriever:

  • Improved accuracy: By providing relevant information, the retriever helps the AI model generate more accurate and informative responses.
  • Enhanced relevance: The retrieved context ensures that the generated responses are directly related to the user's query.
  • Factual correctness: The retriever can help prevent the AI model from generating incorrect or misleading information.

In essence, a RAG retriever acts as a bridge between the AI model and the world's knowledge, ensuring that the generated responses are both informative and relevant.

Now let’s initialize the vector database and also create a retriever object which will enable the app to search and query in the database.

 

 

load_db = Chroma(persist_directory='./ai-toolkit', embedding_function=embeddings)
retriever = load_db.as_retriever(search_kwargs={'k': 3})

 

Initialize Chroma Vector Store:

  • Module: langchain_community.vectorstores
  • Purpose: This line initializes a Chroma vector store by loading it from the specified directory.
  • Parameters:
    • persist_directory='./ai-toolkit': Specifies the directory where the vector store is saved. This should match the directory used when the vector store was initially created and saved.
    • embedding_function=embeddings: The embedding model used to generate embeddings for the text. This should be the same embedding model used when the vector store was created.
  • Usage: The load_db object represents the loaded vector store, which contains the document embeddings and allows for efficient similarity searches.

Convert to Retriever:

  • Purpose: Converts the Chroma vector store into a retriever object.
  • Parameters:
    • search_kwargs={'k': 3}: Specifies the search parameters for the retriever. In this case, k=3 means that the retriever will return the top 3 most similar documents for any given query.
  • Usage: The retriever object can be used to perform similarity searches on the vector store, retrieving the most relevant documents based on the query.
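
As a quick way to see what the retriever returns, you can query it directly and inspect the matches. This is an optional check rather than part of the app; the "source" metadata key shown here is an assumption and depends on how the documents were ingested in the previous tutorial (older LangChain versions use get_relevant_documents instead of invoke).

# Optional: query the retriever directly and inspect the top-3 matches.
# The "source" metadata key is an assumption from the earlier ingestion step.
docs = retriever.invoke("How do I fine-tune a model with the AI Toolkit?")
for doc in docs:
    print(doc.metadata.get("source", "unknown"), "-", doc.page_content[:80])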

Once we have done this, it’s time to define the system message.

The system message, or metaprompt, serves as the initial instruction that guides the model's responses. It provides the context and sets the direction for the subsequent conversation.

Key components of a system message:

  • Task or Instruction: Clearly defines the desired outcome or action. For example, "Summarize the article on climate change."
  • Context: Provides relevant background information or context to help the model understand the query.
  • Constraints or Limitations: Specifies any specific requirements or restrictions on the response. For instance, "Keep the response concise and informative."

Since our database contains AI Toolkit content, the template is designed to guide the AI assistant's responses, ensuring they are relevant, professional, and focused on the Microsoft Visual Studio Code AI Toolkit. It provides a structured format for the AI to follow when generating responses.

 

template = """ You are a specialized AI assistant for the Microsoft Visual Studio Code AI Toolkit.\n Your responses should be strictly relevant to this product and the user's query. \n Avoid providing information that is not directly related to the toolkit. Maintain a professional tone and ensure your responses are accurate and helpful. Strictly adhere to the user's question and provide relevant information. If you do not know the answer then respond "I dont know".Do not refer to your knowledge base. {context} Question: {question} """

 

The above prompt covers the following aspects

  • Introduction: Sets the context for the AI assistant.
  • Relevance: Ensures responses are relevant to the toolkit and the user's query.
  • Avoid Irrelevant Information: Instructs the AI to avoid unrelated information.
  • Professional Tone: Ensures responses are professional, accurate, and helpful.
  • Adhere to User's Question: Instructs the AI to focus on the user's question.
  • Unknown Answers: Provides guidance on how to respond if the AI does not know the answer.

Besides this, we have the following parameters.

Context Placeholder:

Purpose: A placeholder for the context that will be provided when the template is used. This context will be dynamically filled in during execution.

Question Placeholder:

Purpose: A placeholder for the user's question that will be provided when the template is used. This question will be dynamically filled in during execution.

By using this structure, we ensure that the SLM's response is relevant, informative, and aligned with the specific task or context you've provided. This kind of prompt template is a common component of LangChain applications.

It’s now time to create the prompt template and the output parser.

 

prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()

 

  • ChatPromptTemplate: The ChatPromptTemplate.from_template(template) method creates a prompt template object from the provided template string. This object can be used to format the template with specific context and questions.
  • StrOutputParser: The StrOutputParser object is initialized to parse the output from the AI model into a string format. This ensures that the raw output from the model is converted into a usable string format for display or further processing.
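
To see what the model actually receives, you can fill the placeholders by hand. This is purely illustrative; the context string below is a stand-in for the documents the retriever will supply at runtime.

# Illustrative only: fill the placeholders manually to inspect the final prompt.
filled = prompt.format(
    context="The AI Toolkit lets you run models such as Phi-3 locally in VS Code.",
    question="What does the AI Toolkit do?",
)
print(filled)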

 

setup_and_retrieval = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
)
chain = setup_and_retrieval | prompt | model | output_parser

 

  • RunnableParallel: The RunnableParallel object is created to run multiple tasks in parallel. It retrieves relevant context using the retriever and passes the question through without modification using RunnablePassthrough.
  • Processing Chain: The chain object is created by combining the setup_and_retrieval, prompt, model, and output_parser components using the | operator. This represents the entire processing pipeline, where each component processes the input and passes the result to the next component.

These components work together to create a robust and efficient pipeline for processing user queries, retrieving relevant context, generating responses using the AI model, and parsing the output into a usable format.
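
If you want to peek at what flows between the stages, you can invoke just the retrieval step on its own. This optional sketch simply shows that setup_and_retrieval produces a dict with the retrieved documents under "context" and the untouched query under "question".

# Optional: run only the retrieval step to see what flows into the prompt.
intermediate = setup_and_retrieval.invoke("What is Fine tuning")
print(intermediate["question"])                        # the query, passed through
print(len(intermediate["context"]))                    # number of retrieved documents (3)
print(intermediate["context"][0].page_content[:100])   # snippet of the top match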

That’s it! We have now completed the hard part, so let’s test it out.

 

resp = chain.invoke("What is Fine tuning")
print(resp)

 

  • Invoke the Chain: The chain.invoke method is used to process the query "What is Fine tuning" through the entire chain, which includes context retrieval, prompt formatting, model response generation, and output parsing.

The response from the AI model is printed to the console.

[Image: Finetuning.png]

 

App development with Streamlit:

Since we now want to use the application as a chat-style web app, we will implement the interface with the Streamlit framework.

Let’s start with the title.

 

st.title("AI Toolkit Chatbot") st.write("Ask me anything about the Microsoft Visual Studio Code AI Toolkit.")

 

Set the Title: The st.title function sets the title of the Streamlit web application to "AI Toolkit Chatbot", making it clear to users what the app is about.

Write a Description: The st.write function provides a brief description or introductory text, informing users that they can ask questions about the Microsoft Visual Studio Code AI Toolkit.

 

if 'messages' not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

 

Initialize User Session: Checks if the messages key exists in the Streamlit session state. If not, it initializes it as an empty list. This ensures that chat messages persist across user interactions.

Display Chat Messages: Iterates over the messages stored in the session state and displays each message in the chat interface using Markdown formatting. The role of the message (e.g., "user" or "assistant") is used to differentiate between user and assistant messages.

These steps help to create a persistent and interactive chat interface within the Streamlit web application, allowing users to see the history of their interactions with the AI assistant.

Finally, let’s create the input section.

 

if user_input := st.chat_input("Your question:"):
    st.session_state.messages.append({"role": "user", "content": user_input})
    with st.chat_message("user"):
        st.markdown(user_input)
    response = chain.invoke(user_input)
    st.session_state.messages.append({"role": "assistant", "content": response})
    with st.chat_message("assistant"):
        st.markdown(response)

 

  • Capture User Input: Captures user input from a chat input box and stores it in the user_input variable.
  • Append User Message to Session State: Appends the user's message to the messages list in the Streamlit session state.
  • Display User Message: Displays the user's message in the chat interface using Markdown formatting.
  • Invoke the Processing Chain: Processes the user's input through the entire chain and stores the AI model's response in the response variable.
  • Append Assistant Message to Session State: Appends the assistant's response to the messages list in the Streamlit session state.
  • Display Assistant Message: Displays the assistant's response in the chat interface using Markdown formatting.

These steps allow users to ask questions and receive responses from the AI assistant.

Now we can run the app using the following command:

streamlit run <filename>.py

You should see something like this in the browser:

[Image: StreamlitPage.png]

 

A page will be launched at the default port.

In upcoming posts in this series we will explore more types of RAG implementations with the AI Toolkit.

The RAG Hack series, linked below in the resources, covers different kinds of RAG.

 

Resources
