NL to SQL Architecture Alternatives

Table of Contents

Introduction

Overview

Architecture 1: Few Shot SQL Generation

Architecture 2: Few Shot SQL Generation with RAG

Architecture 3: Fine Tuning SQL Generation with GPT 3.5T

Evaluation

Safety / Security

Additional resources

Appendix A: Prompt Examples

Appendix B: Evaluation Prompt Examples

Introduction

The recent advances in LLMs have created opportunities for businesses to gain leverage and insight into their unstructured text data. A classic example of this has been using the popular Retrieval Augmented Generation (RAG) pattern in which company documents are used as context to answer user questions. These implementations are reducing the time spent reading through lengthy documentation by offloading this task to the LLMs.

 

But what about using LLMs to extract insights from structured relational data such as SQL databases? Many enterprises possess enormous quantities of SQL data that is currently accessible only to technical users who query it directly or who build business dashboards for end users. However, these dashboards are costly to build and limited in scope.

 

With LLMs, there is a new potential to enable non-technical users to extract information from SQL databases by using an LLM to generate SQL queries dynamically to answer their natural language questions. This use case has the potential to enhance understanding of business data and to reduce the time and money spent extracting data from SQL databases.

Overview 

This guide will show multiple approaches to building a chat application that leverages SQL data as context. We will start with the simplest architecture and work progressively with more sophisticated methods to address limitations.

Note: Truncated prompts will be provided in the examples below, but more complete and comprehensive prompts will be included in the final section.

Architecture 1: Few Shot SQL Generation

In this first implementation of an NL to SQL chat app, the LLM builds a SQL query based on the user’s question and a fixed SQL schema that is included in the system prompt.

 

Figure 1 – Few Shot SQL Generation

 

  1. User asks a question: for example, “What was the product line with the highest revenue in 2022?”
  2. App makes a request to the LLM: the prompt might look like the following,

“You are an expert at providing facts from a SQL Database. Given the user question, produce a PostgreSQL query which, when executed, provides the correct answer to the question. Only use the tables and schema given below to answer the question. SQL Schema: {Schema}

User Question: {Question}”

  3. LLM responds with a SQL query: the response might look like, “SELECT p.product_line, SUM(s.revenue) AS total_revenue … “. This string gets stored as a variable, to be used in the next step.
  4. SQL query string is executed, result is saved: the generated query string is sent to the SQL database for execution. (Note: this poses a potential risk of SQL injection; ensure you are limiting permissions, and see the final section on Safety / Security.) The result of the SQL query is saved as a text string (e.g. “Home Appliances, $12,500,000”).
  5. Final LLM call performed: another request is made to the LLM, this time with the user’s original question and the result of the SQL query. The prompt might look like:

“Given the original question, the corresponding PostgreSQL query, and the execution results, produce a truthful and accurate natural language response to the original question.

Question: {question}”

  6. Final Response: the LLM responds with the answer to the user’s question. (A minimal code sketch of these steps follows below.)
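As a rough sketch of the flow above, assuming the openai and psycopg2 Python packages, an Azure OpenAI chat deployment, and a PostgreSQL connection string (the deployment name, environment variable names, and schema variable below are placeholder assumptions, not part of the original design):

# Minimal sketch of Architecture 1: two LLM calls plus one SQL execution.
import os
import psycopg2
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def generate_sql(question: str, schema: str) -> str:
    # LLM call #1: produce a PostgreSQL query from the user's question.
    system = (
        "You are an expert at providing facts from a SQL Database. Given the user "
        "question, produce a PostgreSQL query which, when executed, provides the "
        f"correct answer. Only use the tables and schema below.\nSQL Schema: {schema}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder deployment name
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def execute_sql(sql: str) -> str:
    # Execute the generated query and flatten the rows into a text string.
    with psycopg2.connect(os.environ["PG_CONN_STRING"]) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            rows = cur.fetchall()
    return "; ".join(str(r) for r in rows)

def answer_question(question: str, schema: str) -> str:
    sql = generate_sql(question, schema)
    result = execute_sql(sql)
    # LLM call #2: turn the SQL result into a natural language answer.
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder deployment name
        messages=[{"role": "system", "content": "Given the original question, the "
                   "corresponding PostgreSQL query, and the execution results, produce "
                   "a truthful and accurate natural language response."},
                  {"role": "user", "content": f"Question: {question}\nSQL Query: {sql}\nSQL Query Results: {result}"}],
    )
    return resp.choices[0].message.content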

 

Benefits / Limitations: This architecture is quite simple and benefits in terms of latency by having only two LLM calls. However, this setup is limited because the LLM does not have any examples of valid SQL queries to reference.

Architecture 2: Few Shot SQL Generation with RAG

This implementation is similar to the previous architecture, but rather than having a fixed set of few-shot examples, we create a data preparation step in which we embed examples into a vector database, and then retrieve the examples that are most relevant to the user’s question.

Figure 2 – Few Shot SQL Generation with RAG

 

Below, we discuss the new steps that differentiate Architecture 2 from Architecture 1:

  1. Offline Data Prep: In order to best leverage the few shot examples framework, a set of NL SQL Questions and corresponding SQL Queries (50-100 samples) should be collected and stored in JSON or CSV format. Then, each of these pairs should be embedded in a vector database such as Azure AI Search (Search over JSON blobs - Azure AI Search | Microsoft Learn).

e.g.   {
    "question": "Which product line had the highest profit margin in 2020?",
    "sql_query": "SELECT value FROM table_1…"
  }

 

  2. Embed user’s question: make a request to a text embeddings model, such as text-embedding-ada-002, to convert the user’s question to a vector.
  3. Retrieve few shot examples: then retrieve the top 5 examples with the closest similarity (using Azure AI Search’s cosine similarity algorithm) and include them in the SQL generation request to GPT-4. (A minimal retrieval sketch follows below.)
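The following is a minimal sketch of the embedding and retrieval steps, assuming an Azure AI Search index populated during the offline data prep step; the index name, vector field name, and deployment names are placeholder assumptions:

# Minimal sketch of steps 2-3 (assumed index "nl-sql-examples" with a vector
# field "embedding" and text fields "question" and "sql_query").
import os
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)
search = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="nl-sql-examples",
    credential=AzureKeyCredential(os.environ["SEARCH_API_KEY"]),
)

def retrieve_examples(question: str, k: int = 5) -> str:
    # Step 2: embed the user's question.
    emb = aoai.embeddings.create(
        model="text-embedding-ada-002",  # placeholder deployment name
        input=question,
    ).data[0].embedding

    # Step 3: retrieve the k most similar question/SQL pairs from the index.
    results = search.search(
        search_text=None,
        vector_queries=[VectorizedQuery(vector=emb, k_nearest_neighbors=k, fields="embedding")],
        select=["question", "sql_query"],
    )
    return "\n\n".join(f"Q: {r['question']}\nSQL: {r['sql_query']}" for r in results)

The returned examples string is then inserted into the {examples} slot of the NL to SQL system prompt (see Appendix A, Example 1).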

 

Benefits / Limitations: Grounding the prompt with the most relevant question/SQL pairs typically improves the quality of the generated SQL compared to a fixed prompt. However, it requires collecting and maintaining a set of example pairs, and the additional embedding and retrieval calls add some latency to each request.

Architecture 3: Fine Tuning SQL Generation with GPT 3.5T

One of the central challenges in building an NL to SQL based chat application is handling latency. One method to reduce latency is to fine tune a smaller model, such as GPT 3.5T, for the SQL generation step.

Figure 3 – Fine Tuning SQL Generation with GPT 3.5T

 

This architecture follows roughly the same pattern as architecture 1, but with a preparatory fine tuning step.

 

  1. Offline Fine Tuning – Collect 50-100 NL questions and corresponding SQL Queries. Use Azure OpenAI’s fine tuning capability to fine tune your GPT 3.5 deployment (Customize a model with Azure OpenAI Service - Azure OpenAI | Microsoft Learn)
  2. Steps 2-7 – Same as architecture 1, but directed toward the fine tuned instance of GPT 3.5T.

Note: If sufficient NL / SQL samples are available, fine tuning can be combined with RAG so that few shot examples are included in addition to the fine tuning. Customers should experiment with the exact number of samples to include for best results and ensure they test with unseen samples. (A minimal sketch of the fine tuning step follows below.)
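Below is a minimal sketch of the offline fine tuning step using the Azure OpenAI Python SDK, assuming the collected question/SQL pairs have already been converted to the chat-format JSONL that the fine tuning API expects; the training file name and base model name are placeholder assumptions:

# Minimal sketch of the offline fine tuning step.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Each JSONL line holds one NL question / SQL query pair, e.g.:
# {"messages": [{"role": "system", "content": "Produce a PostgreSQL query..."},
#               {"role": "user", "content": "Which product line had the highest profit margin in 2020?"},
#               {"role": "assistant", "content": "SELECT value FROM table_1..."}]}

# Upload the training file.
training_file = client.files.create(
    file=open("nl_sql_train.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

# Start the fine tuning job against the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-35-turbo",  # placeholder base model name
)
print(job.id, job.status)

Once the job completes, deploy the resulting model in Azure and point the SQL generation step (Architecture 1, step 2) at the fine tuned deployment.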

Evaluation

To evaluate an NL to SQL project, you will need to assess the accuracy of the generated SQL query, the quality of the final response, and the latency of each step. Below, we outline some of the key metrics to capture during the evaluation stage.

Manual Metrics:

The simplest way of evaluating the application is to have a human evaluator look at the final response and determine if it answers the user’s original question correctly. An example of this is End to End Accuracy.

  1. End to End Accuracy – does the final response correctly answer the user’s question?
    1. How to implement: Output all questions and answers to a spreadsheet, and have a human validator enter 1 or 0 in each row to indicate whether the answer was correct or incorrect. Correct Answers / Total Answers = Accuracy. (A small scoring sketch follows below.)
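As a sketch, if the labeled spreadsheet is exported to CSV (the file and column names below are placeholder assumptions), the accuracy calculation is a few lines of Python:

# Minimal sketch of the End to End Accuracy calculation, assuming an exported
# CSV "eval_results.csv" with a "correct" column containing 1 or 0 per row.
import pandas as pd

df = pd.read_csv("eval_results.csv")
accuracy = df["correct"].sum() / len(df)  # Correct Answers / Total Answers
print(f"End to End Accuracy: {accuracy:.2%}")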

The limitation of this method is that it requires a person to look at each response, which can be a lengthy process. Additionally, if the answer is marked incorrect, it may not be clear which part of the process failed. For this reason, we also introduce a set of automated metrics to evaluate each step of the process.

Automated Metrics:

  1. Validity of SQL Query – Is the generated SQL query valid?
    1. How to implement: This can be scored using a Python library like sqlvalidator, or by adding logic to the code to log each instance in which the SQL query fails.
  2. SQL Critic Score – Do the SQL query and result correctly translate the original question?
    1. How to implement: Use the LLM to evaluate itself by sending a prompt with instructions to validate the accuracy of the SQL query, including the original question, SQL query, and result in the request (see Appendix B Example 5).
  3. End to End Relevance – Is the generated NL response relevant to the user’s initial question?
    1. How to implement: Send a prompt to the LLM with instructions to compare the NL response to the user’s initial question and evaluate its relevance (see Appendix B Example 7).
  4. End to End Groundedness – Does the generated NL response logically follow from the context provided to it (i.e. the SQL query)?
    1. How to implement: Send a prompt to the LLM with instructions to compare the NL response to the SQL query and results and determine if the response logically follows (see Appendix B Example 6). A sketch of the first two checks appears below.

Figure 4 – NL to SQL Evaluation Metrics
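As a sketch of the first two automated metrics, assuming the sqlvalidator package is installed and the critic prompt mirrors Appendix B Example 5 (the deployment name and the simple score parsing are placeholder assumptions):

# Minimal sketch of SQL validity and SQL critic scoring.
import os
import sqlvalidator
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def sql_validity(generated_sql: str) -> int:
    # Metric 1: 1 if the generated query parses as valid SQL, else 0.
    parsed = sqlvalidator.parse(generated_sql)
    return 1 if parsed.is_valid() else 0

def sql_critic_score(question: str, generated_sql: str, result: str) -> int:
    # Metric 2: ask the LLM to judge whether the query answers the question (0 or 1).
    critic_prompt = (
        "Your job is to determine if the generated SQL correctly translates the original question.\n"
        f"User question: {question}\n"
        f"Generated PostgreSQL: {generated_sql}\n"
        f"Executed PostgreSQL result: {result}\n"
        "Give a numeric score of 1 if the query correctly handles the question, otherwise 0.\nScore: "
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder deployment name
        messages=[{"role": "user", "content": critic_prompt}],
    )
    score_text = resp.choices[0].message.content.strip()
    # Production code should parse the score more defensively than this.
    return 1 if score_text.startswith("1") else 0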

Safety / Security

LLMs introduce new risks to a web application. In this case, the LLM is generating SQL code that will be executed. This introduces a new attack surface, and the appropriate boundaries should be established prior to releasing to the public.

NL to SQL systems require various safety precautions to ensure accurate, secure, and unbiased results. Here are some key aspects to consider.

Risks:

  • SQL Injection
  • Handling ambiguity and vagueness
  • Mitigating bias
  • Unintentionally writing to database

 

Mitigations:

  • Granular permissions and proper authorization – refine access based on roles and groups.
  • Restrict access to known, authenticated users.
  • Implement parameterized queries that separate data from the query itself.
  • Use a database role with read-only permissions so the generated queries cannot write to the database.
  • Implement strict input validation and sanitization procedures with whitelisting, regular expressions, data type checks, and escaping (a sketch of a guarded execution path follows this list).
  • Implement logging and monitoring.
  • Explainability and transparency
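As one possible sketch of the read-only and validation mitigations above, assuming psycopg2, a dedicated read-only PostgreSQL role, and environment-variable connection details (all names here are placeholder assumptions, not part of the original guidance):

# Minimal sketch of a guarded execution path for LLM-generated SQL.
import os
import re
import psycopg2

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT|CREATE)\b", re.IGNORECASE)

def execute_generated_sql(sql: str) -> list:
    # Input validation: allow a single SELECT statement only.
    cleaned = sql.strip().rstrip(";")
    if not cleaned.upper().startswith("SELECT") or FORBIDDEN.search(cleaned) or ";" in cleaned:
        raise ValueError("Generated query rejected by validation rules")

    # Connect with a read-only role and a short statement timeout.
    conn = psycopg2.connect(
        os.environ["PG_READONLY_CONN_STRING"],
        options="-c default_transaction_read_only=on -c statement_timeout=5000",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(cleaned)
            rows = cur.fetchall()
    finally:
        conn.close()

    # Logging and monitoring: record every generated query and its outcome.
    print(f"Executed NL to SQL query: {cleaned!r}, rows returned: {len(rows)}")
    return rows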

 

Additional resources

Appendix A: Prompt Examples

Example 1: NL to SQL System Prompt

"You are an expert in providing facts from an SQL database. Given the user question, produce a PostgreSQL query which, when executed, provides the correct answer to the question. Only use the tables and schema given below to answer the questions. Learn query patterns from the similar question-answer pairs provided as examples below. Directly start your response with the executable query. Do not start with '''sql. Do not make up new table and column names. Only use the ones available in the schema below.

 

SQL Schema:

{schema}

 

Examples:

{examples}"

 

Example 2: SQL to NL System Prompt

"Given the original question, the corresponding PostgreSQL query, and the execution results, produce a truthful and accurate natural language response to the original question, that is only based on the execution results. Your answer should be detailed and accurate, for numerical responses make sure you infer the appropriate units. If the question is complex, you can provide a multi-sentence response. Make sure that you provide a response that is easy to understand."

 

Example 3: NL to SQL User Prompt

"Your response should be an executable query, do not start with '''sql Provide detailed reasoning behind the PostgreSQL query, please provide it as a comment using -- at the beginning of the line. Formatting example:

SELECT some_name, val FROM some_table WHERE a = number AND b = another_number -- This query selects some_name and val from some_table where a = number and b = another_number. Queries should return a minimal amount of rows, and should not return more rows than necessary.

 

User question: {query}

 

PostgreSQL query:"

 

Example 4: SQL to NL User Prompt

"Question: {question} SQL Query: {sql_query} SQL Query Results: {sql_query_results}"

 

Appendix B: Evaluation Prompt Examples

Example 5: SQL Critic Score Prompt

"Your job is to evaluate the nl-to-sql engine of the USA-Facts app.

The app goes through several steps:

  1. The user question is sent to a Natural Language to SQL converter along with information about the database schema.
  2. The generated SQL query is executed and the output is logged.

Your job is to determine if the generated SQL correctly translates the original question.

User question: {question}

Generated PostgreSQL: {generated_query}

Executed PostgreSQL result: {query_execution_result}

Given the information above, give a numeric score of 0 to the Generated SQL if it doesn't correctly handle the User question, and give a numeric score of 1 if the Generated SQL query correctly handles the User question.

If the SQL query yields an error, give a numeric score of 0.

If the SQL query doesn't error out, but doesn't correctly handle the User question, give a numeric score of 0.

If the SQL execution results are empty, most likely it didn't handle the User's question. Think hard if you want to give it a 1 or a 0.

Score: "

 

Example 6: End to End Groundedness

"You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a government stats  question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

 

You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following ratings:

 

    10: The ANSWER follows logically from the information contained in the CONTEXT.

    0: The ANSWER is logically false from the information contained in the CONTEXT.

    an integer score between 0 and 10, and if such an integer score does not exist, use 0: It is not possible to determine whether the ANSWER is true or false without further information.

 

    Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.

 

    Note that the ANSWER is generated by a computer system; it can contain certain symbols, which should not be a negative factor in the evaluation.

 

    Independent Examples:

    Example Task #1 Input:

    "CONTEXT":

    Query: {Sample SQL Query}

    Query results: {Sample SQL Query Result}

   

    "ANSWER": "{Valid Answer}”

 

    Example Task #1 Output:

    10

 

    Example Task #2 Input:

    "CONTEXT":

    Query: "SELECT area_name, MAX(value) - MIN(value) as increase_in_unemployment_rate

    FROM bls_local_area_unemployment_statistics

    WHERE year >= 2010 AND measure = 'unemployment rate' AND area_type = 'cities and towns above 25,000 population'

    GROUP BY area_name

    ORDER BY increase_in_unemployment_rate DESC

    LIMIT 5";

    Query results: San Luis City, AZ; Atlantic City city, NJ; Flint city, MI; Mount Morris township, MI; and Detroit city, MI

 

    "ANSWER": "The 5 cities where the unemployment rate has increased the most since 2010 are San Luis city, AZ; Atlantic City city, NJ; Flint city, MI; Mount Morris township, MI; and Detroit city, MI."

 

    Example Task #2 Output:

    10

 

    Example Task #3 Input:

    "CONTEXT":

    Query: SELECT SUM(value) AS total_job_openings

    FROM bls_job_turnover

    WHERE state = 'Washington' AND year = 2020

    LIMIT 10;

    Query results: 15773441, 18150000

   

    "ANSWER": "The GDP of Washington state in 2020 was approximately $9,773,441,000,000.0 and the number of job openings was approximately 18,150,000."

 

    Example Task #3 Output:

    0

    """

 

Example 7: End to End Relevance

“Is the answer relevant and related to the question?

If the answer is yes, output 1.

Respond 0 otherwise.

 

Question: {question}

Answer: {answer}

Score:”
