
Complex Data Extraction using Document Intelligence and RAG

Section 1: Introduction

 

Historically, data extraction from unstructured documents was a manual and tedious process. It relied on constant human involvement or on tools that were limited by the variety of document formats, an inability to recognize font colors and styles, and substandard document data quality.

 

These methods were time-consuming and often resulted in errors due to the complexities involved in understanding and interpreting unstructured data. The process also lacked scalability, making it difficult to process large volumes of documents efficiently.

 

With the advancement of LLMs, the ability to extract data from unstructured documents has significantly improved, offering users a more adaptable solution tailored to their individual needs.

 

This guide will show an approach to building a solution for complex entity extraction using Document Intelligence with RAG. 

 

Section 2: Architecture: Complex Entity Extraction using Document Intelligence

 

Azure Document Intelligence (DI) is great for extracting structured data from unstructured documents in most scenarios. In our case, however, tax documents come in thousands of different templates, which makes it challenging for DI to capture specific tax information across the different tax forms. This wide variety stems from the diverse nature of tax documents: each jurisdiction, city, country, and state creates its own unique tax forms, spanning different document types (e.g., invoices, withholding forms, declarations, and reports), which makes training a DI model for each unique tax form impractical.

 

To handle unknown structures, we use LLMs to reason over unstructured documents. We use DI to extract layout and style information, which is then provided to the LLM so it can extract the required details in a process called Doc2Schema. To query the document, we prompted GPT-4o to leverage the DI information, effectively extracting the updated tax information, including but not limited to newly applied tax rates and the locations where they apply, all structured according to the specified schema.

 

Figure 1: Block Diagram

 

Generally, document querying with LLMs is carried out using Retrieval-Augmented Generation (RAG). Following the extraction of layout information through Document Intelligence (DI), semantic chunking is applied to keep related entities, such as tables and paragraphs, within a single chunk. These chunks are then embedded into a vector index, which can later be searched using tools like Azure AI Search. By employing prompt engineering, we can then query the document to retrieve targeted information.
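As a rough illustration of that indexing step (optional in this pipeline, as noted in the notebook's Indexing Block), the sketch below embeds each chunk with an Azure OpenAI embedding deployment and uploads it to an Azure AI Search index. The endpoint, key, index name, deployment name, and field names are placeholders, not values from the original solution.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder resources; replace with your own endpoints, keys, and deployments.
openai_client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)
search_client = SearchClient(
    endpoint="https://<search-resource>.search.windows.net",
    index_name="tax-documents",
    credential=AzureKeyCredential("<search-key>"),
)

def index_chunks(chunks):
    """Embed each chunk and upload it to the vector index for later retrieval."""
    documents = []
    for i, chunk in enumerate(chunks):
        embedding = openai_client.embeddings.create(
            model="text-embedding-ada-002",  # embedding deployment name (assumption)
            input=chunk,
        ).data[0].embedding
        documents.append({"id": str(i), "content": chunk, "contentVector": embedding})
    search_client.upload_documents(documents=documents)
```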

 

Information can be dispersed throughout the document. For instance, a targeted record could be listed in a table row or appear in the introduction, footers, or headers. Searching through these scattered chunks might lead to information loss. To address this problem, we extend the chunk size to the maximum context length of the underlying LLM (128K tokens for GPT-4o). Given the type of documents we handle, one chunk can usually hold an entire document. However, this method is also scalable for larger documents by either using smaller chunks and increasing the number of retrieval results (the k parameter) or by storing essential chunk information in memory.

 

Figure 2: Document Intelligence + RAG

 

Section 3: Implementation

 

Component 1: Azure AI Document Intelligence

 

For the first component, we leveraged Microsoft Document Intelligence to analyze document layouts efficiently. The process begins by specifying the document file path and configuring the Document Intelligence resource with the resource endpoint and key. Notably, the styles feature is added to the feature list only for PDF documents, as Document Intelligence does not currently support styles for HTML documents.

 

Once configured, the document is analyzed using the "prebuilt-layout" model. This model examines the document, identifying sections such as paragraphs, figures, tables, and styles, along with the textual content in markdown. The function then returns the output response from Document Intelligence in a structured "markdown" format, displaying the detected document sections for easy reference and further processing.
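A minimal sketch of this analysis call, using the azure-ai-documentintelligence Python SDK, is shown below. The endpoint, key, and file path are placeholders, and parameter names can vary slightly between SDK versions.

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

# Placeholder endpoint and key for the Document Intelligence resource.
client = DocumentIntelligenceClient(
    endpoint="https://<di-resource>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<di-key>"),
)

with open("sample_tax_notice.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",                  # layout analysis model
        f,
        content_type="application/octet-stream",
        output_content_format="markdown",   # return the content as markdown
        features=["styleFont"],             # styles feature (PDF only, not HTML)
    )
result = poller.result()

print(result.content[:500])  # markdown content: paragraphs, tables, figures, headlines
```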

 

When we examine the main keys in Document Intelligence's output response, we observe the following output dictionary keys.

 

Figure 3: Document Analysis Process

 

 

The "content" key contains the document's markdown content, which includes tables, figures, and headlines formatted in markdown. Demonstrating the following sample document and showing its markdown content:

 

Figure 4: Sample Document

George TBD rate change notice (wa.gov)

 

 

Figure 5: Sample Document Content (DI Output)

 

The "styles" key on the other hand captures all the document's characteristics, such as font weight (normal or bold), font color, whether the text is handwritten, and background color. It groups all spans with the same characteristic into a list. This key map each characteristic to the "content" using span information, indicating the start and end offsets marked with the specific characteristic.

 

Figure 6: Sample Document Styles (DI Output)

 

Component 2: Azure AI Document Intelligence (Styles Feature)

 

Continuing with the Azure AI Document Intelligence component, we aimed to add visual characteristics to the document contents so the LLM can identify updates that are highlighted in a specific font style. We leveraged the "content" key to parse the textual content of documents in markdown format, which serves as the context for the LLM prompt. By extracting styles, we can identify changes that are highlighted through visual elements such as bold font weight, specific font colors, or background colors. To ensure the LLM recognizes these highlighted changes, we used the "styles" key and appended the style information to the markdown "content" in the form of tags. For example, in the sample document above, <color: blue> New tax rate is 0.02 </color> indicates that the sentence enclosed within the tags is blue. We then merged all consecutive spans sharing the same style so they are enclosed within a single tag, to optimize the context length.

 

Similarly, we appended the grounding information by considering the span offsets associated with the "color" key. This ensures that all text locations within the document are included in the context, since every piece of text, regardless of its color (even black), has a specified color attribute as a style.
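The sketch below shows one way this tagging step can be implemented, assuming the AnalyzeResult returned by the layout call above. The tag format follows the <color: ...> example described here, but the helper itself, including how spans are merged, is illustrative rather than the exact notebook code.

```python
def apply_color_tags(content: str, styles) -> str:
    """Wrap colored spans of the markdown content in <color: ...> ... </color> tags."""
    spans = []
    for style in styles or []:
        color = getattr(style, "color", None)
        if not color:
            continue
        for span in style.spans:
            spans.append((span.offset, span.offset + span.length, color))

    # Merge consecutive or overlapping spans that share a color to keep the context short.
    spans.sort()
    merged = []
    for start, end, color in spans:
        if merged and merged[-1][2] == color and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end), color)
        else:
            merged.append((start, end, color))

    # Insert tags back-to-front so earlier offsets remain valid.
    tagged = content
    for start, end, color in reversed(merged):
        tagged = tagged[:end] + " </color>" + tagged[end:]
        tagged = tagged[:start] + f"<color: {color}> " + tagged[start:]
    return tagged

tagged_content = apply_color_tags(result.content, result.styles)
```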

 

Figure 7: Applying Text Styles as Tags' Process

 

A sample document’s markdown content after adding the styles information looks like the following:

 

Figure 8: Sample Document Content After Applying Styles Tags

 

 

Component 3: Semantic Chunking

 

Moving to the third component, semantic chunking: this advanced and effective method manages sequence length based on the document layout. It addresses the challenge of information loss by preserving a full, meaningful section per chunk. However, for certain use cases, the complexity and structure of documents can make semantic chunking an impractical approach.

 

For example, in the sample document displayed above, essential notes, such as how a change is highlighted or specific notations important for the extraction task, are often mentioned in the header or footer of the document. Consequently, if the document were split into multiple chunks, the LLM could fail to identify the records with changes because the chunks containing them are not adjacent to these notes.

 

To address this challenge, we maximized the number of tokens per chunk to match the maximum context length of the GPT-4o model, which is 128K tokens, and set the token overlap to 64K tokens.

 

We used Byte Pair Encoding (BPE), the tokenization scheme underlying the GPT-4 tokenizer, to count tokens accurately. The TokenEstimator class is designed to estimate the number of tokens in a given text using this tokenizer; it provides an estimate_tokens function, which encodes the text and returns the number of tokens.
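A minimal sketch of such a token estimator, and of token-bounded chunking built on top of it, is shown below. It uses tiktoken's cl100k_base encoding as the GPT-4 tokenizer; the chunk size and overlap follow the values quoted above, while the helper itself is illustrative rather than the notebook's exact implementation.

```python
import tiktoken

class TokenEstimator:
    """Estimates token counts using the GPT-4 tokenizer (cl100k_base BPE)."""

    def __init__(self):
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def estimate_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

MAX_TOKENS_PER_CHUNK = 128_000  # GPT-4o maximum context length
TOKEN_OVERLAP = 64_000          # overlap between consecutive chunks

def chunk_by_tokens(text: str, estimator: TokenEstimator) -> list[str]:
    """Split text into token-bounded chunks with overlap; most documents fit in one chunk."""
    tokens = estimator.encoder.encode(text)
    step = MAX_TOKENS_PER_CHUNK - TOKEN_OVERLAP
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + MAX_TOKENS_PER_CHUNK]
        chunks.append(estimator.encoder.decode(window))
        if start + MAX_TOKENS_PER_CHUNK >= len(tokens):
            break
    return chunks

chunks = chunk_by_tokens(tagged_content, TokenEstimator())
print(len(chunks))  # number of resulting chunks for the sample document
```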

 

Figure 9: Semantic Chunking Process

 

 

The number of resulting chunks for the sample document is shown below:

 

Figure 10: Display Number of the Resulting Chunks

 

 

Component 4: Azure OpenAI – LLM Prompting

 

The fourth component focuses on the LLM call. In our experiment, we utilized the "GPT-4o" model deployed in Azure OpenAI Studio, providing the necessary resource credentials and the API version of the deployment. We read the prompt file to manage prompt versioning within the pipeline, then called the chat completion API, passing the prompt template along with the context. We implemented retry mechanisms for the API call to handle potential failures due to quota management or incomplete JSON responses from the OpenAI API. Additionally, we added the "force_json" option to ensure the output response is serialized in JSON format.
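The sketch below mirrors that call under assumed deployment and API-version values; the response_format option is the "force_json" setting mentioned above, and the simple retry loop stands in for the retry mechanisms described. The actual implementation appears in Figure 11.

```python
import json
import time
from openai import AzureOpenAI

# Placeholder Azure OpenAI resource credentials and API version.
client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)

def extract_records(prompt_template: str, context: str, retries: int = 3) -> dict:
    """Call the GPT-4o deployment and return its JSON response, retrying on failures."""
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",  # Azure OpenAI deployment name (assumption)
                temperature=0,
                response_format={"type": "json_object"},  # force JSON-serialized output
                messages=[
                    {"role": "system", "content": prompt_template},
                    {"role": "user", "content": context},
                ],
            )
            return json.loads(response.choices[0].message.content)
        except Exception:  # throttling, transient API errors, or incomplete JSON
            time.sleep(2 ** attempt)
    raise RuntimeError("LLM call failed after retries")
```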

 

Figure 11: LLM Call using AzureOpenAI Framework

 

 

Now let's examine how the prompt was crafted to achieve this extraction use case.

 

We divided the task into three key steps, providing detailed guidance for the LLM at each stage. First, the LLM needed to identify how tax rate change records are highlighted within the document, supported by possible notations such as "*", bold text, or colored text.

 

Next, we defined the information to be extracted as fields, with a description for each, including specific formatting requirements, such as for dates. Finally, the extracted fields were formatted into a predefined JSON schema, creating a list of JSON objects where each object represents a tax rate change record for a particular tax type.

 

Throughout the process, grounding information for each extracted field was preserved. This list of objects was then placed under a dummy key, "results," allowing all extracted objects to be captured when the response_format is set to enforce JSON output.
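To make the expected shape concrete, the snippet below sketches a response with the dummy "results" key; the field names and grounding structure are hypothetical placeholders, not the exact schema defined in the prompt (Figure 12).

```python
# Hypothetical response shape; field names are illustrative only.
example_response = {
    "results": [
        {
            "tax_type": "<tax type of the changed rate>",
            "location": "<jurisdiction the rate applies to>",
            "new_tax_rate": "<updated rate>",
            "effective_date": "<date in the required format>",
            "grounding": {"start_offset": "<int>", "end_offset": "<int>"},
        }
    ]
}
```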

 

Figure 12: Prompt Given to the LLM

 

 

The final output for the chosen sample document is shown below:

 

Figure 13: Extracted Records Corresponding to the Sample Document

 

 

 

As shown above, each JSON object represents a changed tax rate applied within a specified location at a specific tax rate level. Additionally, each field carries its location within the document (the start and end offsets are produced by the model, based on the prompt above, relative to the original document content), so its source in the original document can be easily identified.

 

In the next section, we share insights from this experiment's overall evaluation.

 

Section 4: Evaluation & Metrics

 

Using ground truth and prediction data, we measure precision, recall, and F1 score as our metrics. For a prediction to count as correct, all fields in the ground truth record must match exactly with those in the predicted record.

 

Metric      Value (%)
Precision   56.39
Recall      37.08
F1 Score    44.74
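A minimal sketch of this exact-match evaluation is shown below; the record normalization and input format are assumptions rather than the notebook's exact implementation.

```python
def evaluate(predictions: list[dict], ground_truth: list[dict]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 where a prediction counts only on an exact field match."""
    def normalize(record: dict) -> tuple:
        return tuple(sorted((k, str(v)) for k, v in record.items()))

    pred_set = {normalize(r) for r in predictions}
    gt_set = {normalize(r) for r in ground_truth}
    true_positives = len(pred_set & gt_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gt_set) if gt_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```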

 

To conduct a thorough analysis of our results, we performed an ablation study to determine which entities impact the extraction process. The figure below shows the metrics derived by progressively adding one entity to the record. A decrease in metrics upon including an entity suggests an issue with that entity.

 

Figure: Ablation study metrics, adding one entity at a time

 

We also analyzed documents to find further issues. Sometimes, GPT-4o's predictions are more accurate than the ground truth. For instance, the model can identify a parent city from the title that the ground truth misses. Despite being correct, the record is marked wrong due to a parent_city mismatch. This shows that GPT-4o can surpass human annotations, particularly with scattered information, highlighting the need for multiple reviews of the ground truth using GPT.

 


 

 

Section 5: Supporting Documentation

 

Want to try it on your own? Access the notebook here:

Azure/complex_data_extraction_with_llms: Complex Data Extraction with LLMs (github.com)

 

Notebook Contents:

  • Resources Settings and Credentials.
  • Document Layout Analysis using Document Intelligence (DI) Block.
  • Storing DI Outputs Block (Optional).
  • Applying Styles Tags Block.
  • Semantic Chunking Block.
  • Indexing Block (Optional).
  • Azure OpenAI LLM Model Call Block.
  • LLM Response Cleaning Block.
  • Storing Predictions Vs. Ground Truth Block.
  • Calculating Evaluation Metrics Block.
  • Reorder Extracted Entities Block (Optional).

 

 

 

 

 

 

 

 
