Building the Backbone: Data Ingestion and IVF Index Creation for DTDL's Autonomous Assistant

Welcome back! In this blog, we will learn how to prepare data for our retrieval process. There are various approaches to this, but we’ll use Azure Document Intelligence to extract information from unstructured documents. By the end, you’ll know how to:

  • Work with the Azure Document Intelligence Layout model 📄
  • Perform custom chunking and add metadata 📝
  • Vectorize data into embeddings 🔍

The nature of data can vary greatly and often includes unstructured formats such as PDFs, audio, video, and images. Our first step is to prepare this data so we can retrieve relevant information from it. While we could use a simple parser to get text content from PDFs, not all information will be in plain text. PDFs can contain complex tables, images, and other elements that might confuse LLMs if extracted through a simple parser. Therefore, we need to rely on advanced services that handle this task more effectively.

In our case, since DTDL maintains PDFs for FAQs, we used Azure Document Intelligence to extract information.

Azure AI Document Intelligence is a cloud-based service that automates document processing by extracting structured data from various document types using prebuilt, custom, and document analysis models. It supports diverse use cases such as invoice processing, identity verification, and tax form management, and offers integration through APIs and SDKs.

Ah, I’m not going to bore you with the details on how to create it. Please follow this doc that provides all the steps to create this service in the Azure portal.

If you have been following my earlier blogs, you know that I chose the Custom model to extract information from documents. However, I haven’t covered when to use each model in detail. So, here is a detailed mindmap for choosing the right model for your needs.

To try the Layout model on your own document:

  • Go to Azure Document Intelligence Studio
  • Select the Layout model
  • Upload the PDF file you want to analyze
  • Click the Run analysis button

The Azure Document Intelligence service processes documents and returns the content organized into sections and paragraphs. This allows us to easily create chunks based on these sections and their associated paragraphs. The logic is straightforward: each paragraph is grouped under the most recent section title until a new section title is encountered.

Next, we need to implement this logic in our function app. We’ll set up an HTTP trigger for uploading files, and in the handler, we’ll use the Azure Document Intelligence .NET SDK to extract information from our FAQ PDF.

Here’s the code logic for grouping sections by headings:

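        // Groups layout paragraphs into sections. Each "sectionHeading" paragraph
        // starts a new section, and body paragraphs (Role == null) are attached to
        // the most recent heading. Paragraphs with any other role (for example page
        // headers or footers) and body text before the first heading are skipped.
        private static IEnumerable<Section> GroupSectionsByHeading(IEnumerable<DocumentParagraph> paragraphs)
        {
            var sections = new List<Section>();
            Section currentSection = null;

            foreach (var paragraph in paragraphs)
            {
                if (paragraph.Role == "sectionHeading")
                {
                    // A new heading closes the section collected so far.
                    if (currentSection != null)
                    {
                        sections.Add(currentSection);
                    }
                    currentSection = new Section
                    {
                        Heading = paragraph.Content,
                        Paragraphs = new List<string>()
                    };
                }
                else if (currentSection != null && paragraph.Role == null)
                {
                    // Plain body text belongs to the current section.
                    currentSection.Paragraphs.Add(paragraph.Content);
                }
            }

            // Flush the last section, which no later heading has closed.
            if (currentSection != null)
            {
                sections.Add(currentSection);
            }

            return sections;
        }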
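
To make the effect of this grouping concrete, here is a minimal, self-contained sketch that runs without an Azure resource. The `Paragraph` record is a hypothetical stand-in for the SDK's `DocumentParagraph`, the `Section` class shape is an assumption inferred from the properties the method above uses (the post does not show its definition), and the sample FAQ text is made up for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK's DocumentParagraph: only the two
// members the grouping logic reads. Role is null for ordinary body text.
public sealed record Paragraph(string? Role, string Content);

// Assumed shape of Section, inferred from how the method above uses it.
public sealed class Section
{
    public string Heading { get; set; } = "";
    public List<string> Paragraphs { get; set; } = new();
}

public static class SectionGrouper
{
    // Same rule as above: a "sectionHeading" starts a new section, and
    // role-less paragraphs are attached to the most recent heading.
    public static List<Section> Group(IEnumerable<Paragraph> paragraphs)
    {
        var sections = new List<Section>();
        Section? current = null;

        foreach (var p in paragraphs)
        {
            if (p.Role == "sectionHeading")
            {
                current = new Section { Heading = p.Content };
                sections.Add(current);
            }
            else if (current != null && p.Role == null)
            {
                current.Paragraphs.Add(p.Content);
            }
        }

        return sections;
    }
}

public static class GroupingDemo
{
    public static void Main()
    {
        // Made-up sample input, in the order a layout result would list it.
        var paragraphs = new[]
        {
            new Paragraph("title", "Sample FAQ"),                  // skipped: has a role
            new Paragraph(null, "Intro text"),                     // skipped: no heading yet
            new Paragraph("sectionHeading", "How do I pay my bill?"),
            new Paragraph(null, "You can pay online."),
            new Paragraph(null, "You can also pay in store."),
            new Paragraph("pageFooter", "Page 1"),                 // skipped: has a role
            new Paragraph("sectionHeading", "How do I reset my router?"),
            new Paragraph(null, "Hold the reset button for ten seconds."),
        };

        foreach (var s in SectionGrouper.Group(paragraphs))
        {
            Console.WriteLine($"{s.Heading} -> {s.Paragraphs.Count} paragraph(s)");
        }
    }
}
```

Running this prints two sections, the first with 2 paragraphs and the second with 1. Note that the title, the footer, and the intro text that appears before the first heading do not end up in any chunk; if that content matters for retrieval, the rule would need to be relaxed.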

Once we have our sections, we’ll proceed to chunk them, vectorize the chunks, and store the results in Azure Cosmos DB.


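                    // Group the layout result into heading + paragraphs sections.
                    var sections = GroupSectionsByHeading(faqRecognizerTask.Result.Paragraphs);

                    // Build one embedding task per section. Select is lazy, so no
                    // request is sent until the tasks are enumerated and awaited
                    // (for example with Task.WhenAll).
                    var tasks = sections.Select(async section =>
                    {
                        // Each section's paragraphs form a single chunk.
                        var content = string.Join("\n", section.Paragraphs);
                        var (embeddings, tokens) = await _openAIService.GetEmbeddingsAsync(content);

                        // The heading is kept as metadata next to the vector.
                        return new SectionEmbedding
                        {
                            Id = Guid.NewGuid().ToString(),
                            Heading = section.Heading,
                            Content = content,
                            EmbeddingVector = embeddings,
                            Tokens = tokens
                        };
                    });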
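
The snippet above only builds the tasks; they still have to be awaited and the results persisted. Here is a minimal sketch of one way to do that with `Task.WhenAll`. Apart from the `SectionEmbedding` property names, everything in it is an assumption: the property types, `IEmbeddingStore`, `InMemoryEmbeddingStore`, and the fake embedding function are hypothetical stand-ins so the example runs locally. In the function app, the store would presumably wrap a Cosmos DB container (for example through the SDK's `Container.UpsertItemAsync`), but that wiring is not shown in this post.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Assumed shape, inferred from the properties set in the snippet above.
public sealed class SectionEmbedding
{
    public string Id { get; set; } = "";
    public string Heading { get; set; } = "";
    public string Content { get; set; } = "";
    public float[] EmbeddingVector { get; set; } = Array.Empty<float>();
    public int Tokens { get; set; }
}

// Hypothetical abstraction over the document store.
public interface IEmbeddingStore
{
    Task UpsertAsync(SectionEmbedding item);
}

// In-memory implementation so the sketch runs without a database.
public sealed class InMemoryEmbeddingStore : IEmbeddingStore
{
    public ConcurrentDictionary<string, SectionEmbedding> Items { get; } = new();

    public Task UpsertAsync(SectionEmbedding item)
    {
        Items[item.Id] = item;
        return Task.CompletedTask;
    }
}

public static class Ingestion
{
    // embed is a stand-in for _openAIService.GetEmbeddingsAsync.
    // Returns the total token count across all sections.
    public static async Task<int> EmbedAndStoreAsync(
        IEnumerable<(string Heading, string Content)> sections,
        Func<string, Task<(float[] Embeddings, int Tokens)>> embed,
        IEmbeddingStore store)
    {
        // ToList() materializes the lazy Select so each task starts exactly once.
        var tasks = sections.Select(async s =>
        {
            var (embeddings, tokens) = await embed(s.Content);
            await store.UpsertAsync(new SectionEmbedding
            {
                Id = Guid.NewGuid().ToString(),
                Heading = s.Heading,
                Content = s.Content,
                EmbeddingVector = embeddings,
                Tokens = tokens
            });
            return tokens;
        }).ToList();

        int[] tokenCounts = await Task.WhenAll(tasks);
        return tokenCounts.Sum();
    }
}

public static class IngestionDemo
{
    public static async Task Main()
    {
        var store = new InMemoryEmbeddingStore();
        var sections = new[]
        {
            (Heading: "Billing", Content: "Pay online."),
            (Heading: "Router", Content: "Hold reset."),
        };

        // Fake embedder: a one-element vector, with the word count as token count.
        Task<(float[] Embeddings, int Tokens)> FakeEmbed(string text) =>
            Task.FromResult((new float[] { text.Length }, text.Split(' ').Length));

        int total = await Ingestion.EmbedAndStoreAsync(sections, FakeEmbed, store);
        Console.WriteLine($"{store.Items.Count} sections stored, {total} tokens");
    }
}
```

With the fake embedder this prints `2 sections stored, 4 tokens`. Because all sections are embedded concurrently, a large document may run into rate limits on the embedding service; batching or limiting concurrency would be a reasonable refinement.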

DTDL-CustomerCare

https://dghx6mmczrmj4mk-web.azurewebsites.net/

CAUTION: If the site is slow, that is because the application runs on serverless infrastructure.