Building the Backbone: Data Ingestion and IVF index creation for DTDL's Autonomous Assistant

By Divakar Kumar. Category: Azure. Series: CustomerCare-Dtdl

2024-06-10

https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i16jf4zx7wd36kdxzh0y.gif

1. About this blog

Welcome back! In this blog, we will learn how to prepare data for our retrieval process. There are various approaches to this, but we’ll use Azure Document Intelligence to extract information from unstructured documents. By the end, you’ll know how to:

Work with the Azure Document Intelligence Layout model 📄
Perform custom chunking and add metadata 📝
Vectorize data into embeddings 🔍

2. Dataaaa

The nature of data can vary greatly and often includes unstructured formats such as PDFs, audio, video, and images. Our first step is to prepare this data so we can retrieve relevant information from it. While we could use a simple parser to get text content from PDFs, not all information will be in plain text. PDFs can contain complex tables, images, and other elements that might confuse LLMs if extracted through a simple parser. Therefore, we need to rely on advanced services that handle this task more effectively.

In our case, since DTDL maintains PDFs for FAQs, we used Azure Document Intelligence to extract information.

3. Architecture

Azure Document Intelligence

Azure AI Document Intelligence is a cloud-based service that automates document processing by extracting structured data from various document types using prebuilt, custom, and document analysis models. It supports diverse use cases such as invoice processing, identity verification, and tax form management, and offers integration through APIs and SDKs.

Step 1: Create Resource

Ah, I’m not going to bore you with the details on how to create it. Please follow this doc that provides all the steps to create this service in the Azure portal.

Step 2: Decide on which model to use

If you have been following my earlier blogs, you know that I chose the Custom model to extract information from documents. However, I haven’t covered in detail when to use each model. So, here is a detailed mindmap for choosing the right model for your needs:

Step 3: Analyze PDF with layout model

Go to Azure Document Intelligence Studio
Select the Layout model
Upload the PDF file you want to analyze
Click the Run analysis button

The Azure Document Intelligence service processes documents and returns the content organized into sections and paragraphs. This allows us to easily create chunks based on these sections and their associated paragraphs. The logic is straightforward: each paragraph is grouped under the most recent section title until a new section title is encountered.

Azure Function - Data Ingestor

Next, we need to implement this logic in our function app. We’ll set up an HTTP trigger for uploading files, and in the handler, we’ll use the Azure Document Intelligence .NET SDK to extract information from our FAQ PDF.

Here’s the code logic for grouping sections by headings:


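        // Groups layout paragraphs into sections: every body paragraph is attached
        // to the most recent section heading that precedes it.
        private static IEnumerable<Section> GroupSectionsByHeading(IEnumerable<DocumentParagraph> paragraphs)
        {
            var sections = new List<Section>();
            Section currentSection = null;

            foreach (var paragraph in paragraphs)
            {
                if (paragraph.Role == "sectionHeading")
                {
                    // A new heading closes the section collected so far.
                    if (currentSection != null)
                    {
                        sections.Add(currentSection);
                    }
                    currentSection = new Section
                    {
                        Heading = paragraph.Content,
                        Paragraphs = new List<string>()
                    };
                }
                // Only paragraphs without a role (plain body text) are kept. Titles,
                // page headers, footers and page numbers are skipped, as is any
                // text that appears before the first heading.
                else if (currentSection != null && paragraph.Role == null)
                {
                    currentSection.Paragraphs.Add(paragraph.Content);
                }
            }

            // Flush the last open section.
            if (currentSection != null)
            {
                sections.Add(currentSection);
            }

            return sections;
        }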
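
To see what this grouping does to a paragraph stream, here is a minimal, self-contained sketch. It assumes stand-in types (`Para`, `Section`, `Grouping`) in place of the SDK's `DocumentParagraph`, and the sample paragraphs are made up for illustration, so treat it as a demonstration of the rule rather than the code that runs in the function app.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Made-up sample input that mimics the Role/Content pairs of layout paragraphs.
var sample = new List<Para>
{
    new Para("title", "Sample FAQ"),                // has a role, not a heading: skipped
    new Para(null, "Text before any heading"),      // no section open yet: skipped
    new Para("sectionHeading", "Billing"),
    new Para(null, "First billing paragraph."),
    new Para("pageNumber", "1"),                    // has a role: skipped
    new Para(null, "Second billing paragraph."),
    new Para("sectionHeading", "Roaming"),
    new Para(null, "Only roaming paragraph."),
};

foreach (var s in Grouping.GroupByHeading(sample))
{
    Console.WriteLine($"{s.Heading}: {s.Paragraphs.Count} paragraph(s)");
}
// Prints:
// Billing: 2 paragraph(s)
// Roaming: 1 paragraph(s)

// Hypothetical stand-in for DocumentParagraph (Role is null for body text).
record Para(string? Role, string Content);

class Section
{
    public string Heading { get; init; } = "";
    public List<string> Paragraphs { get; } = new();
}

static class Grouping
{
    // Same rule as GroupSectionsByHeading: body paragraphs go under the latest heading.
    public static List<Section> GroupByHeading(IEnumerable<Para> paragraphs)
    {
        var sections = new List<Section>();
        Section? current = null;

        foreach (var p in paragraphs)
        {
            if (p.Role == "sectionHeading")
            {
                if (current != null) sections.Add(current);
                current = new Section { Heading = p.Content };
            }
            else if (current != null && p.Role == null)
            {
                current.Paragraphs.Add(p.Content);
            }
        }

        if (current != null) sections.Add(current);
        return sections;
    }
}
```

Note that anything before the first heading is dropped. If your PDFs open with an introduction that matters for retrieval, you may want to collect it under a default heading instead.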

Once we have our sections, we’ll proceed to chunk them, vectorize the chunks, and store the results in Azure Cosmos DB.


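                    // Group the layout paragraphs returned by Document Intelligence into sections.
                    var sections = GroupSectionsByHeading(faqRecognizerTask.Result.Paragraphs);

                    // One embedding request per section. Select is lazy, so the requests
                    // only start once `tasks` is enumerated, e.g. by awaiting
                    // Task.WhenAll(tasks) before the results are stored.
                    var tasks = sections.Select(async section =>
                    {
                        // The section body is its paragraphs joined by newlines.
                        var content = string.Join("\n", section.Paragraphs);
                        var (embeddings, tokens) = await _openAIService.GetEmbeddingsAsync(content);

                        // The heading travels with the vector as metadata.
                        return new SectionEmbedding
                        {
                            Id = Guid.NewGuid().ToString(),
                            Heading = section.Heading,
                            Content = content,
                            EmbeddingVector = embeddings,
                            Tokens = tokens
                        };
                    });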
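
The snippet above embeds each section as a single block. If a section grows beyond what the embedding model accepts, or gets long enough to dilute retrieval, one option is to split it into smaller chunks that each keep the heading as metadata. Here is a minimal sketch of that idea. `Chunk`, `Chunker` and the `maxChars` budget are hypothetical names introduced for illustration (they are not from the repo), and a character count is only a rough stand-in for a real token count.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

// Made-up section body: three short paragraphs under one heading.
var body = new List<string> { "aaaa", "bbbb", "cccc" };

foreach (var c in Chunker.Split("Billing", body, maxChars: 10))
{
    var oneLine = c.Content.Replace("\n", " | ");
    Console.WriteLine($"[{c.Heading} #{c.Index}] {oneLine}");
}
// Prints:
// [Billing #0] aaaa | bbbb
// [Billing #1] cccc

// Hypothetical chunk shape: the heading and position are kept as metadata.
record Chunk(string Heading, int Index, string Content);

static class Chunker
{
    // Packs whole paragraphs into chunks of at most maxChars characters.
    // A single paragraph longer than maxChars is kept intact as its own chunk.
    public static List<Chunk> Split(string heading, IEnumerable<string> paragraphs, int maxChars)
    {
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));

        var chunks = new List<Chunk>();
        var buffer = new StringBuilder();

        void Flush()
        {
            if (buffer.Length == 0) return;
            chunks.Add(new Chunk(heading, chunks.Count, buffer.ToString()));
            buffer.Clear();
        }

        foreach (var paragraph in paragraphs)
        {
            // +1 accounts for the newline that would join this paragraph to the buffer.
            if (buffer.Length > 0 && buffer.Length + 1 + paragraph.Length > maxChars)
            {
                Flush();
            }
            if (buffer.Length > 0) buffer.Append('\n');
            buffer.Append(paragraph);
        }

        Flush();
        return chunks;
    }
}
```

Each `Chunk` could then be embedded and stored the same way a whole section is, with `Heading` and `Index` letting you trace a retrieved chunk back to its place in the FAQ.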

5. Source code

DTDL-CustomerCare

6. Live URL

https://dghx6mmczrmj4mk-web.azurewebsites.net/

CAUTION: If the site is slow to respond, that is due to the serverless infrastructure used for this application.