Turning Archaeological Documents into Searchable Knowledge with Cosmos DB Vector Search

Divakar Kumar included in categories Azure AI and series Archaios

2025-08-27, 1519 words, 8 minutes

https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mw3kozl57jfatk9qd3rv.png

📖 The Problem: Field Reports Are Not Agent-Friendly

Archaeological field reports, historical journals, and excavation notes are invaluable, but they’re locked in PDFs and text documents. When our multi-agent AI team (from Blog 4) analyzes satellite imagery or LiDAR data, it needs historical context:

“Has this site been previously surveyed?”
“What artifacts were found in similar geological formations?”
“What do historical texts say about settlements in this region?”

Traditional keyword search fails here. An agent searching for “Bronze Age burial mounds” might miss a document that says “Early metallurgical period funerary structures.” We need semantic search—understanding meaning, not just matching words.

This is where Cosmos DB Vector Search transforms unstructured documents into queryable archaeological knowledge.

🏗️ The Architecture: From Documents to Discovery

Inspired by Microsoft’s Logic Apps document indexing pattern, I extended the design for archaeological research:

Research Documents (Field Reports, Journals, Historical Contexts)
               ↓
       Document Processing
               ↓
      Chunking (manageable text segments)
               ↓
   OpenAI text-embedding-ada-002 (vectorization)
               ↓
  Azure Cosmos DB with DiskANN Vector Index
               ↓
    AI Agents Query Historical Context via Semantic Search

The key insight: documents become coordinates in semantic space. Similar concepts cluster together, regardless of exact wording.
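
To make "coordinates in semantic space" concrete, here is a minimal sketch of cosine similarity, the measure used later in this post to compare embeddings. The `SemanticSpace` class and the tiny 3-dimensional vectors in the usage note are illustrative assumptions, not Archaios code and not real model output; actual embeddings have 1536 dimensions.

```csharp
using System;

public static class SemanticSpace
{
    // Cosine similarity: +1 = same direction, 0 = unrelated, -1 = opposite.
    public static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same number of dimensions.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0 || normB == 0)
            throw new ArgumentException("Cosine similarity is undefined for a zero vector.");

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With hand-made vectors such as `{0.9f, 0.1f, 0.0f}` for "Bronze Age burial mounds", `{0.8f, 0.2f, 0.1f}` for "Early metallurgical period funerary structures", and `{0.1f, 0.9f, 0.2f}` for an unrelated pottery note, the first pair scores much closer to 1 than the first and third, which is the clustering behavior the pipeline relies on.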

🔢 Step 1: Vectorizing Historical Knowledge

⚙️ The Embedding Service

At the core, we use Azure OpenAI’s text-embedding-ada-002 model to convert text into 1536-dimensional vectors (think of them as coordinates in “meaning space”):

// Program.cs - Dependency Injection Setup
services.AddSingleton<ITextEmbeddingGenerationService, AzureOpenAITextEmbeddingGenerationService>(provider =>
{
    return new AzureOpenAITextEmbeddingGenerationService(
        deploymentName: openaiTextEmbeddingGenerationDeploymentName!,
        endpoint: openaiEndpoint!,
        apiKey: openaiApiKey!
    );
});

Why text-embedding-ada-002?

High semantic fidelity: Captures nuanced meanings in archaeological terminology
Cost-effective: Lower per-token cost than larger embedding models
Proven at scale: Powers semantic search in production RAG systems

💾 Step 2: Storing Vectors in Cosmos DB

📦 Container Design: `archaeologyCorpus`

The Cosmos DB container stores document chunks with their vector embeddings:

{
  "id": "field-report-petra-1985-chunk-3",
  "content": "Excavation revealed pottery fragments consistent with Nabataean period (1st century BCE). Terracotta pieces show evidence of trade routes extending to Mediterranean coastal cities.",
  "vector": [0.0234, -0.0521, 0.1023, ... 1533 more dimensions],
  "sourceDocument": "field-report-petra-1985.pdf",
  "pageNumber": 12
}

Key design decisions:

Chunks, not full documents: Large documents (200+ pages) are split into semantically coherent segments. This ensures:
- Precise retrieval (get the exact paragraph, not entire PDF)
- Better context for AI agents (focused information)
DiskANN vector index: Cosmos DB’s high-performance approximate nearest neighbor (ANN) algorithm
- Low query latency at scale (see Performance Considerations for measured numbers)
- Handles millions of vectors efficiently
- No separate vector database needed
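
For readers who prefer a typed model over raw JSON, the chunk document could be sketched as a C# record like the one below. `CorpusChunk`, `BuildId`, and `ToJson` are hypothetical names introduced here for illustration. The sketch uses System.Text.Json so that it stands alone; in a real client the attributes would need to match whichever serializer the CosmosClient is configured with.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical typed shape for one item in the archaeologyCorpus container.
public sealed record CorpusChunk(
    [property: JsonPropertyName("id")] string Id,
    [property: JsonPropertyName("content")] string Content,
    [property: JsonPropertyName("vector")] float[] Vector,
    [property: JsonPropertyName("sourceDocument")] string SourceDocument,
    [property: JsonPropertyName("pageNumber")] int PageNumber)
{
    // Builds IDs such as "field-report-petra-1985-chunk-3" from the source file name.
    public static string BuildId(string sourceDocument, int chunkIndex) =>
        $"{Path.GetFileNameWithoutExtension(sourceDocument)}-chunk-{chunkIndex}";

    // Serializes the chunk using the JSON property names declared above.
    public string ToJson() => JsonSerializer.Serialize(this);
}
```

Deriving the `id` from the source file name and chunk index keeps IDs stable across re-ingestion, so re-processing a report overwrites its chunks instead of duplicating them. That is a design assumption of this sketch, not something the ingestion code is confirmed to do.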

🔍 Step 3: Vector Search Repository

The CosmosDbVectorRepository encapsulates semantic search:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Configuration;

public class CosmosDbVectorRepository : IVectorRepository
{
    private const string DefaultContainerId = "archaeologyCorpus";
    private const int TopN = 3;

    private readonly CosmosClient _cosmosClient;
    private readonly string _databaseId;
    private readonly QueryRequestOptions _queryOptions;

    public CosmosDbVectorRepository(CosmosClient cosmosClient, IConfiguration configuration)
    {
        _cosmosClient = cosmosClient;
        _databaseId = configuration.GetValue<string>("CosmosDb:DatabaseId")!;
        _queryOptions = new QueryRequestOptions
        {
            MaxItemCount = -1,   // let the service choose the page size
            MaxConcurrency = -1  // let the SDK choose the degree of parallelism
        };
    }

    // `prompt` stays in the signature so callers are unchanged;
    // the query itself only needs the embedding.
    public async Task<List<dynamic>> FetchDetailsFromVectorSemanticLayer(
        ReadOnlyMemory<float> embedding,
        string prompt,
        string containerId = "")
    {
        if (string.IsNullOrWhiteSpace(containerId))
            containerId = DefaultContainerId;

        // Resolve the container per call rather than storing it in a shared field,
        // so concurrent calls with different container IDs cannot interfere.
        Container container = _cosmosClient.GetContainer(_databaseId, containerId);

        var queryDefinition = new QueryDefinition(@"
            SELECT TOP @topN
                c.id, c.content, VectorDistance(c.vector, @embedding) AS similarityScore
            FROM c
            ORDER BY VectorDistance(c.vector, @embedding)")
            .WithParameter("@embedding", embedding.ToArray())
            .WithParameter("@topN", TopN);

        var results = new List<dynamic>();
        using FeedIterator<dynamic> iterator =
            container.GetItemQueryIterator<dynamic>(queryDefinition, requestOptions: _queryOptions);

        // Drain every page of the result set.
        while (iterator.HasMoreResults)
        {
            FeedResponse<dynamic> page = await iterator.ReadNextAsync();
            results.AddRange(page);
        }

        return results;
    }
}

What’s happening:

VectorDistance(c.vector, @embedding): Cosmos DB’s native vector similarity function
Top N retrieval: Get the 3 most semantically similar document chunks
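Similarity score: With the cosine distance function, VectorDistance returns cosine similarity in the range -1 to +1, so higher scores mean closer matches; ORDER BY VectorDistance returns the most similar chunks first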
Dynamic results: Flexible schema for different document types

🤖 Step 4: AI Agent Integration - The Knowledge Plugin

AI agents query historical knowledge through a Semantic Kernel plugin:

using System;
using System.ComponentModel;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;
using Newtonsoft.Json;

public class ChatVectorSearchPlugin
{
    private readonly IServiceProvider _serviceProvider;
    private readonly ILogger<ChatVectorSearchPlugin> _logger;

    public ChatVectorSearchPlugin(IServiceProvider serviceProvider)
    {
        _serviceProvider = serviceProvider;
        _logger = serviceProvider.GetRequiredService<ILogger<ChatVectorSearchPlugin>>();
    }

    [KernelFunction("SimilaritySearchAsync")]
    [Description("Search for similarities in archaeologyCorpus based on the user query")]
    public async Task<string> SimilaritySearchAsync(
        [Description("Query prompt")] string prompt)
    {
        try
        {
            // Message templates (instead of string interpolation) keep the logs structured.
            _logger.LogInformation("Executing SimilaritySearchAsync with prompt: {Prompt}", prompt);

            // Step 1: Get the vector repository
            var cosmosService = _serviceProvider.GetRequiredService<IVectorRepository>();

            // Step 2: Convert user query to embedding
            var embeddingService = _serviceProvider.GetRequiredService<ITextEmbeddingGenerationService>();
            ReadOnlyMemory<float> embeddingQuery = await embeddingService.GenerateEmbeddingAsync(prompt);

            _logger.LogInformation("Generated embedding for query: {Prompt}", prompt);

            // Step 3: Search for similar vectors in Cosmos DB
            var response = await cosmosService.FetchDetailsFromVectorSemanticLayer(
                embeddingQuery, prompt);

            _logger.LogInformation("Retrieved {Count} results from vector search.", response.Count);

            if (response.Count == 0)
            {
                _logger.LogWarning("No results found for the given query.");
                return "No information found!";
            }

            // Return the matches as JSON so the agent can read them as context.
            var serializedResponse = JsonConvert.SerializeObject(response, Formatting.Indented);
            _logger.LogInformation("Serialized response: {Response}", serializedResponse);

            return serializedResponse;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "An error occurred while executing SimilaritySearchAsync.");
            return "Error retrieving information";
        }
    }
}

The RAG Pattern in Action:

User query: “What artifacts were found near water sources in Jordan?”
Embedding generation: Query → 1536-dimensional vector
Vector search: Find top 3 similar document chunks in archaeologyCorpus
Context retrieval: Return relevant field report excerpts
Agent reasoning: AI agent combines retrieved knowledge with satellite/LiDAR analysis

💡 Why This Matters: Knowledge Multiplies Agent Intelligence

❌ Before Vector Search:

Agent: "I detect a rectangular anomaly at coordinates (35.4795, 31.8564)."
User: "What is it?"
Agent: "Unknown structure. Could be modern, ancient, or geological."

✅ After Vector Search (Context-Aware Analysis):

Agent queries: "rectangular structures near Petra Jordan ancient"

Vector Search returns:
- Field Report 1985: "Nabataean water reservoir, rectangular design, 15m x 8m"
- Journal 2003: "Cisterns along trade routes had standardized dimensions"
- Historical Text: "Roman-period water management in Petra utilized carved rock chambers"

Agent: "High confidence this is a Nabataean-era water cistern. Similar structures 
documented 400m north (1985 excavation). Dimensions match known cistern patterns 
from Roman-Nabataean period (1st century BCE - 2nd century CE)."

The difference: Vector search transformed speculation into evidence-based analysis.

⚙️ Configuration & Deployment

🔧 Required Configuration

{
  "OpenAI": {
    "Endpoint": "https://your-openai-resource.openai.azure.com/",
    "ApiKey": "your-api-key",
    "TextEmbeddingGenerationDeploymentName": "text-embedding-ada-002"
  },
  "CosmosDb": {
    "ConnectionString": "AccountEndpoint=https://...",
    "DatabaseId": "archaios",
    "ArchaeologyCorpusContainerId": "archaeologyCorpus"
  }
}

📊 Creating the Vector Index (Cosmos DB Portal or SDK)

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "vectorIndexes": [
    {
      "path": "/vector",
      "type": "diskANN"
    }
  ]
}

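DiskANN parameters (the index type is set in the indexing policy above; dimensions and the distance function are set in the container’s vector embedding policy):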

Type: diskANN (Cosmos DB’s high-performance ANN algorithm)
Dimensions: 1536 (matching text-embedding-ada-002 output)
Similarity metric: Cosine distance (default)

⚡ Performance Considerations

⏱️ Query Latency

P99 latency: ~50-100ms for vector search (3-5 results)
Embedding generation: ~200-300ms (OpenAI API call)
Total RAG cycle: ~300-500ms (acceptable for real-time agent interactions)

📈 Scaling

DiskANN shines at scale: Handles millions of vectors with minimal latency increase
Cosmos DB auto-scaling: RU/s adjusts based on query load
Partition strategy: Use sourceDocument or site as the partition key so queries filtered to one document or site stay within a single logical partition

💰 Cost Optimization

Embedding caching: Store embeddings once, query thousands of times
Batch processing: Vectorize documents offline during ingestion
Shared throughput: Use database-level provisioned throughput for predictable costs
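
Document embeddings are already stored once in Cosmos DB. The same idea could be extended to query embeddings, as in the sketch below, so that repeated agent questions skip the OpenAI call. This is an extension of the caching point above, not part of the Archaios code. `EmbeddingCache` is a hypothetical class, and the embedding call is injected as a delegate so the sketch does not depend on any SDK.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public sealed class EmbeddingCache
{
    private readonly ConcurrentDictionary<string, Task<ReadOnlyMemory<float>>> _cache = new();
    private readonly Func<string, Task<ReadOnlyMemory<float>>> _embed;

    public EmbeddingCache(Func<string, Task<ReadOnlyMemory<float>>> embed) => _embed = embed;

    public Task<ReadOnlyMemory<float>> GetOrCreateAsync(string text)
    {
        // Hash the trimmed text so the cache key has a fixed size.
        string key = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(text.Trim())));

        // Caveats: under contention the factory may run more than once,
        // and a failed task stays cached until the process restarts.
        return _cache.GetOrAdd(key, _ => _embed(text));
    }
}
```

In the plugin, the delegate could wrap `embeddingService.GenerateEmbeddingAsync(prompt)`. An in-memory dictionary is the simplest option; whether a bounded or distributed cache is worth the extra complexity depends on how often queries actually repeat.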

📝 Lessons Learned

1️⃣ Chunk Size Matters

Too small (50 words): Loses context
Too large (2000 words): Dilutes semantic signal
Sweet spot: 200-500 words per chunk (1-2 paragraphs)
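
A minimal sketch of chunking within that range, using a fixed-size word window. `TextChunker`, the default of 300 words, and the 30-word overlap are assumptions for illustration. The overlap is not something described above, and a word window is a simplification of the semantically coherent segmentation mentioned earlier.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Splits text into chunks of at most maxWords words, where each chunk
    // repeats the last overlapWords words of the previous one.
    public static List<string> ChunkByWords(string text, int maxWords = 300, int overlapWords = 30)
    {
        if (maxWords <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxWords));
        if (overlapWords < 0 || overlapWords >= maxWords)
            throw new ArgumentOutOfRangeException(nameof(overlapWords));

        // A null separator splits on any whitespace.
        string[] words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        int step = maxWords - overlapWords;

        for (int start = 0; start < words.Length; start += step)
        {
            int count = Math.Min(maxWords, words.Length - start);
            chunks.Add(string.Join(" ", words, start, count));

            // Stop once this chunk has reached the end of the text.
            if (start + count >= words.Length)
                break;
        }

        return chunks;
    }
}
```

A 700-word report with these defaults yields three chunks: words 0-299, 270-569, and 540-699. The overlap is there so a sentence that straddles a boundary still appears whole in at least one chunk.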

2️⃣ Metadata Enrichment

Include metadata in embeddings:

Content to embed: "[Source: Field Report 1985, Petra, Jordan, Page 12] 
Excavation revealed pottery fragments consistent with Nabataean period..."

This helps the model understand document provenance during embedding generation.

3️⃣ Observability is Critical

Log:

Embedding generation time
Vector search latency
Similarity scores (how confident is the match?)
User query → retrieved documents (for quality tuning)
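
One lightweight way to capture the two timings in that list is a small wrapper around each async stage. This is a sketch only; `StageTimer` is a hypothetical helper, not part of the Archaios codebase.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class StageTimer
{
    // Runs an async stage and returns its result together with the elapsed time.
    public static async Task<(T Result, TimeSpan Elapsed)> MeasureAsync<T>(Func<Task<T>> stage)
    {
        var stopwatch = Stopwatch.StartNew();
        T result = await stage();
        stopwatch.Stop();
        return (result, stopwatch.Elapsed);
    }
}
```

Inside the plugin this might be used as `var (embedding, elapsed) = await StageTimer.MeasureAsync(() => embeddingService.GenerateEmbeddingAsync(prompt));`, followed by a structured log entry such as `_logger.LogInformation("Embedding took {ElapsedMs} ms", elapsed.TotalMilliseconds);`. The same wrapper can time the vector search call.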

🌍 The Bigger Picture: RAG for Archaeological Discovery

This isn’t just about searching documents—it’s about institutional memory. Every excavation, every field report, every expert observation becomes queryable knowledge that:

Prevents redundant work: “Has this site been surveyed before?”
Surfaces forgotten insights: A 1980s journal entry mentions pottery—relevant to today’s AI analysis
Democratizes expertise: Junior archaeologists access decades of senior knowledge instantly
Scales human intelligence: AI agents synthesize patterns across 100,000+ documents (impossible manually)

Vector search transforms data graveyards (PDFs in folders) into a living, searchable knowledge base.

📚 Next in the Series

Blog 7: [Scalable Container-Based Processing Architecture] - How Azure Container Apps orchestrate LiDAR and satellite processing at scale, with event-driven auto-scaling based on upload queues.

💻 Source Code

Vector Repository: Archaios.AI.Infrastructure/Repositories/Cosmos/CosmosDbVectorRepository.cs
Search Plugin: Archaios.AI.DurableHandler/Agents/Chat/Plugins/ChatVectorSearchPlugin.cs
Configuration: Archaios.AI.DurableHandler/Program.cs (lines 113-118)

The repository, plugin, and dependency-injection snippets in this post come from the Archaios production codebase; they are not fictional implementations.

Vector search isn’t about replacing archaeologists—it’s about giving them superhuman recall of every relevant document ever written about a site. That’s the power of semantic knowledge retrieval.