Turning Archaeological Documents into Searchable Knowledge with Cosmos DB Vector Search

Archaeological field reports, historical journals, and excavation notes are invaluable—but they’re locked in PDFs and text documents. When our multi-agent AI team (from Blog 4) analyzes satellite imagery or LiDAR data, they need historical context:

  • “Has this site been previously surveyed?”
  • “What artifacts were found in similar geological formations?”
  • “What do historical texts say about settlements in this region?”

Traditional keyword search fails here. An agent searching for “Bronze Age burial mounds” might miss a document that says “Early metallurgical period funerary structures.” We need semantic search—understanding meaning, not just matching words.

This is where Cosmos DB Vector Search transforms unstructured documents into queryable archaeological knowledge.


Inspired by Microsoft’s Logic Apps document indexing pattern, I extended the design for archaeological research:

Research Documents (Field Reports, Journals, Historical Contexts)
        ↓
Document Processing
        ↓
Chunking (manageable text segments)
        ↓
OpenAI text-embedding-ada-002 (vectorization)
        ↓
Azure Cosmos DB with DiskANN Vector Index
        ↓
AI Agents Query Historical Context via Semantic Search

The key insight: documents become coordinates in semantic space. Similar concepts cluster together, regardless of exact wording.


At the core, we use Azure OpenAI’s text-embedding-ada-002 model to convert text into 1536-dimensional vectors (think of them as coordinates in “meaning space”):

// Program.cs - Dependency Injection Setup
services.AddSingleton<ITextEmbeddingGenerationService, AzureOpenAITextEmbeddingGenerationService>(provider =>
{
    return new AzureOpenAITextEmbeddingGenerationService(
        deploymentName: openaiTextEmbeddingGenerationDeploymentName!,
        endpoint: openaiEndpoint!,
        apiKey: openaiApiKey!
    );
});
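
Once registered, any component can resolve ITextEmbeddingGenerationService and turn free text into a vector. A minimal usage sketch, assuming serviceProvider is the built IServiceProvider and the query string is purely illustrative:

// Resolve the registered embedding service and vectorize a query (illustrative)
var embeddingService = serviceProvider.GetRequiredService<ITextEmbeddingGenerationService>();
ReadOnlyMemory<float> vector = await embeddingService.GenerateEmbeddingAsync(
    "Bronze Age burial mounds near water sources");
// vector.Length == 1536 for text-embedding-ada-002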

Why text-embedding-ada-002?

  • High semantic fidelity: Captures nuanced meanings in archaeological terminology
  • Cost-effective: Compared to larger embedding models
  • Proven at scale: Powers semantic search in production RAG systems

The Cosmos DB container stores document chunks with their vector embeddings:

{
  "id": "field-report-petra-1985-chunk-3",
  "content": "Excavation revealed pottery fragments consistent with Nabataean period (1st century BCE). Terracotta pieces show evidence of trade routes extending to Mediterranean coastal cities.",
  "vector": [0.0234, -0.0521, 0.1023, ... 1533 more dimensions],
  "sourceDocument": "field-report-petra-1985.pdf",
  "pageNumber": 12
}
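
The ingestion side isn't shown in this post, but each chunk can be modeled as a small document type, embedded, and upserted. A minimal sketch, assuming /sourceDocument is the partition key (one of the options suggested later in this post); the DocumentChunk type and IngestChunkAsync helper are illustrative, not taken from the Archaios codebase:

// Illustrative chunk model mirroring the JSON shape above (hypothetical type)
public record DocumentChunk(
    string id,
    string content,
    float[] vector,
    string sourceDocument,
    int pageNumber);

// Embed a chunk and upsert it, assuming /sourceDocument is the container's partition key
public static async Task IngestChunkAsync(
    Container container,
    ITextEmbeddingGenerationService embeddingService,
    string id, string content, string sourceDocument, int pageNumber)
{
    ReadOnlyMemory<float> embedding = await embeddingService.GenerateEmbeddingAsync(content);
    var chunk = new DocumentChunk(id, content, embedding.ToArray(), sourceDocument, pageNumber);
    await container.UpsertItemAsync(chunk, new PartitionKey(sourceDocument));
}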

Key design decisions:

  1. Chunks, not full documents: Large documents (200+ pages) are split into semantically coherent segments. This ensures:
    • Precise retrieval (get the exact paragraph, not the entire PDF)
    • Better context for AI agents (focused information)
  2. DiskANN vector index: Cosmos DB’s high-performance approximate nearest neighbor (ANN) index
    • Low query latency, even at scale
    • Handles millions of vectors efficiently
    • No separate vector database needed

The CosmosDbVectorRepository encapsulates semantic search:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Configuration;

public class CosmosDbVectorRepository : IVectorRepository
{
    private readonly CosmosClient _cosmosClient;
    private readonly string databaseId;
    private readonly QueryRequestOptions _queryOptions;

    public CosmosDbVectorRepository(CosmosClient cosmosClient, IConfiguration configuration)
    {
        _cosmosClient = cosmosClient;
        databaseId = configuration.GetValue<string>("CosmosDb:DatabaseId")!;
        _queryOptions = new QueryRequestOptions
        {
            MaxItemCount = -1,
            MaxConcurrency = -1
        };
    }

    public async Task<List<dynamic>> FetchDetailsFromVectorSemanticLayer(
        ReadOnlyMemory<float> embedding,
        string prompt,
        string containerId = "")
    {
        if (string.IsNullOrWhiteSpace(containerId))
            containerId = "archaeologyCorpus";

        var container = _cosmosClient.GetContainer(databaseId, containerId);

        // Rank chunks by vector similarity to the query embedding and keep the top N.
        // The prompt parameter is kept for interface compatibility; the search itself is purely vector-based.
        var queryDefinition = new QueryDefinition(@"
            SELECT TOP @topN
                c.id, c.content, VectorDistance(c.vector, @embedding) AS similarityScore
            FROM c
            ORDER BY VectorDistance(c.vector, @embedding)
        ");

        queryDefinition.WithParameter("@embedding", embedding.ToArray());
        queryDefinition.WithParameter("@topN", 3);

        var results = new List<dynamic>();
        using (var resultSetIterator = container.GetItemQueryIterator<dynamic>(
            queryDefinition, requestOptions: _queryOptions))
        {
            while (resultSetIterator.HasMoreResults)
            {
                var response = await resultSetIterator.ReadNextAsync();
                results.AddRange(response);
            }
        }

        return results;
    }
}

What’s happening:

  • VectorDistance(c.vector, @embedding): Cosmos DB’s native vector similarity function
  • Top N retrieval: Get the 3 most semantically similar document chunks
  • Similarity score: Lower scores = higher similarity (cosine distance)
  • Dynamic results: Flexible schema for different document types

AI agents query historical knowledge through a Semantic Kernel plugin:

using System;
using System.ComponentModel;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;
using Newtonsoft.Json;

public class ChatVectorSearchPlugin
{
    private readonly IServiceProvider _serviceProvider;
    private readonly ILogger<ChatVectorSearchPlugin> _logger;

    public ChatVectorSearchPlugin(IServiceProvider serviceProvider)
    {
        _serviceProvider = serviceProvider;
        _logger = serviceProvider.GetRequiredService<ILogger<ChatVectorSearchPlugin>>();
    }

    [KernelFunction("SimilaritySearchAsync")]
    [Description("Search for similarities in archaeologyCorpus based on the user query")]
    public async Task<string> SimilaritySearchAsync(
       [Description("Query prompt")]
        string prompt
    )
    {
        try
        {
            _logger.LogInformation($"Executing SimilaritySearchAsync with prompt: {prompt}");

            // Step 1: Get the vector repository
            var cosmosService = _serviceProvider.GetRequiredService<IVectorRepository>();

            // Step 2: Convert user query to embedding
            var embeddingService = _serviceProvider.GetRequiredService<ITextEmbeddingGenerationService>();
            var embeddingQuery = await embeddingService.GenerateEmbeddingAsync(prompt);
            var embeddingQueryArray = embeddingQuery.ToArray();

            _logger.LogInformation($"Generated embedding for query: {prompt}");

            // Step 3: Search for similar vectors in Cosmos DB
            var response = await cosmosService.FetchDetailsFromVectorSemanticLayer(
                embeddingQueryArray, prompt);

            _logger.LogInformation($"Retrieved {response.Count} results from vector search.");

            if (response.Count == 0)
            {
                _logger.LogWarning("No results found for the given query.");
                return "No information found!";
            }

            var serializedResponse = JsonConvert.SerializeObject(response, Formatting.Indented);
            _logger.LogInformation($"Serialized response: {serializedResponse}");

            return serializedResponse;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "An error occurred while executing SimilaritySearchAsync.");
            return "Error retrieving information";
        }
    }
}

The RAG Pattern in Action:

  1. User query: “What artifacts were found near water sources in Jordan?”
  2. Embedding generation: Query → 1536-dimensional vector
  3. Vector search: Find top 3 similar document chunks in archaeologyCorpus
  4. Context retrieval: Return relevant field report excerpts
  5. Agent reasoning: AI agent combines retrieved knowledge with satellite/LiDAR analysis
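
The kernel wiring isn't shown above. Here is a minimal sketch of how such a plugin is typically registered with Semantic Kernel so the model can invoke it automatically; the deployment name, endpoint/key variables, serviceProvider, and plugin name are placeholders, not the Archaios multi-agent orchestration from Blog 4:

// Minimal sketch: register the plugin and let the model call it via auto function calling
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion("gpt-4o", openaiEndpoint!, openaiApiKey!); // placeholder deployment
var kernel = builder.Build();

kernel.Plugins.AddFromObject(new ChatVectorSearchPlugin(serviceProvider), "ChatVectorSearch");

var settings = new OpenAIPromptExecutionSettings
{
    ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
};

var answer = await kernel.InvokePromptAsync(
    "What artifacts were found near water sources in Jordan?",
    new KernelArguments(settings));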

Without vector search:
Agent: "I detect a rectangular anomaly at coordinates (35.4795, 31.8564)."
User: "What is it?"
Agent: "Unknown structure. Could be modern, ancient, or geological."
With vector search:
Agent queries: "rectangular structures near Petra Jordan ancient"

Vector Search returns:
- Field Report 1985: "Nabataean water reservoir, rectangular design, 15m x 8m"
- Journal 2003: "Cisterns along trade routes had standardized dimensions"
- Historical Text: "Roman-period water management in Petra utilized carved rock chambers"

Agent: "High confidence this is a Nabataean-era water cistern. Similar structures 
documented 400m north (1985 excavation). Dimensions match known cistern patterns 
from Roman-Nabataean period (1st century BCE - 2nd century CE)."

The difference: Vector search transformed speculation into evidence-based analysis.


The application configuration wires up Azure OpenAI and Cosmos DB:
{
  "OpenAI": {
    "Endpoint": "https://your-openai-resource.openai.azure.com/",
    "ApiKey": "your-api-key",
    "TextEmbeddingGenerationDeploymentName": "text-embedding-ada-002"
  },
  "CosmosDb": {
    "ConnectionString": "AccountEndpoint=https://...",
    "DatabaseId": "archaios",
    "ArchaeologyCorpusContainerId": "archaeologyCorpus"
  }
}
The archaeologyCorpus container's indexing policy declares a DiskANN vector index on the /vector path:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "vectorIndexes": [
    {
      "path": "/vector",
      "type": "diskANN"
    }
  ]
}

DiskANN parameters:

  • Type: diskANN (Cosmos DB’s high-performance ANN algorithm)
  • Dimensions: 1536 (matching text-embedding-ada-002 output)
  • Similarity metric: Cosine distance (default)
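
The JSON above only declares the index; in Cosmos DB the dimensions and distance function live in the container's vector embedding policy. A minimal provisioning sketch with the .NET SDK, assuming a recent Microsoft.Azure.Cosmos version with vector support and /sourceDocument as the partition key (illustrative, not the Archaios provisioning code):

using System.Collections.ObjectModel;
using Microsoft.Azure.Cosmos;

// Illustrative container creation: vector embedding policy + DiskANN index
var containerProperties = new ContainerProperties("archaeologyCorpus", "/sourceDocument")
{
    VectorEmbeddingPolicy = new VectorEmbeddingPolicy(new Collection<Embedding>
    {
        new Embedding
        {
            Path = "/vector",
            DataType = VectorDataType.Float32,
            DistanceFunction = DistanceFunction.Cosine,
            Dimensions = 1536
        }
    })
};
containerProperties.IndexingPolicy.VectorIndexes.Add(new VectorIndexPath
{
    Path = "/vector",
    Type = VectorIndexType.DiskANN
});

var database = cosmosClient.GetDatabase("archaios");
await database.CreateContainerIfNotExistsAsync(containerProperties);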

Performance characteristics:

  • P99 latency: ~50-100ms for vector search (3-5 results)
  • Embedding generation: ~200-300ms (OpenAI API call)
  • Total RAG cycle: ~300-500ms (acceptable for real-time agent interactions)
  • DiskANN shines at scale: Handles millions of vectors with minimal latency increase

Scaling and cost considerations:

  • Cosmos DB auto-scaling: RU/s adjusts based on query load
  • Partition strategy: Use sourceDocument or site as partition key for geo-distributed queries
  • Embedding caching: Store embeddings once, query thousands of times
  • Batch processing: Vectorize documents offline during ingestion
  • Shared throughput: Use database-level provisioned throughput for predictable costs

Chunk size matters:

  • Too small (50 words): Loses context
  • Too large (2000 words): Dilutes semantic signal
  • Sweet spot: 200-500 words per chunk (1-2 paragraphs)
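
A minimal word-count chunker along those lines (a sketch; the actual Archaios ingestion pipeline isn't shown in this post):

using System;
using System.Collections.Generic;

// Split a document into ~300-word chunks on paragraph boundaries (illustrative sketch)
public static IEnumerable<string> ChunkByWords(string text, int targetWords = 300)
{
    var paragraphs = text.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
    var current = new List<string>();
    var wordCount = 0;

    foreach (var paragraph in paragraphs)
    {
        var words = paragraph.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
        if (wordCount + words > targetWords && current.Count > 0)
        {
            yield return string.Join("\n\n", current);
            current.Clear();
            wordCount = 0;
        }
        current.Add(paragraph);
        wordCount += words;
    }

    if (current.Count > 0)
        yield return string.Join("\n\n", current);
}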

Include metadata in embeddings:

Content to embed: "[Source: Field Report 1985, Petra, Jordan, Page 12] 
Excavation revealed pottery fragments consistent with Nabataean period..."

This helps the model understand document provenance during embedding generation.
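
A small helper that prepends that provenance header before a chunk is embedded (illustrative; the field names mirror the chunk document shown earlier, and BuildEmbeddingText is a hypothetical helper):

// Prepend provenance metadata so it is captured in the embedding (illustrative sketch)
public static string BuildEmbeddingText(
    string content, string sourceDocument, string site, int pageNumber)
{
    return $"[Source: {sourceDocument}, {site}, Page {pageNumber}] {content}";
}

// Usage: embed the metadata-prefixed text, but store the original content for display
// var textToEmbed = BuildEmbeddingText(chunk.content, "Field Report 1985", "Petra, Jordan", 12);
// var embedding = await embeddingService.GenerateEmbeddingAsync(textToEmbed);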

Log:

  • Embedding generation time
  • Vector search latency
  • Similarity scores (how confident is the match?)
  • User query → retrieved documents (for quality tuning)
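
For example, the two expensive steps can be timed with a Stopwatch inside the plugin's SimilaritySearchAsync method (a sketch of the measurement, not the production telemetry setup):

// Measure embedding generation and vector search latency (illustrative sketch)
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var embedding = await embeddingService.GenerateEmbeddingAsync(prompt);
_logger.LogInformation("Embedding generated in {ElapsedMs} ms", stopwatch.ElapsedMilliseconds);

stopwatch.Restart();
var matches = await cosmosService.FetchDetailsFromVectorSemanticLayer(embedding, prompt);
_logger.LogInformation("Vector search returned {Count} chunks in {ElapsedMs} ms",
    matches.Count, stopwatch.ElapsedMilliseconds);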

This isn’t just about searching documents—it’s about institutional memory. Every excavation, every field report, every expert observation becomes queryable knowledge that:

  1. Prevents redundant work: “Has this site been surveyed before?”
  2. Surfaces forgotten insights: A 1980s journal entry mentions pottery—relevant to today’s AI analysis
  3. Democratizes expertise: Junior archaeologists access decades of senior knowledge instantly
  4. Scales human intelligence: AI agents synthesize patterns across 100,000+ documents (impossible manually)

Vector search transforms data graveyards (PDFs in folders) into living knowledge graphs.


Blog 7: [Scalable Container-Based Processing Architecture] - How Azure Container Apps orchestrate LiDAR and satellite processing at scale, with event-driven auto-scaling based on upload queues.


  • Vector Repository: Archaios.AI.Infrastructure/Repositories/Cosmos/CosmosDbVectorRepository.cs
  • Search Plugin: Archaios.AI.DurableHandler/Agents/Chat/Plugins/ChatVectorSearchPlugin.cs
  • Configuration: Archaios.AI.DurableHandler/Program.cs (lines 113-118)

All code examples in this blog are from the actual Archaios production codebase—no fictional implementations.


Vector search isn’t about replacing archaeologists—it’s about giving them superhuman recall of every relevant document ever written about a site. That’s the power of semantic knowledge retrieval.