I built Cuata (Spanish for “buddy”) to solve a simple problem: What if you need to step away from your computer during a meeting for just a minute? Someone rings the doorbell, you get an urgent call, or you need to grab coffee. You don’t want to miss what’s being discussed, but asking teammates to catch you up later feels awkward.
Cuata is your digital twin buddy that steps in for you. It watches your screen, listens to discussions, reads slides, clicks around when needed, and when you come back, gives you a quick summary of what you missed, along with screenshots of key slides.
But it evolved beyond meetings. The same technology lets Cuata browse websites, read articles, search for information, summarize content, and even write summaries directly into Microsoft Word, all while acting like a human operating your computer.
This architecture diagram shows how everything fits together. At the center is Semantic Kernel acting as the brain, coordinating plugins that let Cuata see, think, and act on your computer just like you would.
Most automation tools follow rigid scripts: “Click here, type that, scroll down.” They break when the UI changes or when context matters. Cuata doesn’t work that way.
Cuata follows an iterative decision-making loop:
- Think: Semantic Kernel analyzes the current screen state and the task goal
- Select Strategy: It picks the right plugin(s) to use (Mouse, Keyboard, Chrome, Locate, Screenshot)
- Execute: It performs the action and validates the result
- Repeat: If the task isn’t complete, go back to Think
This is what makes it a Computer-Using Agent, not just a script runner. It adapts, validates, and corrects itself.
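Before diving into the plugins, here is a minimal sketch of what such a loop can look like in C#. To be clear, this is an illustration, not Cuata's actual loop: the method name, the completion check, and the prompt wording are my assumptions.

```csharp
// Hypothetical sketch of the Think -> Select Strategy -> Execute -> Repeat loop.
// 'RunAgentLoopAsync', 'TASK COMPLETE', and the prompts are illustrative assumptions.
public async Task RunAgentLoopAsync(Kernel kernel, string taskGoal)
{
    var chat = kernel.GetRequiredService<IChatCompletionService>();
    var settings = new OpenAIPromptExecutionSettings
    {
        // Let the model pick and invoke plugins on its own
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
    };
    var history = new ChatHistory();
    history.AddUserMessage(taskGoal);

    while (true)
    {
        // Think, Select Strategy, and Execute all happen inside this call:
        // the model reasons over the history and auto-invokes plugin functions.
        var reply = await chat.GetChatMessageContentAsync(history, settings, kernel);
        history.Add(reply);

        // Repeat until the model reports the task as complete.
        if (reply.Content?.Contains("TASK COMPLETE", StringComparison.OrdinalIgnoreCase) == true)
            break;
        history.AddUserMessage("If the task is not complete, continue. Otherwise reply TASK COMPLETE.");
    }
}
```

Inside Cuata, the Think step boils down to a single call into Semantic Kernel's chat completion service: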
```csharp
public async Task<ChatMessageContent> GetChatMessageContentAsync(
    Kernel kernel,
    string prompt,
    OpenAIPromptExecutionSettings? promptExecutionSettings)
{
    // Configure AI with tool-calling behavior,
    // falling back to defaults when the caller passes no settings
    OpenAIPromptExecutionSettings settings = promptExecutionSettings ?? new()
    {
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions,
        TopP = 1,
        Temperature = 0.7
    };

    // Let Semantic Kernel orchestrate plugin calls
    return await _chatCompletionService.GetChatMessageContentAsync(
        prompt,
        executionSettings: settings,
        kernel: kernel
    );
}
```
The ToolCallBehavior.AutoInvokeKernelFunctions setting tells Semantic Kernel: “You can call any plugin function you need to complete this task.” The AI decides which plugins to invoke based on the context.
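For AutoInvokeKernelFunctions to have anything to invoke, the plugins have to be registered on the Kernel first. Here is a minimal sketch of that wiring; the endpoint, deployment name, and API key are placeholders, and the plugin class names follow the ones used throughout this post:

```csharp
// Sketch: registering Cuata-style plugins on a Semantic Kernel instance.
// Endpoint, deployment name, and key values are placeholders.
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4-turbo",                      // placeholder deployment
    endpoint: "https://<resource>.openai.azure.com/",   // placeholder endpoint
    apiKey: "<api-key>");

// Each plugin's [KernelFunction] methods become tools the model can call.
builder.Plugins.AddFromType<MousePlugin>();
builder.Plugins.AddFromType<KeyboardPlugin>();
builder.Plugins.AddFromType<ChromePlugin>();
builder.Plugins.AddFromType<ScreenshotPlugin>();
builder.Plugins.AddFromType<LocatePlugin>();

Kernel kernel = builder.Build();
```

With the plugins registered, every [KernelFunction] shown below becomes a tool the model can choose to call.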
3. The Plugin System: Cuata’s Hands and Eyes
Cuata has 10 plugins that let it interact with your computer. The most important ones are covered below.
The Mouse Plugin lets Cuata move the cursor, click, scroll, and drag, just like you would.
```csharp
[KernelFunction, Description("Moves the mouse to the specified screen coordinates.")]
public void MoveMouse(
    [Description("The X coordinate.")] int x,
    [Description("The Y coordinate.")] int y,
    [Description("The screen width.")] int screenWidth,
    [Description("The screen height.")] int screenHeight)
{
    Console.WriteLine($"🖱️ Moving the mouse to coordinates ({x}, {y})");

    // Convert pixels to the 0-65535 absolute range InputSimulator expects
    double absX = x * 65535 / screenWidth;
    double absY = y * 65535 / screenHeight;
    _inputSimulator.Mouse.MoveMouseTo(absX, absY);

    Console.WriteLine($"🖱️ Moved to ({x}, {y})");
}

[KernelFunction, Description("Performs a left mouse click.")]
public void LeftClick()
{
    _inputSimulator.Mouse.LeftButtonClick();
    Console.WriteLine("🖱️ Left mouse button clicked! ✅");
}

[KernelFunction, Description("Scrolls the mouse wheel.")]
public void Scroll(
    [Description("Positive to scroll up, negative to scroll down.")] int scrollAmount)
{
    _inputSimulator.Mouse.VerticalScroll(scrollAmount);
    Console.WriteLine($"🖱️ Scrolled {scrollAmount} clicks!");
}
```
Each function has a Description attribute that tells Semantic Kernel when to call it. The AI reads these descriptions and decides: “I need to click a button, so I’ll call MoveMouse and then LeftClick.”
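One capability mentioned above but not shown is dragging. It could be composed from the same InputSimulator primitives, roughly like this (a sketch, not Cuata's actual code; the coordinate conversion mirrors MoveMouse):

```csharp
// Sketch: drag composed from press, move, release.
// Assumes the same 0-65535 coordinate conversion as MoveMouse above.
[KernelFunction, Description("Drags the mouse from one point to another.")]
public void Drag(int fromX, int fromY, int toX, int toY, int screenWidth, int screenHeight)
{
    _inputSimulator.Mouse.MoveMouseTo(fromX * 65535.0 / screenWidth, fromY * 65535.0 / screenHeight);
    _inputSimulator.Mouse.LeftButtonDown();    // press and hold
    _inputSimulator.Mouse.MoveMouseTo(toX * 65535.0 / screenWidth, toY * 65535.0 / screenHeight);
    _inputSimulator.Mouse.LeftButtonUp();      // release to drop
    Console.WriteLine($"🖱️ Dragged from ({fromX}, {fromY}) to ({toX}, {toY})");
}
```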
The Keyboard Plugin types text and presses keyboard shortcuts.
```csharp
[KernelFunction, Description("Types the given text into the currently focused field.")]
public void TypeText([Description("The text to type.")] string text)
{
    _inputSimulator.Keyboard.TextEntry(text);
    Console.WriteLine($"⌨️ Typed: {text}");
}

[KernelFunction, Description("Presses Enter key.")]
public void PressEnter()
{
    _inputSimulator.Keyboard.KeyPress(VirtualKeyCode.RETURN);
    Console.WriteLine("⌨️ Pressed Enter!");
}

[KernelFunction, Description("Selects all text using Ctrl+A.")]
public void SelectAll()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_A);
    Console.WriteLine("⌨️ Selected all text!");
}
```
Semantic Kernel uses these to fill forms, search for content, or navigate with shortcuts.
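The cross-application workflows described later (copy from Chrome, paste into Word) lean on this same pattern. Copy and paste would be two more functions in the same style; this is a sketch, not code from Cuata's source:

```csharp
// Sketch: clipboard shortcuts in the same style as the functions above.
[KernelFunction, Description("Copies the current selection using Ctrl+C.")]
public void Copy()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_C);
    Console.WriteLine("⌨️ Copied selection!");
}

[KernelFunction, Description("Pastes clipboard contents using Ctrl+V.")]
public void Paste()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_V);
    Console.WriteLine("⌨️ Pasted clipboard contents!");
}
```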
The Chrome Plugin opens URLs, navigates tabs, and controls browser behavior.
```csharp
[KernelFunction, Description("Opens the specified URL in Google Chrome.")]
public string OpenUrl(string url)
{
    try
    {
        string chromePath = GetChromePath();
        Process.Start(chromePath, url);
        Console.WriteLine($"🌐 Opened: {url}");
        return $"Opened {url} in Chrome.";
    }
    catch (Exception ex)
    {
        Console.WriteLine($"❌ Failed: {ex.Message}");
        return $"Failed to open Chrome: {ex.Message}";
    }
}

[KernelFunction, Description("Opens a new tab in Chrome using Ctrl+T.")]
public string OpenNewTab()
{
    _input.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_T);
    Console.WriteLine("🌐 Opened a new tab in Chrome.");
    return "Opened a new Chrome tab.";
}

[KernelFunction, Description("Goes back in Chrome using Alt+Left Arrow.")]
public string GoBack()
{
    // VirtualKeyCode.MENU is the Alt key
    _input.Keyboard.ModifiedKeyStroke(VirtualKeyCode.MENU, VirtualKeyCode.LEFT);
    Console.WriteLine("⬅️ Went back in Chrome.");
    return "Navigated back in Chrome.";
}
```
These functions let Cuata browse the web autonomously: opening links, navigating history, refreshing pages.
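OpenUrl calls a GetChromePath helper that isn't shown above. One plausible implementation, assuming a Windows deployment, checks the App Paths registry key written by the Chrome installer and falls back to the usual install locations:

```csharp
// Sketch: resolving chrome.exe. The registry key and fallback paths are
// standard Windows locations; Cuata's actual helper may differ.
private static string GetChromePath()
{
    // App Paths key registered by the Chrome installer (default value holds the path)
    var fromRegistry = Microsoft.Win32.Registry.GetValue(
        @"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\chrome.exe",
        "", null) as string;
    if (!string.IsNullOrEmpty(fromRegistry) && File.Exists(fromRegistry))
        return fromRegistry;

    // Common install locations as a fallback
    string[] candidates =
    {
        @"C:\Program Files\Google\Chrome\Application\chrome.exe",
        @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    };
    foreach (var path in candidates)
        if (File.Exists(path)) return path;

    throw new FileNotFoundException("chrome.exe not found.");
}
```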
The Screenshot Plugin captures the screen and asks Azure OpenAI to analyze it. This is how Cuata “sees” what’s on the screen.
```csharp
[KernelFunction, Description("Takes a screenshot and validates if an action was completed.")]
public async Task<string> ValidateScreenshot(
    [Description("The validation prompt")] string validationPrompt)
{
    // Capture screen
    ScreenCapture sc = new ScreenCapture();
    Image img = sc.CaptureScreen();
    using var bmp = new Bitmap(img);
    using var ms = new MemoryStream();
    bmp.Save(ms, ImageFormat.Png);
    ms.Position = 0;
    var imageData = new ReadOnlyMemory<byte>(ms.ToArray());

    // Send to Azure OpenAI for vision analysis
    var chatHistory = new ChatHistory();
    chatHistory.AddUserMessage(new ChatMessageContentItemCollection
    {
        new TextContent(validationPrompt),
        new ImageContent(imageData, "image/png")
    });

    var response = await _chatService.GetChatMessageContentAsync(
        chatHistory,
        executionSettings: new OpenAIPromptExecutionSettings { MaxTokens = 500 }
    );

    Console.WriteLine($"📸 Screenshot validation: {response.Content}");
    return response.Content ?? "Validation unclear.";
}
```
After clicking a button, Cuata takes a screenshot and asks: “Did the button click work? Is the expected dialog open?” This creates a validation loop that ensures actions succeeded.
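In practice the validation is just another awaited call. For example (an illustrative usage; the prompt wording, the yes/no answer convention, and the plugin field are my assumptions):

```csharp
// Sketch: validating a click right after performing it.
// The prompt and the yes/no answer convention are assumptions.
string verdict = await _screenshotPlugin.ValidateScreenshot(
    "Did the 'Join' button click succeed? Is the pre-join screen visible? Answer yes or no.");

if (!verdict.Contains("yes", StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine("❌ Click did not land; running another locate pass...");
}
```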
The Locate Plugin is the cleverest of the set. It uses Azure OCR to extract all text from the screen, find the coordinates of a specific text element, and tell the Mouse Plugin exactly where to click.
```csharp
[KernelFunction, Description("Locate an element in the screenshot based on the Search text")]
public async Task<string> LocateElementInScreenshot(
    [Description("Search text to be used to locate the element")] string input)
{
    // Capture screen
    ScreenCapture sc = new ScreenCapture();
    Image img = sc.CaptureScreen();
    int screenWidth = img.Width;
    int screenHeight = img.Height;
    using var bmp = new Bitmap(img);
    var outputPath = Path.Combine(Directory.GetCurrentDirectory(),
        $"locate-{DateTime.Now:yyyyMMddHHmmss}.png");
    bmp.Save(outputPath, ImageFormat.Png);
    Console.WriteLine($"🔍 Searching for: {input}");

    // Extract all text elements using Azure OCR
    // ('processor' is the plugin's OCR helper, initialized elsewhere in the class)
    var elements = await processor.ExtractTextElementsAsync(outputPath);

    // Find the element that matches the search text
    int index = processor.GetTextElement(elements, input, outputPath, verbose: true);

    // Get the normalized (0-1) coordinates of the matched element
    var coordinates = processor.GetTextCoordinates(elements, index, outputPath);

    // Scale normalized coordinates up to screen pixels
    int x = (int)(coordinates["x"] * screenWidth);
    int y = (int)(coordinates["y"] * screenHeight);
    Console.WriteLine($"🖱️ Click at coordinates: {x}, {y}");
    return $"Click at coordinates: {x}, {y} and the Screen width and height are: {screenWidth}, {screenHeight}";
}
```
This is how Cuata clicks buttons without hardcoded coordinates (a sketch of a helper that chains these steps follows the list). Semantic Kernel says: “Click the ‘Join Meeting’ button,” and the Locate Plugin:
- Takes a screenshot
- Sends it to Azure OCR
- Finds the text “Join Meeting”
- Returns the exact coordinates
- Calls Mouse Plugin to click at those coordinates
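Chained together, those steps are what a LocateAndClick helper (used again in the validation example later in this post) could look like. This is a sketch; the coordinate-string parsing and the plugin fields are assumptions:

```csharp
// Sketch: chaining Locate + Mouse into one helper. The coordinate string
// format matches LocateElementInScreenshot above; the regex parsing and
// plugin fields are assumptions, not Cuata's actual code.
private async Task LocateAndClick(string searchText)
{
    string result = await _locatePlugin.LocateElementInScreenshot(searchText);

    // Expected: "Click at coordinates: {x}, {y} and the Screen width and height are: {w}, {h}"
    var numbers = System.Text.RegularExpressions.Regex.Matches(result, @"\d+");
    int x = int.Parse(numbers[0].Value);
    int y = int.Parse(numbers[1].Value);
    int screenWidth = int.Parse(numbers[2].Value);
    int screenHeight = int.Parse(numbers[3].Value);

    _mousePlugin.MoveMouse(x, y, screenWidth, screenHeight);
    _mousePlugin.LeftClick();
}
```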
Here’s how a typical task flows through Cuata’s brain:
Task: “Join my 2 PM meeting in Microsoft Teams”
Step 1 - Think: Semantic Kernel analyzes the task and decides:
- Need to open Teams calendar
- Find the 2 PM meeting
- Click the “Join” button
Step 2 - Select Strategy & Execute:
```
AI Decision: "Use Chrome Plugin to open Teams calendar"
→ Calls: ChromePlugin.OpenUrl("https://teams.microsoft.com/calendar")

AI Decision: "Wait for page to load, then locate '2 PM' meeting"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Is the Teams calendar loaded?")
→ Calls: LocatePlugin.LocateElementInScreenshot("2 PM Meeting")

AI Decision: "Click the meeting link"
→ Calls: MousePlugin.MoveMouse(x, y, screenWidth, screenHeight)
→ Calls: MousePlugin.LeftClick()

AI Decision: "Validate that meeting details opened"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Is the meeting details dialog visible?")

AI Decision: "Click the 'Join' button"
→ Calls: LocatePlugin.LocateElementInScreenshot("Join")
→ Calls: MousePlugin.MoveMouse(x, y, screenWidth, screenHeight)
→ Calls: MousePlugin.LeftClick()

AI Decision: "Validate that we're in the meeting"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Are we in the Teams meeting?")
```
Each step validates the previous action before continuing. If something fails (e.g., “Join button not found”), the AI can retry, adjust the search term, or ask for help.
Many solutions use Playwright or browser-use MCP libraries for automation. Those are great for web-only scenarios, but Cuata needs to do more:
- Desktop Applications: Joining Teams meetings, opening Outlook, writing to Word
- Screen-Level Interaction: Clicking on desktop dialogs, system notifications, non-web UI
- Cross-Application Workflows: Copy from Chrome, paste into Word, send via Outlook
- Visual Validation: Screenshot analysis to confirm actions succeeded
Playwright can’t click outside the browser. Cuata operates at the OS level using:
- WindowsInput library for mouse/keyboard simulation
- Azure OCR for text location
- Azure OpenAI Vision for screen understanding
- Semantic Kernel for intelligent orchestration
This gives full low-level control over the entire desktop environment, not just browser tabs.
Underneath it all, Cuata leans on a handful of Azure services:
- Azure OpenAI (GPT-4 Turbo with Vision): analyzes screenshots, validates actions, and plans next steps; text embeddings convert screen content into searchable vectors
- Azure OCR: extracts text from screenshots with bounding-box coordinates, which is what lets the Locate Plugin find clickable elements
- Workflow orchestration: runs long-running workflows (e.g., meeting summarization) and handles multi-step processes with retries and checkpoints
- Azure Cosmos DB: stores meeting summaries, user preferences, and historical transcripts, enabling fast lookups for “What happened in my meetings this week?”
- Storage: keeps the screenshots taken during meetings and archives session recordings for later review
- Azure Service Bus: sends messages between Cuata components and triggers workflows when meetings start or when you return
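As one example, the trigger message published when a meeting starts might look like this with the Azure.Messaging.ServiceBus SDK; the connection string, queue name, and payload shape are placeholders:

```csharp
// Sketch: publishing a workflow trigger with Azure.Messaging.ServiceBus.
// Connection string, queue name, and payload are placeholders.
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");
ServiceBusSender sender = client.CreateSender("cuata-events");   // hypothetical queue

var message = new ServiceBusMessage("{\"event\":\"meeting-started\",\"meetingId\":\"2pm-standup\"}")
{
    ContentType = "application/json"
};
await sender.SendMessageAsync(message);
Console.WriteLine("📨 Meeting-started event published.");
```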
7. The Main Application Loop
When you run Cuata, it presents a menu:
```csharp
string[] options = new[]
{
    "💼 Teams App",
    "🌐 Browser App",
    "❌ Quit"
};

// User selects an option with arrow keys
switch (selectedOption)
{
    case "💼 Teams App":
        var teamsModule = _moduleFactory.GetModule("Teams");
        await teamsModule.ExecuteAsync();
        break;
    case "🌐 Browser App":
        var browserModule = _moduleFactory.GetModule("Browser");
        await browserModule.ExecuteAsync();
        break;
}
```
Each module (Teams, Browser) implements a specific workflow using the same plugin system. The ModuleFactory injects the Semantic Kernel instance with all registered plugins, and the module orchestrates the AI decision loop.
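Neither IModule nor ModuleFactory is shown in full here, so this is a sketch of what that contract might look like based on the menu code above; the type shapes and the workflow prompt are assumptions:

```csharp
// Sketch of the module contract implied by the menu code above.
// Interface, factory, and prompt are assumptions, not Cuata's actual types.
public interface IModule
{
    Task ExecuteAsync();
}

public class ModuleFactory
{
    private readonly Kernel _kernel;   // one Kernel, all plugins registered

    public ModuleFactory(Kernel kernel) => _kernel = kernel;

    public IModule GetModule(string name) => name switch
    {
        "Teams" => new TeamsModule(_kernel),
        // "Browser" case omitted for brevity
        _ => throw new ArgumentException($"Unknown module: {name}")
    };
}

public class TeamsModule : IModule
{
    private readonly Kernel _kernel;
    public TeamsModule(Kernel kernel) => _kernel = kernel;

    public async Task ExecuteAsync()
    {
        // The module owns the workflow prompt; the Kernel owns the plugins.
        var chat = _kernel.GetRequiredService<IChatCompletionService>();
        var reply = await chat.GetChatMessageContentAsync(
            "Join my next Teams meeting and summarize what I missed.",
            new OpenAIPromptExecutionSettings { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions },
            _kernel);
        Console.WriteLine(reply.Content);
    }
}
```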
After every action, Cuata validates the result. This is what makes it reliable.
```csharp
// Example: Click a button and validate
await LocateAndClick("Submit");

// Validate the action succeeded
var validation = await ScreenshotPlugin.ValidateScreenshot(
    "Is the form submitted? Look for confirmation message."
);

if (validation.Contains("not submitted"))
{
    // Retry logic
    Console.WriteLine("❌ Submission failed, retrying...");
    await LocateAndClick("Submit");
}
```
Without validation, automation breaks silently. With validation, Cuata can:
- Detect failures and retry
- Adjust its approach if the first attempt didn’t work
- Confirm success before moving to the next step
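A hardened version of that retry pattern might bound the attempts. This sketch reuses the hypothetical LocateAndClick helper from earlier; the screenshot plugin field and the yes/no success convention are assumptions:

```csharp
// Sketch: bounded retry around click + screenshot validation.
// 'LocateAndClick' is the helper sketched earlier; the success check is an assumption.
private async Task<bool> ClickWithValidationAsync(
    string target, string validationPrompt, int maxAttempts = 3)
{
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        await LocateAndClick(target);
        string verdict = await _screenshotPlugin.ValidateScreenshot(validationPrompt);

        if (verdict.Contains("yes", StringComparison.OrdinalIgnoreCase))
            return true;   // confirmed before moving on

        Console.WriteLine($"❌ Attempt {attempt} failed, retrying...");
    }
    return false;          // caller can adjust the approach or ask for help
}
```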
Looking back at the architecture diagram:
```
┌──────────────────────────────────────────────────────────────┐
│                         User / Cuata                         │
│             (Teams App, Browser App, OpenCV App)             │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                       AI Orchestration                       │
│     (Semantic Kernel: Think → Select Strategy → Execute)     │
│                                                              │
│        Plugins: Keyboard, Mouse, Locate, App Launcher        │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│               Service Bus (Messaging Ingestor)               │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                     Foundational Models                      │
│          (Azure OpenAI, Azure Cosmos DB, Azure OCR)          │
└──────────────────────────────────────────────────────────────┘
```
Everything flows through AI Orchestration. Plugins give Cuata capabilities, Semantic Kernel gives it intelligence, and Azure services give it memory and perception.
Traditional automation follows scripts. Cuata learns your context:
- Adaptive: If the UI changes, OCR finds new button locations
- Contextual: Screenshots provide visual understanding, not just DOM inspection
- Validated: Every action is confirmed before proceeding
- Conversational: You can ask Cuata to “Read this article and summarize it in Word”
It’s a buddy because it acts on your behalf when you’re not there. It doesn’t replace you; it fills in for those moments when you step away, ensuring you don’t miss anything important.
✅ Computer-Using Agents operate at the OS level, not just in browsers
✅ Think → Select Strategy → Execute creates adaptive, self-correcting behavior
✅ Plugin architecture keeps capabilities modular and testable
✅ Validation loops ensure reliability (screenshot analysis after every action)
✅ Azure OCR + OpenAI Vision enable visual understanding of screen state
✅ Semantic Kernel orchestrates everything, deciding which plugins to call and when
In the next blog, we’ll dive into the Teams Agent workflow (how Cuata joins meetings, transcribes discussions, and summarizes what you missed) and the Browser Agent workflow (how it browses websites, reads articles, and writes summaries into Word).
Cuata: Your Digital Twin Buddy 🤖
When you can’t be there, Cuata steps in.