Building a Computer-Using Agent - Your Digital Twin Buddy


I built Cuata (Spanish for “buddy”) to solve a simple problem: What if you need to step away from your computer during a meeting for just a minute? Someone rings the doorbell, you get an urgent call, or you need to grab coffee. You don’t want to miss what’s being discussed, but asking teammates to catch you up later feels awkward.

Cuata is your digital twin buddy that steps in for you. It watches your screen, listens to discussions, reads slides, clicks around when needed, and when you come back, gives you a quick summary of what you missed with screenshots of key slides.

But it evolved beyond meetings. The same technology lets Cuata browse websites, read articles, search for information, summarize content, and even write summaries directly into Microsoft Word, all while acting like a human operating your computer.

The architecture diagram (reproduced at the end of this post) shows how everything fits together. At the center is Semantic Kernel acting as the brain, coordinating plugins that let Cuata see, think, and act on your computer just like you would.


Most automation tools follow rigid scripts: “Click here, type that, scroll down.” They break when the UI changes or when context matters. Cuata doesn’t work that way.

Cuata follows a recursive decision-making loop (sketched in code below):

  1. Think: Semantic Kernel analyzes the current screen state and the task goal
  2. Select Strategy: It picks the right plugin(s) to use (Mouse, Keyboard, Chrome, Locate, Screenshot)
  3. Execute: It performs the action and validates the result
  4. Repeat: If the task isn’t complete, go back to Think

This is what makes it a Computer-Using Agent, not just a script runner. It adapts, validates, and corrects itself.
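
In code, the outer loop might look something like the sketch below. This is an illustration, not Cuata's actual source: the prompt wording and the DONE convention are my assumptions. It reuses the GetChatMessageContentAsync wrapper shown in the next block, which is the real entry point:

// Hypothetical sketch of the Think → Select Strategy → Execute → Repeat loop.
// The prompt format and completion check are illustrative, not Cuata's real code.
async Task RunAgentLoopAsync(Kernel kernel, string taskGoal)
{
    string lastObservation = "Nothing done yet.";

    while (true)
    {
        // Think: describe the goal and the latest observation to the model
        string prompt = $"Goal: {taskGoal}\nLast result: {lastObservation}\n" +
                        "Decide the next action, or reply DONE if the goal is met.";

        // Select Strategy + Execute: auto-invocation lets the model call plugins
        ChatMessageContent reply = await GetChatMessageContentAsync(kernel, prompt, null);
        lastObservation = reply.Content ?? string.Empty;

        // Repeat until the model reports completion
        if (lastObservation.Contains("DONE", StringComparison.OrdinalIgnoreCase))
            break;
    }
}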

public async Task<ChatMessageContent> GetChatMessageContentAsync(
    Kernel kernel, 
    string prompt, 
    OpenAIPromptExecutionSettings? promptExecutionSettings)
{
    // Use the caller's settings if provided; otherwise default to
    // auto tool-calling so the model can invoke plugin functions itself
    OpenAIPromptExecutionSettings settings = promptExecutionSettings ?? new()
    {
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions,
        TopP = 1,
        Temperature = 0.7
    };

    // Let Semantic Kernel orchestrate plugin calls
    return await _chatCompletionService.GetChatMessageContentAsync(
        prompt,
        executionSettings: settings,
        kernel: kernel
    );
}

The ToolCallBehavior.AutoInvokeKernelFunctions setting tells Semantic Kernel: “You can call any plugin function you need to complete this task.” The AI decides which plugins to invoke based on the context.
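
For auto-invocation to work, the plugins must first be registered on the kernel. A minimal sketch of that wiring, assuming the plugin class names used in this post and that they can be constructed via DI, might look like this:

// Hypothetical wiring; the deployment name and plugin class names are assumptions
var builder = Kernel.CreateBuilder();

builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4-turbo",                       // assumed deployment name
    endpoint: "https://<resource>.openai.azure.com/",
    apiKey: "<api-key>");

// Register the plugins so the model can discover them via their Descriptions
builder.Plugins.AddFromType<MousePlugin>();
builder.Plugins.AddFromType<KeyboardPlugin>();
builder.Plugins.AddFromType<ChromePlugin>();
builder.Plugins.AddFromType<ScreenshotPlugin>();
builder.Plugins.AddFromType<LocatePlugin>();

Kernel kernel = builder.Build();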


Cuata has 10 plugins that let it interact with your computer. The most important ones are covered below.

The Mouse Plugin lets Cuata move the cursor, click, scroll, and drag, just like you would.

[KernelFunction, Description("Moves the mouse to the specified screen coordinates.")]
public void MoveMouse(
    [Description("The X coordinate.")] int x, 
    [Description("The Y coordinate.")] int y, 
    [Description("The screen width.")] int screenWidth, 
    [Description("The screen height.")] int screenHeight)
{
    Console.WriteLine($"🖱️ Moving the mouse to coordinates ({x}, {y})");

    // Convert pixels to the 0-65535 absolute range InputSimulator expects
    // (65535.0 forces floating-point division and avoids integer truncation)
    double X = x * 65535.0 / screenWidth;
    double Y = y * 65535.0 / screenHeight;

    _inputSimulator.Mouse.MoveMouseTo(X, Y);
    Console.WriteLine($"🖱️ Moved to ({x}, {y})");
}

[KernelFunction, Description("Performs a left mouse click.")]
public void LeftClick()
{
    _inputSimulator.Mouse.LeftButtonClick();
    Console.WriteLine("🖱️ Left mouse button clicked! ✅");
}

[KernelFunction, Description("Scrolls the mouse wheel.")]
public void Scroll(
    [Description("Positive to scroll up, negative to scroll down.")] int scrollAmount)
{
    _inputSimulator.Mouse.VerticalScroll(scrollAmount);
    Console.WriteLine($"🖱️ Scrolled {scrollAmount} clicks!");
}

Each function has a Description attribute that tells Semantic Kernel when to call it. The AI reads these descriptions and decides: “I need to click a button, so I’ll call MoveMouse and then LeftClick.”
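
You can also exercise a plugin function directly, without the model in the loop, which is handy for testing. A sketch, assuming the plugin was registered under the name "MousePlugin":

// Hypothetical direct invocation for testing; plugin name and values are assumed
var args = new KernelArguments
{
    ["x"] = 500,
    ["y"] = 300,
    ["screenWidth"] = 1920,
    ["screenHeight"] = 1080
};

await kernel.InvokeAsync("MousePlugin", "MoveMouse", args);
await kernel.InvokeAsync("MousePlugin", "LeftClick");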

The Keyboard Plugin types text and presses keyboard shortcuts.

[KernelFunction, Description("Types the given text into the currently focused field.")]
public void TypeText([Description("The text to type.")] string text)
{
    _inputSimulator.Keyboard.TextEntry(text);
    Console.WriteLine($"⌨️ Typed: {text}");
}

[KernelFunction, Description("Presses Enter key.")]
public void PressEnter()
{
    _inputSimulator.Keyboard.KeyPress(VirtualKeyCode.RETURN);
    Console.WriteLine("⌨️ Pressed Enter!");
}

[KernelFunction, Description("Selects all text using Ctrl+A.")]
public void SelectAll()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_A);
    Console.WriteLine("⌨️ Selected all text!");
}

Semantic Kernel uses these to fill forms, search for content, or navigate with shortcuts.
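
The pattern extends naturally to other shortcuts. For example, a copy/paste pair in the same style (my addition for illustration, not part of Cuata's published plugin list) would look like:

// Hypothetical additions following the Keyboard Plugin pattern above
[KernelFunction, Description("Copies the current selection using Ctrl+C.")]
public void Copy()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_C);
    Console.WriteLine("⌨️ Copied selection!");
}

[KernelFunction, Description("Pastes the clipboard contents using Ctrl+V.")]
public void Paste()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_V);
    Console.WriteLine("⌨️ Pasted clipboard contents!");
}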

The Chrome Plugin opens URLs, navigates tabs, and controls browser behavior.

[KernelFunction, Description("Opens the specified URL in Google Chrome.")]
public string OpenUrl(string url)
{
    try
    {
        string chromePath = GetChromePath();
        Process.Start(chromePath, url);
        Console.WriteLine($"🌐 Opened: {url}");
        return $"Opened {url} in Chrome.";
    }
    catch (Exception ex)
    {
        Console.WriteLine($"❌ Failed: {ex.Message}");
        return $"Failed to open Chrome: {ex.Message}";
    }
}

[KernelFunction, Description("Opens a new tab in Chrome using Ctrl+T.")]
public string OpenNewTab()
{
    _input.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_T);
    Console.WriteLine("🆕 Opened a new tab in Chrome.");
    return "Opened a new Chrome tab.";
}

[KernelFunction, Description("Goes back in Chrome using Alt+Left Arrow.")]
public string GoBack()
{
    _input.Keyboard.ModifiedKeyStroke(VirtualKeyCode.MENU, VirtualKeyCode.LEFT);
    Console.WriteLine("⬅️ Went back in Chrome.");
    return "Navigated back in Chrome.";
}

These functions let Cuata browse the web autonomously: opening links, navigating history, refreshing pages.
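
OpenUrl relies on a GetChromePath helper that isn't shown in the post. A minimal sketch, assuming Chrome sits in one of its default install locations, could be:

// Hypothetical helper; probes the default Chrome install locations on Windows
private static string GetChromePath()
{
    string[] candidates =
    {
        @"C:\Program Files\Google\Chrome\Application\chrome.exe",
        @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    };

    foreach (string path in candidates)
    {
        if (File.Exists(path))
            return path;
    }

    throw new FileNotFoundException("Google Chrome was not found in the default locations.");
}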

The Screenshot Plugin captures the screen and asks Azure OpenAI to analyze it. This is how Cuata “sees” what’s on the screen.

[KernelFunction, Description("Takes a screenshot and validates if an action was completed.")]
public async Task<string> ValidateScreenshot(
    [Description("The validation prompt")] string validationPrompt)
{
    // Capture screen
    ScreenCapture sc = new ScreenCapture();
    Image img = sc.CaptureScreen();
    
    using var bmp = new Bitmap(img);
    using var ms = new MemoryStream();
    bmp.Save(ms, ImageFormat.Png);
    ms.Position = 0;
    
    var imageData = new ReadOnlyMemory<byte>(ms.ToArray());

    // Send to Azure OpenAI for vision analysis
    var chatHistory = new ChatHistory();
    chatHistory.AddUserMessage(new ChatMessageContentItemCollection
    {
        new TextContent(validationPrompt),
        new ImageContent(imageData, "image/png")
    });

    var response = await _chatService.GetChatMessageContentAsync(
        chatHistory, 
        executionSettings: new OpenAIPromptExecutionSettings { MaxTokens = 500 }
    );

    Console.WriteLine($"📸 Screenshot validation: {response.Content}");
    return response.Content ?? "Validation unclear.";
}

After clicking a button, Cuata takes a screenshot and asks: “Did the button click work? Is the expected dialog open?” This creates a validation loop that ensures actions succeeded.

The Locate Plugin is the most clever. It uses Azure OCR to extract all text from the screen, find the coordinates of a specific text element, and tell the Mouse Plugin exactly where to click.

[KernelFunction, Description("Locate an element in the screenshot based on the Search text")]
public async Task<string> LocateElementInScreenshot(
    [Description("Search text to be used to locate the element")] string input)
{
    // Capture screen
    ScreenCapture sc = new ScreenCapture();
    Image img = sc.CaptureScreen();
    int screenWidth = img.Width;
    int screenHeight = img.Height;
    
    using var bmp = new Bitmap(img);
    var outputPath = Path.Combine(Directory.GetCurrentDirectory(), 
        $"locate-{DateTime.Now:yyyyMMddHHmmss}.png");
    bmp.Save(outputPath, ImageFormat.Png);

    Console.WriteLine($"📍 Searching for: {input}");

    // Extract all text elements using Azure OCR
    var elements = await processor.ExtractTextElementsAsync(outputPath);
    
    // Find the element that matches the search text
    int index = processor.GetTextElement(elements, input, outputPath, verbose: true);
    
    // Get the coordinates of the matched element
    var coordinates = processor.GetTextCoordinates(elements, index, outputPath);

    int x = (int)(coordinates["x"] * screenWidth);
    int y = (int)(coordinates["y"] * screenHeight);

    Console.WriteLine($"🖱️ Click at coordinates: {x}, {y}");
    
    return $"Click at coordinates: {x}, {y} and the Screen width and height are: {screenWidth}, {screenHeight}";
}

This is how Cuata clicks buttons without hardcoded coordinates. Semantic Kernel says: “Click the ‘Join Meeting’ button,” and the Locate Plugin:

  1. Takes a screenshot
  2. Sends it to Azure OCR
  3. Finds the text “Join Meeting”
  4. Returns the exact coordinates
  5. Calls Mouse Plugin to click at those coordinates
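
The OCR processor itself isn't shown in the post. For illustration, here is one way the text-extraction step (step 2 above) could be implemented with the Azure AI Vision Image Analysis SDK; the OcrElement record, endpoint, and client setup are my assumptions, not Cuata's actual code:

using Azure;
using Azure.AI.Vision.ImageAnalysis;

// Hypothetical element type: recognized text plus its pixel bounding polygon
public record OcrElement(string Text, IReadOnlyList<ImagePoint> BoundingPolygon);

public async Task<List<OcrElement>> ExtractTextElementsAsync(string imagePath)
{
    // Endpoint and key are assumed to come from configuration
    var client = new ImageAnalysisClient(
        new Uri("https://<resource>.cognitiveservices.azure.com/"),
        new AzureKeyCredential("<api-key>"));

    // Run the Read (OCR) feature over the screenshot
    ImageAnalysisResult result = await client.AnalyzeAsync(
        BinaryData.FromBytes(await File.ReadAllBytesAsync(imagePath)),
        VisualFeatures.Read);

    // Flatten the detected lines into a simple list of text + coordinates
    var elements = new List<OcrElement>();
    foreach (DetectedTextBlock block in result.Read.Blocks)
        foreach (DetectedTextLine line in block.Lines)
            elements.Add(new OcrElement(line.Text, line.BoundingPolygon));

    return elements;
}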

Here’s how a typical task flows through Cuata’s brain:

Task: “Join my 2 PM meeting in Microsoft Teams”

Step 1 - Think: Semantic Kernel analyzes the task and decides:

  • Need to open Teams calendar
  • Find the 2 PM meeting
  • Click the “Join” button

Step 2 - Select Strategy & Execute:

AI Decision: "Use Chrome Plugin to open Teams calendar"
→ Calls: ChromePlugin.OpenUrl("https://teams.microsoft.com/calendar")

AI Decision: "Wait for page to load, then locate '2 PM' meeting"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Is the Teams calendar loaded?")
→ Calls: LocatePlugin.LocateElementInScreenshot("2 PM Meeting")

AI Decision: "Click the meeting link"
→ Calls: MousePlugin.MoveMouse(x, y, screenWidth, screenHeight)
→ Calls: MousePlugin.LeftClick()

AI Decision: "Validate that meeting details opened"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Is the meeting details dialog visible?")

AI Decision: "Click the 'Join' button"
→ Calls: LocatePlugin.LocateElementInScreenshot("Join")
→ Calls: MousePlugin.MoveMouse(x, y, screenWidth, screenHeight)
→ Calls: MousePlugin.LeftClick()

AI Decision: "Validate that we're in the meeting"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Are we in the Teams meeting?")

Each step validates the previous action before continuing. If something fails (e.g., “Join button not found”), the AI can retry, adjust the search term, or ask for help.


Many solutions use Playwright or browser-use MCP libraries for automation. Those are great for web-only scenarios, but Cuata needs to do more:

  • Desktop Applications: Joining Teams meetings, opening Outlook, writing to Word
  • Screen-Level Interaction: Clicking on desktop dialogs, system notifications, non-web UI
  • Cross-Application Workflows: Copy from Chrome, paste into Word, send via Outlook
  • Visual Validation: Screenshot analysis to confirm actions succeeded

Playwright can’t click outside the browser. Cuata operates at the OS level using:

  • WindowsInput library for mouse/keyboard simulation
  • Azure OCR for text location
  • Azure OpenAI Vision for screen understanding
  • Semantic Kernel for intelligent orchestration

This gives full low-level control over the entire desktop environment, not just browser tabs.


Cuata leans on several Azure services behind the scenes:

Azure OpenAI
  • GPT-4 Turbo with Vision: Analyzes screenshots, validates actions, plans next steps
  • Text Embedding: Converts screen content into searchable vectors

Azure OCR
  • Extracts text from screenshots with bounding-box coordinates
  • Enables the Locate Plugin to find clickable elements

Workflow Orchestration
  • Orchestrates long-running workflows (e.g., meeting summarization)
  • Handles multi-step processes with retries and checkpoints

Azure Cosmos DB
  • Stores meeting summaries, user preferences, historical transcripts
  • Enables fast lookups for “What happened in my meetings this week?”

Storage
  • Stores screenshots taken during meetings
  • Archives session recordings for later review

Azure Service Bus
  • Sends messages between Cuata components
  • Triggers workflows when meetings start or when you return

When you run Cuata, it presents a menu:

string[] options = new[]
{
    "💼 Teams App",
    "🌐 Browser App",
    "❌ Quit"
};

// User selects an option with arrow keys
switch (selectedOption)
{
    case "💼 Teams App":
        var teamsModule = _moduleFactory.GetModule("Teams");
        await teamsModule.ExecuteAsync();
        break;
        
    case "🌐 Browser App":
        var browserModule = _moduleFactory.GetModule("Browser");
        await browserModule.ExecuteAsync();
        break;
}

Each module (Teams, Browser) implements a specific workflow using the same plugin system. The ModuleFactory injects the Semantic Kernel instance with all registered plugins, and the module orchestrates the AI decision loop.
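
The post doesn't spell out the module contract, but a minimal sketch of what IModule and ModuleFactory might look like (names and constructors inferred from the snippet above, so treat them as assumptions) is:

// Hypothetical contract inferred from the menu snippet above
public interface IModule
{
    Task ExecuteAsync();
}

public class ModuleFactory
{
    private readonly Dictionary<string, IModule> _modules;

    // Each module receives the same kernel, with all plugins already registered;
    // TeamsModule and BrowserModule are assumed class names
    public ModuleFactory(Kernel kernel)
    {
        _modules = new Dictionary<string, IModule>
        {
            ["Teams"] = new TeamsModule(kernel),
            ["Browser"] = new BrowserModule(kernel)
        };
    }

    public IModule GetModule(string name) =>
        _modules.TryGetValue(name, out var module)
            ? module
            : throw new ArgumentException($"Unknown module: {name}");
}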


After every action, Cuata validates the result. This is what makes it reliable.

// Example: Click a button and validate
// (LocateAndClick is shorthand for a Locate Plugin lookup
//  followed by MousePlugin.MoveMouse + LeftClick)
await LocateAndClick("Submit");

// Validate the action succeeded
var validation = await ScreenshotPlugin.ValidateScreenshot(
    "Is the form submitted? Look for confirmation message."
);

if (validation.Contains("not submitted"))
{
    // Retry logic
    Console.WriteLine("❌ Submission failed, retrying...");
    await LocateAndClick("Submit");
}

Without validation, automation breaks silently. With validation, Cuata can:

  • Detect failures and retry
  • Adjust its approach if the first attempt didn’t work
  • Confirm success before moving to the next step (the sketch below generalizes this pattern)
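
One way to generalize the retry pattern is a bounded helper, sketched here as my own illustration rather than Cuata's code; the failureMarker convention and attempt limit are assumptions:

// Hypothetical helper: retry an action until validation passes or attempts run out
async Task<bool> ExecuteWithValidationAsync(
    Func<Task> action,
    string validationPrompt,
    string failureMarker,
    int maxAttempts = 3)
{
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        await action();

        // Screenshot + vision check, as in the Screenshot Plugin above
        string validation = await ScreenshotPlugin.ValidateScreenshot(validationPrompt);

        if (!validation.Contains(failureMarker, StringComparison.OrdinalIgnoreCase))
            return true; // validated: move on to the next step

        Console.WriteLine($"❌ Attempt {attempt} failed, retrying...");
    }

    return false; // give up and let the caller decide (e.g., ask the user)
}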

Looking back at the architecture diagram:

┌─────────────────────────────────────────────────────────────┐
│                     User / Cuata                            │
│  (Teams App, Browser App, OpenCV App)                       │
└──────────────────┬──────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────────┐
│              AI Orchestration                               │
│  (Semantic Kernel: Think → Select Strategy → Execute)       │
│                                                             │
│  Plugins: Keyboard, Mouse, Locate, App Launcher             │
└──────────────────┬──────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────────┐
│  Service Bus (Messaging Ingestor)                           │
└──────────────────┬──────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────────┐
│  Foundational Models                                        │
│  (Azure OpenAI, Azure Cosmos DB, Azure OCR)                 │
└─────────────────────────────────────────────────────────────┘

Everything flows through AI Orchestration. Plugins give Cuata capabilities, Semantic Kernel gives it intelligence, and Azure services give it memory and perception.


Traditional automation follows scripts. Cuata learns your context:

  • Adaptive: If the UI changes, OCR finds new button locations
  • Contextual: Screenshots provide visual understanding, not just DOM inspection
  • Validated: Every action is confirmed before proceeding
  • Conversational: You can ask Cuata to “Read this article and summarize it in Word”

It’s a buddy because it acts on your behalf when you’re not there. It doesn’t replace you; it fills in for those moments when you step away, ensuring you don’t miss anything important.


✅ Computer-Using Agents operate at the OS level, not just in browsers
✅ Think → Select Strategy → Execute creates adaptive, self-correcting behavior
✅ Plugin architecture keeps capabilities modular and testable
✅ Validation loops ensure reliability (screenshot analysis after every action)
✅ Azure OCR + OpenAI Vision enable visual understanding of screen state
✅ Semantic Kernel orchestrates everything, deciding which plugins to call and when

In the next blog, we’ll dive into the Teams Agent workflow (how Cuata joins meetings, transcribes discussions, and summarizes what you missed) and the Browser Agent workflow (how it browses websites, reads articles, and writes summaries into Word).


Cuata: Your Digital Twin Buddy 🤖
When you can’t be there, Cuata steps in.