I built Cuata (Spanish for “buddy”) to solve a simple problem: What if you need to step away from your computer during a meeting for just a minute? Someone rings the doorbell, you get an urgent call, or you need to grab coffee. You don’t want to miss what’s being discussed, but asking teammates to catch you up later feels awkward.
Cuata is your digital twin buddy that steps in for you. It watches your screen, listens to discussions, reads slides, clicks around when needed, and when you come back, gives you a quick summary of what you missed, along with screenshots of key slides.
But it evolved beyond meetings. The same technology lets Cuata browse websites, read articles, search for information, summarize content, and even write summaries directly into Microsoft Word, all while acting like a human operating your computer.
This architecture diagram shows how everything fits together. At the center is Semantic Kernel acting as the brain, coordinating plugins that let Cuata see, think, and act on your computer just like you would.
Most automation tools follow rigid scripts: “Click here, type that, scroll down.” They break when the UI changes or when context matters. Cuata doesn’t work that way.
Cuata follows an iterative decision-making loop:
- Think: Semantic Kernel analyzes the current screen state and the task goal
- Select Strategy: It picks the right plugin(s) to use (Mouse, Keyboard, Chrome, Locate, Screenshot)
- Execute: It performs the action and validates the result
- Repeat: If the task isn’t complete, go back to Think
This is what makes it a Computer-Using Agent, not just a script runner. It adapts, validates, and corrects itself.
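Before diving into the plugins, here is a minimal sketch of what such a loop can look like in C#. To be clear, this is an illustration, not Cuata's actual loop: the method name, the completion check, and the prompt wording are my assumptions.

```csharp
// Hypothetical sketch of the Think -> Select Strategy -> Execute -> Repeat loop.
// 'RunAgentLoopAsync', 'TASK COMPLETE', and the prompts are illustrative assumptions.
public async Task RunAgentLoopAsync(Kernel kernel, string taskGoal)
{
    var chat = kernel.GetRequiredService<IChatCompletionService>();
    var settings = new OpenAIPromptExecutionSettings
    {
        // Let the model pick and invoke plugins on its own
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
    };
    var history = new ChatHistory();
    history.AddUserMessage(taskGoal);

    while (true)
    {
        // Think, Select Strategy, and Execute all happen inside this call:
        // the model reasons over the history and auto-invokes plugin functions.
        var reply = await chat.GetChatMessageContentAsync(history, settings, kernel);
        history.Add(reply);

        // Repeat until the model reports the task as complete.
        if (reply.Content?.Contains("TASK COMPLETE", StringComparison.OrdinalIgnoreCase) == true)
            break;
        history.AddUserMessage("If the task is not complete, continue. Otherwise reply TASK COMPLETE.");
    }
}
```

Inside Cuata, the Think step boils down to a single call into Semantic Kernel's chat completion service: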
```csharp
public async Task<ChatMessageContent> GetChatMessageContentAsync(
    Kernel kernel,
    string prompt,
    OpenAIPromptExecutionSettings? promptExecutionSettings)
{
    // Configure AI with tool-calling behavior,
    // falling back to defaults when the caller passes no settings
    OpenAIPromptExecutionSettings settings = promptExecutionSettings ?? new()
    {
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions,
        TopP = 1,
        Temperature = 0.7
    };

    // Let Semantic Kernel orchestrate plugin calls
    return await _chatCompletionService.GetChatMessageContentAsync(
        prompt,
        executionSettings: settings,
        kernel: kernel
    );
}
```
The ToolCallBehavior.AutoInvokeKernelFunctions setting tells Semantic Kernel: “You can call any plugin function you need to complete this task.” The AI decides which plugins to invoke based on the context.
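For AutoInvokeKernelFunctions to have anything to invoke, the plugins have to be registered on the Kernel first. Here is a minimal sketch of that wiring; the endpoint, deployment name, and API key are placeholders, and the plugin class names follow the ones used throughout this post:

```csharp
// Sketch: registering Cuata-style plugins on a Semantic Kernel instance.
// Endpoint, deployment name, and key values are placeholders.
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4-turbo",                      // placeholder deployment
    endpoint: "https://<resource>.openai.azure.com/",   // placeholder endpoint
    apiKey: "<api-key>");

// Each plugin's [KernelFunction] methods become tools the model can call.
builder.Plugins.AddFromType<MousePlugin>();
builder.Plugins.AddFromType<KeyboardPlugin>();
builder.Plugins.AddFromType<ChromePlugin>();
builder.Plugins.AddFromType<ScreenshotPlugin>();
builder.Plugins.AddFromType<LocatePlugin>();

Kernel kernel = builder.Build();
```

With the plugins registered, every [KernelFunction] shown below becomes a tool the model can choose to call.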
3. The Plugin System: Cuata’s Hands and Eyes
Cuata has 10 plugins that let it interact with your computer. The most important ones are covered below.
The Mouse Plugin lets Cuata move the cursor, click, scroll, and drag, just like you would.
```csharp
[KernelFunction, Description("Moves the mouse to the specified screen coordinates.")]
public void MoveMouse(
    [Description("The X coordinate.")] int x,
    [Description("The Y coordinate.")] int y,
    [Description("The screen width.")] int screenWidth,
    [Description("The screen height.")] int screenHeight)
{
    Console.WriteLine($"🖱️ Moving the mouse to coordinates ({x}, {y})");

    // Convert pixels to the 0-65535 absolute range InputSimulator expects
    double absX = x * 65535 / screenWidth;
    double absY = y * 65535 / screenHeight;
    _inputSimulator.Mouse.MoveMouseTo(absX, absY);

    Console.WriteLine($"🖱️ Moved to ({x}, {y})");
}

[KernelFunction, Description("Performs a left mouse click.")]
public void LeftClick()
{
    _inputSimulator.Mouse.LeftButtonClick();
    Console.WriteLine("🖱️ Left mouse button clicked! ✅");
}

[KernelFunction, Description("Scrolls the mouse wheel.")]
public void Scroll(
    [Description("Positive to scroll up, negative to scroll down.")] int scrollAmount)
{
    _inputSimulator.Mouse.VerticalScroll(scrollAmount);
    Console.WriteLine($"🖱️ Scrolled {scrollAmount} clicks!");
}
```
Each function has a Description attribute that tells Semantic Kernel when to call it. The AI reads these descriptions and decides: “I need to click a button, so I’ll call MoveMouse and then LeftClick.”
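One capability mentioned above but not shown is dragging. It could be composed from the same InputSimulator primitives, roughly like this (a sketch, not Cuata's actual code; the coordinate conversion mirrors MoveMouse):

```csharp
// Sketch: drag composed from press, move, release.
// Assumes the same 0-65535 coordinate conversion as MoveMouse above.
[KernelFunction, Description("Drags the mouse from one point to another.")]
public void Drag(int fromX, int fromY, int toX, int toY, int screenWidth, int screenHeight)
{
    _inputSimulator.Mouse.MoveMouseTo(fromX * 65535.0 / screenWidth, fromY * 65535.0 / screenHeight);
    _inputSimulator.Mouse.LeftButtonDown();    // press and hold
    _inputSimulator.Mouse.MoveMouseTo(toX * 65535.0 / screenWidth, toY * 65535.0 / screenHeight);
    _inputSimulator.Mouse.LeftButtonUp();      // release to drop
    Console.WriteLine($"🖱️ Dragged from ({fromX}, {fromY}) to ({toX}, {toY})");
}
```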
The Keyboard Plugin types text and presses keyboard shortcuts.
```csharp
[KernelFunction, Description("Types the given text into the currently focused field.")]
public void TypeText([Description("The text to type.")] string text)
{
    _inputSimulator.Keyboard.TextEntry(text);
    Console.WriteLine($"⌨️ Typed: {text}");
}

[KernelFunction, Description("Presses Enter key.")]
public void PressEnter()
{
    _inputSimulator.Keyboard.KeyPress(VirtualKeyCode.RETURN);
    Console.WriteLine("⌨️ Pressed Enter!");
}

[KernelFunction, Description("Selects all text using Ctrl+A.")]
public void SelectAll()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_A);
    Console.WriteLine("⌨️ Selected all text!");
}
```
Semantic Kernel uses these to fill forms, search for content, or navigate with shortcuts.
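The cross-application workflows described later (copy from Chrome, paste into Word) lean on this same pattern. Copy and paste would be two more functions in the same style; this is a sketch, not code from Cuata's source:

```csharp
// Sketch: clipboard shortcuts in the same style as the functions above.
[KernelFunction, Description("Copies the current selection using Ctrl+C.")]
public void Copy()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_C);
    Console.WriteLine("⌨️ Copied selection!");
}

[KernelFunction, Description("Pastes clipboard contents using Ctrl+V.")]
public void Paste()
{
    _inputSimulator.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_V);
    Console.WriteLine("⌨️ Pasted clipboard contents!");
}
```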
The Chrome Plugin opens URLs, navigates tabs, and controls browser behavior.
```csharp
[KernelFunction, Description("Opens the specified URL in Google Chrome.")]
public string OpenUrl(string url)
{
    try
    {
        string chromePath = GetChromePath();
        Process.Start(chromePath, url);
        Console.WriteLine($"🌐 Opened: {url}");
        return $"Opened {url} in Chrome.";
    }
    catch (Exception ex)
    {
        Console.WriteLine($"❌ Failed: {ex.Message}");
        return $"Failed to open Chrome: {ex.Message}";
    }
}

[KernelFunction, Description("Opens a new tab in Chrome using Ctrl+T.")]
public string OpenNewTab()
{
    _input.Keyboard.ModifiedKeyStroke(VirtualKeyCode.CONTROL, VirtualKeyCode.VK_T);
    Console.WriteLine("🌐 Opened a new tab in Chrome.");
    return "Opened a new Chrome tab.";
}

[KernelFunction, Description("Goes back in Chrome using Alt+Left Arrow.")]
public string GoBack()
{
    // VirtualKeyCode.MENU is the Alt key
    _input.Keyboard.ModifiedKeyStroke(VirtualKeyCode.MENU, VirtualKeyCode.LEFT);
    Console.WriteLine("⬅️ Went back in Chrome.");
    return "Navigated back in Chrome.";
}
```
These functions let Cuata browse the web autonomously: opening links, navigating history, refreshing pages.
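OpenUrl calls a GetChromePath helper that isn't shown above. One plausible implementation, assuming a Windows deployment, checks the App Paths registry key written by the Chrome installer and falls back to the usual install locations:

```csharp
// Sketch: resolving chrome.exe. The registry key and fallback paths are
// standard Windows locations; Cuata's actual helper may differ.
private static string GetChromePath()
{
    // App Paths key registered by the Chrome installer (default value holds the path)
    var fromRegistry = Microsoft.Win32.Registry.GetValue(
        @"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\chrome.exe",
        "", null) as string;
    if (!string.IsNullOrEmpty(fromRegistry) && File.Exists(fromRegistry))
        return fromRegistry;

    // Common install locations as a fallback
    string[] candidates =
    {
        @"C:\Program Files\Google\Chrome\Application\chrome.exe",
        @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    };
    foreach (var path in candidates)
        if (File.Exists(path)) return path;

    throw new FileNotFoundException("chrome.exe not found.");
}
```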
The Screenshot Plugin captures the screen and asks Azure OpenAI to analyze it. This is how Cuata “sees” what’s on the screen.
```csharp
[KernelFunction, Description("Takes a screenshot and validates if an action was completed.")]
public async Task<string> ValidateScreenshot(
    [Description("The validation prompt")] string validationPrompt)
{
    // Capture screen
    ScreenCapture sc = new ScreenCapture();
    Image img = sc.CaptureScreen();
    using var bmp = new Bitmap(img);
    using var ms = new MemoryStream();
    bmp.Save(ms, ImageFormat.Png);
    ms.Position = 0;
    var imageData = new ReadOnlyMemory<byte>(ms.ToArray());

    // Send to Azure OpenAI for vision analysis
    var chatHistory = new ChatHistory();
    chatHistory.AddUserMessage(new ChatMessageContentItemCollection
    {
        new TextContent(validationPrompt),
        new ImageContent(imageData, "image/png")
    });

    var response = await _chatService.GetChatMessageContentAsync(
        chatHistory,
        executionSettings: new OpenAIPromptExecutionSettings { MaxTokens = 500 }
    );

    Console.WriteLine($"📸 Screenshot validation: {response.Content}");
    return response.Content ?? "Validation unclear.";
}
```
After clicking a button, Cuata takes a screenshot and asks: “Did the button click work? Is the expected dialog open?” This creates a validation loop that ensures actions succeeded.
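In practice the validation is just another awaited call. For example (an illustrative usage; the prompt wording, the yes/no answer convention, and the plugin field are my assumptions):

```csharp
// Sketch: validating a click right after performing it.
// The prompt and the yes/no answer convention are assumptions.
string verdict = await _screenshotPlugin.ValidateScreenshot(
    "Did the 'Join' button click succeed? Is the pre-join screen visible? Answer yes or no.");

if (!verdict.Contains("yes", StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine("❌ Click did not land; running another locate pass...");
}
```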
The Locate Plugin is the cleverest of the set. It uses Azure OCR to extract all text from the screen, find the coordinates of a specific text element, and tell the Mouse Plugin exactly where to click.
```csharp
[KernelFunction, Description("Locate an element in the screenshot based on the Search text")]
public async Task<string> LocateElementInScreenshot(
    [Description("Search text to be used to locate the element")] string input)
{
    // Capture screen
    ScreenCapture sc = new ScreenCapture();
    Image img = sc.CaptureScreen();
    int screenWidth = img.Width;
    int screenHeight = img.Height;
    using var bmp = new Bitmap(img);
    var outputPath = Path.Combine(Directory.GetCurrentDirectory(),
        $"locate-{DateTime.Now:yyyyMMddHHmmss}.png");
    bmp.Save(outputPath, ImageFormat.Png);
    Console.WriteLine($"🔍 Searching for: {input}");

    // Extract all text elements using Azure OCR
    // ('processor' is the plugin's OCR helper, initialized elsewhere in the class)
    var elements = await processor.ExtractTextElementsAsync(outputPath);

    // Find the element that matches the search text
    int index = processor.GetTextElement(elements, input, outputPath, verbose: true);

    // Get the normalized (0-1) coordinates of the matched element
    var coordinates = processor.GetTextCoordinates(elements, index, outputPath);

    // Scale normalized coordinates up to screen pixels
    int x = (int)(coordinates["x"] * screenWidth);
    int y = (int)(coordinates["y"] * screenHeight);
    Console.WriteLine($"🖱️ Click at coordinates: {x}, {y}");
    return $"Click at coordinates: {x}, {y} and the Screen width and height are: {screenWidth}, {screenHeight}";
}
```
This is how Cuata clicks buttons without hardcoded coordinates (a sketch of a helper that chains these steps follows the list). Semantic Kernel says: “Click the ‘Join Meeting’ button,” and the Locate Plugin:
- Takes a screenshot
- Sends it to Azure OCR
- Finds the text “Join Meeting”
- Returns the exact coordinates
- Calls Mouse Plugin to click at those coordinates
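Chained together, those steps are what a LocateAndClick helper (used again in the validation example later in this post) could look like. This is a sketch; the coordinate-string parsing and the plugin fields are assumptions:

```csharp
// Sketch: chaining Locate + Mouse into one helper. The coordinate string
// format matches LocateElementInScreenshot above; the regex parsing and
// plugin fields are assumptions, not Cuata's actual code.
private async Task LocateAndClick(string searchText)
{
    string result = await _locatePlugin.LocateElementInScreenshot(searchText);

    // Expected: "Click at coordinates: {x}, {y} and the Screen width and height are: {w}, {h}"
    var numbers = System.Text.RegularExpressions.Regex.Matches(result, @"\d+");
    int x = int.Parse(numbers[0].Value);
    int y = int.Parse(numbers[1].Value);
    int screenWidth = int.Parse(numbers[2].Value);
    int screenHeight = int.Parse(numbers[3].Value);

    _mousePlugin.MoveMouse(x, y, screenWidth, screenHeight);
    _mousePlugin.LeftClick();
}
```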
Here’s how a typical task flows through Cuata’s brain:
Task: “Join my 2 PM meeting in Microsoft Teams”
Step 1 - Think: Semantic Kernel analyzes the task and decides:
- Need to open Teams calendar
- Find the 2 PM meeting
- Click the “Join” button
Step 2 - Select Strategy & Execute:
```
AI Decision: "Use Chrome Plugin to open Teams calendar"
→ Calls: ChromePlugin.OpenUrl("https://teams.microsoft.com/calendar")

AI Decision: "Wait for page to load, then locate '2 PM' meeting"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Is the Teams calendar loaded?")
→ Calls: LocatePlugin.LocateElementInScreenshot("2 PM Meeting")

AI Decision: "Click the meeting link"
→ Calls: MousePlugin.MoveMouse(x, y, screenWidth, screenHeight)
→ Calls: MousePlugin.LeftClick()

AI Decision: "Validate that meeting details opened"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Is the meeting details dialog visible?")

AI Decision: "Click the 'Join' button"
→ Calls: LocatePlugin.LocateElementInScreenshot("Join")
→ Calls: MousePlugin.MoveMouse(x, y, screenWidth, screenHeight)
→ Calls: MousePlugin.LeftClick()

AI Decision: "Validate that we're in the meeting"
→ Calls: ScreenshotPlugin.ValidateScreenshot("Are we in the Teams meeting?")
```
Each step validates the previous action before continuing. If something fails (e.g., “Join button not found”), the AI can retry, adjust the search term, or ask for help.
Many solutions use Playwright or browser-use MCP libraries for automation. Those are great for web-only scenarios, but Cuata needs to do more:
- Desktop Applications: Joining Teams meetings, opening Outlook, writing to Word
- Screen-Level Interaction: Clicking on desktop dialogs, system notifications, non-web UI
- Cross-Application Workflows: Copy from Chrome, paste into Word, send via Outlook
- Visual Validation: Screenshot analysis to confirm actions succeeded
Playwright can’t click outside the browser. Cuata operates at the OS level using:
- WindowsInput library for mouse/keyboard simulation
- Azure OCR for text location
- Azure OpenAI Vision for screen understanding
- Semantic Kernel for intelligent orchestration
This gives full low-level control over the entire desktop environment, not just browser tabs.
Underneath it all, Cuata leans on a handful of Azure services:
- Azure OpenAI (GPT-4 Turbo with Vision): analyzes screenshots, validates actions, and plans next steps; text embeddings convert screen content into searchable vectors
- Azure OCR: extracts text from screenshots with bounding-box coordinates, which is what lets the Locate Plugin find clickable elements
- Workflow orchestration: runs long-running workflows (e.g., meeting summarization) and handles multi-step processes with retries and checkpoints
- Azure Cosmos DB: stores meeting summaries, user preferences, and historical transcripts, enabling fast lookups for “What happened in my meetings this week?”
- Storage: keeps the screenshots taken during meetings and archives session recordings for later review
- Azure Service Bus: sends messages between Cuata components and triggers workflows when meetings start or when you return
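As one example, the trigger message published when a meeting starts might look like this with the Azure.Messaging.ServiceBus SDK; the connection string, queue name, and payload shape are placeholders:

```csharp
// Sketch: publishing a workflow trigger with Azure.Messaging.ServiceBus.
// Connection string, queue name, and payload are placeholders.
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");
ServiceBusSender sender = client.CreateSender("cuata-events");   // hypothetical queue

var message = new ServiceBusMessage("{\"event\":\"meeting-started\",\"meetingId\":\"2pm-standup\"}")
{
    ContentType = "application/json"
};
await sender.SendMessageAsync(message);
Console.WriteLine("📨 Meeting-started event published.");
```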
7. The Main Application Loop
When you run Cuata, it presents a menu:
```csharp
string[] options = new[]
{
    "💼 Teams App",
    "🌐 Browser App",
    "❌ Quit"
};

// User selects an option with arrow keys
switch (selectedOption)
{
    case "💼 Teams App":
        var teamsModule = _moduleFactory.GetModule("Teams");
        await teamsModule.ExecuteAsync();
        break;
    case "🌐 Browser App":
        var browserModule = _moduleFactory.GetModule("Browser");
        await browserModule.ExecuteAsync();
        break;
}
```
Each module (Teams, Browser) implements a specific workflow using the same plugin system. The ModuleFactory injects the Semantic Kernel instance with all registered plugins, and the module orchestrates the AI decision loop.
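Neither IModule nor ModuleFactory is shown in full here, so this is a sketch of what that contract might look like based on the menu code above; the type shapes and the workflow prompt are assumptions:

```csharp
// Sketch of the module contract implied by the menu code above.
// Interface, factory, and prompt are assumptions, not Cuata's actual types.
public interface IModule
{
    Task ExecuteAsync();
}

public class ModuleFactory
{
    private readonly Kernel _kernel;   // one Kernel, all plugins registered

    public ModuleFactory(Kernel kernel) => _kernel = kernel;

    public IModule GetModule(string name) => name switch
    {
        "Teams" => new TeamsModule(_kernel),
        // "Browser" case omitted for brevity
        _ => throw new ArgumentException($"Unknown module: {name}")
    };
}

public class TeamsModule : IModule
{
    private readonly Kernel _kernel;
    public TeamsModule(Kernel kernel) => _kernel = kernel;

    public async Task ExecuteAsync()
    {
        // The module owns the workflow prompt; the Kernel owns the plugins.
        var chat = _kernel.GetRequiredService<IChatCompletionService>();
        var reply = await chat.GetChatMessageContentAsync(
            "Join my next Teams meeting and summarize what I missed.",
            new OpenAIPromptExecutionSettings { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions },
            _kernel);
        Console.WriteLine(reply.Content);
    }
}
```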
After every action, Cuata validates the result. This is what makes it reliable.
```csharp
// Example: Click a button and validate
await LocateAndClick("Submit");

// Validate the action succeeded
var validation = await ScreenshotPlugin.ValidateScreenshot(
    "Is the form submitted? Look for confirmation message."
);

if (validation.Contains("not submitted"))
{
    // Retry logic
    Console.WriteLine("❌ Submission failed, retrying...");
    await LocateAndClick("Submit");
}
```
Without validation, automation breaks silently. With validation, Cuata can:
- Detect failures and retry
- Adjust its approach if the first attempt didn’t work
- Confirm success before moving to the next step
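A hardened version of that retry pattern might bound the attempts. This sketch reuses the hypothetical LocateAndClick helper from earlier; the screenshot plugin field and the yes/no success convention are assumptions:

```csharp
// Sketch: bounded retry around click + screenshot validation.
// 'LocateAndClick' is the helper sketched earlier; the success check is an assumption.
private async Task<bool> ClickWithValidationAsync(
    string target, string validationPrompt, int maxAttempts = 3)
{
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        await LocateAndClick(target);
        string verdict = await _screenshotPlugin.ValidateScreenshot(validationPrompt);

        if (verdict.Contains("yes", StringComparison.OrdinalIgnoreCase))
            return true;   // confirmed before moving on

        Console.WriteLine($"❌ Attempt {attempt} failed, retrying...");
    }
    return false;          // caller can adjust the approach or ask for help
}
```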
Looking back at the architecture diagram:
```
┌──────────────────────────────────────────────────────────────┐
│                         User / Cuata                         │
│             (Teams App, Browser App, OpenCV App)             │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                       AI Orchestration                       │
│     (Semantic Kernel: Think → Select Strategy → Execute)     │
│                                                              │
│        Plugins: Keyboard, Mouse, Locate, App Launcher        │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│               Service Bus (Messaging Ingestor)               │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                     Foundational Models                      │
│          (Azure OpenAI, Azure Cosmos DB, Azure OCR)          │
└──────────────────────────────────────────────────────────────┘
```
Everything flows through AI Orchestration. Plugins give Cuata capabilities, Semantic Kernel gives it intelligence, and Azure services give it memory and perception.
Traditional automation follows scripts. Cuata learns your context:
- Adaptive: If the UI changes, OCR finds new button locations
- Contextual: Screenshots provide visual understanding, not just DOM inspection
- Validated: Every action is confirmed before proceeding
- Conversational: You can ask Cuata to “Read this article and summarize it in Word”
It’s a buddy because it acts on your behalf when you’re not there. It doesn’t replace you; it fills in for those moments when you step away, ensuring you don’t miss anything important.
✅ Computer-Using Agents operate at the OS level, not just in browsers
✅ Think → Select Strategy → Execute creates adaptive, self-correcting behavior
✅ Plugin architecture keeps capabilities modular and testable
✅ Validation loops ensure reliability (screenshot analysis after every action)
✅ Azure OCR + OpenAI Vision enable visual understanding of screen state
✅ Semantic Kernel orchestrates everything, deciding which plugins to call and when
In the next blog, we’ll dive into the Teams Agent workflow (how Cuata joins meetings, transcribes discussions, and summarizes what you missed) and the Browser Agent workflow (how it browses websites, reads articles, and writes summaries into Word).
Cuata: Your Digital Twin Buddy 🤖
When you can’t be there, Cuata steps in.