Building a Computer-Using Agent - Your Digital Twin Buddy

1. What is Cuata?
I built Cuata (Spanish for “buddy”) to solve a simple problem: What if you need to step away from your computer during a meeting for just a minute? Someone rings the doorbell, you get an urgent call, or you need to grab coffee. You don’t want to miss what’s being discussed, but asking teammates to catch you up later feels awkward.
Cuata is your digital twin buddy that steps in for you. It watches your screen, listens to discussions, reads slides, clicks around when needed, and when you come back, gives you a quick summary of what you missed with screenshots of key slides.
But it evolved beyond meetings. The same technology lets Cuata browse websites, read articles, search for information, summarize content, and even write summaries directly into Microsoft Word—all while acting like a human operating your computer.
This architecture diagram shows how everything fits together. At the center is Semantic Kernel acting as the brain, coordinating plugins that let Cuata see, think, and act on your computer just like you would.
2. The Core Design: Think → Select Strategy → Execute
Most automation tools follow rigid scripts: “Click here, type that, scroll down.” They break when the UI changes or when context matters. Cuata doesn’t work that way.
Cuata follows a recursive decision-making loop:
- Think: Semantic Kernel analyzes the current screen state and the task goal
- Select Strategy: It picks the right plugin(s) to use (Mouse, Keyboard, Chrome, Locate, Screenshot)
- Execute: It performs the action and validates the result
- Repeat: If the task isn’t complete, go back to Think
This is what makes it a Computer-Using Agent, not just a script runner. It adapts, validates, and corrects itself.
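The original post shows this wiring in C# with Semantic Kernel; as a language-agnostic stand-in, here is a minimal Python model of the loop. The `PluginRegistry` and the fixed `plan` list are illustrative assumptions — in the real agent, the LLM chooses the next function itself.

```python
# Illustrative model of Think -> Select Strategy -> Execute.
# The planner (here a precomputed list) stands in for the LLM that
# Semantic Kernel consults on each iteration.

from typing import Callable

class PluginRegistry:
    """Holds named functions the 'brain' may invoke, like SK plugins."""
    def __init__(self) -> None:
        self.functions: dict[str, Callable[[], str]] = {}

    def register(self, name: str, fn: Callable[[], str]) -> None:
        self.functions[name] = fn

def run_task(registry: PluginRegistry, plan: list[str]) -> list[str]:
    """Walk the plan, resolving and executing one plugin call per step."""
    results = []
    for step in plan:                  # Think: what is the next action?
        fn = registry.functions[step]  # Select Strategy: pick the plugin
        results.append(fn())           # Execute (and record the result)
    return results

registry = PluginRegistry()
registry.register("screenshot.capture", lambda: "captured")
registry.register("mouse.left_click", lambda: "clicked")

print(run_task(registry, ["screenshot.capture", "mouse.left_click"]))
# -> ['captured', 'clicked']
```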
The ToolCallBehavior.AutoInvokeKernelFunctions setting tells Semantic Kernel: “You can call any plugin function you need to complete this task.” The AI decides which plugins to invoke based on the context.
3. The Plugin System: Cuata’s Hands and Eyes
Cuata has 10 plugins that give it the ability to interact with your computer:
🖱️ Mouse Plugin
The Mouse Plugin lets Cuata move the cursor, click, scroll, and drag—just like you would.
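The real plugin is C# with `[Description]` attributes and an input-simulation library; this Python sketch only models the shape of it. The `describe` decorator and the in-memory action log are assumptions for illustration — no actual cursor is moved.

```python
# Mouse-plugin sketch: each function carries a description the
# orchestrator can read, mirroring [Description] attributes in .NET.
# Real cursor movement is stubbed; calls just record what they would do.

def describe(text):
    """Attach a description string the orchestrator can inspect."""
    def wrap(fn):
        fn.description = text
        return fn
    return wrap

class MousePlugin:
    def __init__(self):
        self.log = []

    @describe("Move the cursor to absolute screen coordinates (x, y).")
    def move_mouse(self, x, y):
        self.log.append(f"move({x},{y})")

    @describe("Press and release the left mouse button.")
    def left_click(self):
        self.log.append("left_click")

mouse = MousePlugin()
mouse.move_mouse(640, 480)
mouse.left_click()
print(mouse.log)  # -> ['move(640,480)', 'left_click']
```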
Each function has a Description attribute that tells Semantic Kernel when to call it. The AI reads these descriptions and decides: “I need to click a button, so I’ll call MoveMouse and then LeftClick.”
⌨️ Keyboard Plugin
The Keyboard Plugin types text and presses keyboard shortcuts.
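A matching sketch for the keyboard side, again with OS-level key events replaced by an event trace. The function names `type_text` and `press_hotkey` are illustrative, not the project's actual API.

```python
# Keyboard-plugin sketch: in the real agent these calls would emit
# OS-level key events; here they build a trace so the logic is testable.

class KeyboardPlugin:
    def __init__(self):
        self.events = []

    def type_text(self, text: str) -> None:
        """Type a string, one key event per character."""
        self.events.extend(f"key:{c}" for c in text)

    def press_hotkey(self, *keys: str) -> None:
        """Press a chord such as Ctrl+V."""
        self.events.append("chord:" + "+".join(keys))

kb = KeyboardPlugin()
kb.type_text("hi")
kb.press_hotkey("ctrl", "v")
print(kb.events)  # -> ['key:h', 'key:i', 'chord:ctrl+v']
```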
Semantic Kernel uses these to fill forms, search for content, or navigate with shortcuts.
🌐 Chrome Plugin
The Chrome Plugin opens URLs, navigates tabs, and controls browser behavior.
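A sketch of the browser side: each function resolves to the command it would launch rather than launching it, so the logic stays testable. `--incognito` is a standard Chrome switch; the class shape is an assumption.

```python
# Chrome-plugin sketch: functions return the command they would run
# instead of spawning the browser, keeping the logic inspectable.

class ChromePlugin:
    def open_url(self, url: str) -> list[str]:
        """Command to open a URL in the browser."""
        return ["chrome", url]

    def open_incognito(self, url: str) -> list[str]:
        """Command to open a URL in a private window."""
        return ["chrome", "--incognito", url]

chrome = ChromePlugin()
print(chrome.open_url("https://example.com"))
# -> ['chrome', 'https://example.com']
```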
These functions let Cuata browse the web autonomously—opening links, navigating history, refreshing pages.
📸 Screenshot Plugin
The Screenshot Plugin captures the screen and asks Azure OpenAI to analyze it. This is how Cuata “sees” what’s on the screen.
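The capture and the Azure OpenAI vision call themselves are stubbed out here; the interesting, testable piece is the validation question that travels with the screenshot. The prompt wording below is an assumption, not the project's exact template.

```python
# Screenshot-plugin sketch: only the prompt builder is shown; the
# actual screen capture and vision call are performed by Azure services.

def build_validation_prompt(action: str, expectation: str) -> str:
    """Compose the question Cuata asks about a fresh screenshot."""
    return (
        f"I just performed this action: {action}. "
        f"Looking at the attached screenshot, is it true that {expectation}? "
        "Answer yes or no, with a one-line reason."
    )

prompt = build_validation_prompt(
    "clicked the 'Join' button",
    "the meeting pre-join dialog is open",
)
print(prompt)
```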
After clicking a button, Cuata takes a screenshot and asks: “Did the button click work? Is the expected dialog open?” This creates a validation loop that ensures actions succeeded.
📍 Locate Plugin
The Locate Plugin is the cleverest of the bunch. It uses Azure OCR to extract all text from the screen, find the coordinates of a specific text element, and tell the Mouse Plugin exactly where to click.
This is how Cuata clicks buttons without hardcoded coordinates. Semantic Kernel says: “Click the ‘Join Meeting’ button,” and the Locate Plugin:
- Takes a screenshot
- Sends it to Azure OCR
- Finds the text “Join Meeting”
- Returns the exact coordinates
- Calls Mouse Plugin to click at those coordinates
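The coordinate lookup at the heart of those steps can be sketched in a few lines. The word-plus-bounding-box records below are a simplified stand-in for what Azure OCR conceptually returns; the real response format is richer.

```python
# Locate-plugin sketch: given OCR output (words with bounding boxes),
# find a target word and return the click point at its box centre.

def locate(ocr_words, target: str):
    """Return the (x, y) centre of the first OCR word matching target."""
    for word in ocr_words:
        if word["text"].lower() == target.lower():
            x1, y1, x2, y2 = word["box"]     # top-left / bottom-right
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    return None                              # not found: caller retries

ocr = [
    {"text": "Cancel", "box": (100, 500, 180, 530)},
    {"text": "Join",   "box": (300, 500, 360, 530)},
]
print(locate(ocr, "join"))  # -> (330, 515)
```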
4. The AI Orchestration Loop
Here’s how a typical task flows through Cuata’s brain:
Task: “Join my 2 PM meeting in Microsoft Teams”
Step 1 - Think: Semantic Kernel analyzes the task and decides:
- Need to open Teams calendar
- Find the 2 PM meeting
- Click the “Join” button
Step 2 - Select Strategy & Execute:
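A hedged sketch of what that execution step might look like as a linear plan of plugin calls (validation between steps elided). The plugin names and the fake implementations are assumptions; in the real agent, Semantic Kernel issues these calls itself.

```python
# Step-2 sketch: a fixed plan of plugin calls for joining the meeting.
# Each entry in `plugins` stands in for a real plugin function.

def join_meeting(plugins) -> list[str]:
    trace = []
    trace.append(plugins["chrome"]("https://teams.microsoft.com"))  # open Teams
    trace.append(plugins["locate"]("2:00 PM"))   # find the calendar slot
    trace.append(plugins["click"]())             # open the meeting
    trace.append(plugins["locate"]("Join"))      # find the Join button
    trace.append(plugins["click"]())             # join
    return trace

fake = {
    "chrome": lambda url: f"opened {url}",
    "locate": lambda text: f"located {text}",
    "click":  lambda: "clicked",
}
print(join_meeting(fake))
```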
Each step validates the previous action before continuing. If something fails (e.g., “Join button not found”), the AI can retry, adjust the search term, or ask for help.
5. Why Not Just Use Playwright or Browser Automation?
Many solutions use Playwright or browser-use MCP libraries for automation. Those are great for web-only scenarios, but Cuata needs to do more:
- Desktop Applications: Joining Teams meetings, opening Outlook, writing to Word
- Screen-Level Interaction: Clicking on desktop dialogs, system notifications, non-web UI
- Cross-Application Workflows: Copy from Chrome, paste into Word, send via Outlook
- Visual Validation: Screenshot analysis to confirm actions succeeded
Playwright can’t click outside the browser. Cuata operates at the OS level using:
- WindowsInput library for mouse/keyboard simulation
- Azure OCR for text location
- Azure OpenAI Vision for screen understanding
- Semantic Kernel for intelligent orchestration
This gives full low-level control over the entire desktop environment, not just browser tabs.
6. The Azure Services Behind Cuata
🗣️ Azure OpenAI
- GPT-4 Turbo with Vision: Analyzes screenshots, validates actions, plans next steps
- Text Embedding: Converts screen content into searchable vectors
🔍 Azure Cognitive Services Vision (OCR)
- Extracts text from screenshots with bounding box coordinates
- Enables the Locate Plugin to find clickable elements
📂 Azure Durable Functions
- Orchestrates long-running workflows (e.g., meeting summarization)
- Handles multi-step processes with retries and checkpoints
🛢️ Azure Cosmos DB
- Stores meeting summaries, user preferences, historical transcripts
- Enables fast lookups for “What happened in my meetings this week?”
📦 Azure Blob Storage
- Stores screenshots taken during meetings
- Archives session recordings for later review
📨 Azure Service Bus
- Sends messages between Cuata components
- Triggers workflows when meetings start or when user returns
7. The Main Application Loop
When you run Cuata, it presents a menu:
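A sketch of that startup menu: each entry maps to a module, so adding a new workflow is one dictionary entry. The menu labels below are illustrative; only the Teams and Browser module names come from the post.

```python
# Startup-menu sketch: entries map keys to module labels, so the menu
# and the module table stay in sync by construction.

MODULES = {
    "1": "Teams Agent - watch a meeting and summarize it",
    "2": "Browser Agent - browse, read, and summarize",
    "q": "Quit",
}

def render_menu() -> str:
    """Build the text menu shown when Cuata starts."""
    lines = ["What should Cuata do?"]
    for key, label in MODULES.items():
        lines.append(f"  [{key}] {label}")
    return "\n".join(lines)

print(render_menu())
```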
Each module (Teams, Browser) implements a specific workflow using the same plugin system. The ModuleFactory injects the Semantic Kernel instance with all registered plugins, and the module orchestrates the AI decision loop.
8. Validation: The Secret Sauce
After every action, Cuata validates the result. This is what makes it reliable.
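The act-then-verify pattern can be sketched as a bounded retry loop. Here `act` and `looks_ok` are stand-ins for a plugin call and the screenshot-plus-vision check; the retry count is an arbitrary assumption.

```python
# Validation-loop sketch: act, verify, retry a bounded number of times.

def act_with_validation(act, looks_ok, retries: int = 3) -> bool:
    """Run `act`, confirm via `looks_ok`, retry on failure."""
    for _ in range(retries):
        act()
        if looks_ok():      # screenshot + vision analysis in reality
            return True     # confirmed: safe to move to the next step
    return False            # give up and surface the failure

attempts = []
def flaky_click():
    attempts.append(1)

# Succeeds on the second attempt once two clicks have landed.
print(act_with_validation(flaky_click, lambda: len(attempts) >= 2))  # -> True
```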
Without validation, automation breaks silently. With validation, Cuata can:
- Detect failures and retry
- Adjust its approach if the first attempt didn’t work
- Confirm success before moving to the next step
9. The Architecture at a Glance
Looking back at the architecture diagram:
Everything flows through AI Orchestration. Plugins give Cuata capabilities, Semantic Kernel gives it intelligence, and Azure services give it memory and perception.
10. What Makes This a “Digital Twin Buddy”?
Traditional automation follows scripts. Cuata learns your context:
- Adaptive: If the UI changes, OCR finds new button locations
- Contextual: Screenshots provide visual understanding, not just DOM inspection
- Validated: Every action is confirmed before proceeding
- Conversational: You can ask Cuata to “Read this article and summarize it in Word”
It’s a buddy because it acts on your behalf when you’re not there. It doesn’t replace you—it fills in for those moments when you step away, ensuring you don’t miss anything important.
11. Key Takeaways
✅ Computer-Using Agents operate at the OS level, not just in browsers
✅ Think → Select Strategy → Execute creates adaptive, self-correcting behavior
✅ Plugin architecture keeps capabilities modular and testable
✅ Validation loops ensure reliability (screenshot analysis after every action)
✅ Azure OCR + OpenAI Vision enable visual understanding of screen state
✅ Semantic Kernel orchestrates everything, deciding which plugins to call and when
In the next blog, we’ll dive into the Teams Agent workflow (how Cuata joins meetings, transcribes discussions, and summarizes what you missed) and the Browser Agent workflow (how it browses websites, reads articles, and writes summaries into Word).
Cuata: Your Digital Twin Buddy 🤖
When you can’t be there, Cuata steps in.
