Computer Use

An LLM's ability to control a desktop environment through screenshots and mouse and keyboard input.

In common use since 2024

Computer Use is the capability that lets an LLM control a desktop environment by looking at screenshots, reasoning about what it sees, and emitting mouse and keyboard actions. Anthropic introduced the term with its late-2024 release of Claude Computer Use; OpenAI shipped Operator with similar capabilities, and Google built comparable features into Gemini and Project Mariner. By 2026 Computer Use is a mainstream agent capability with active production deployments.

How it works under the hood:

  1. The agent runtime takes a screenshot of the controlled environment (a sandboxed Linux desktop, a virtual browser, sometimes the user's actual machine with consent).
  2. The screenshot is sent to a vision-capable LLM along with the user's goal and the conversation history.
  3. The LLM emits an action: "click at coordinates (340, 220)", "type 'invoice 2026-04'", "scroll down 500px", "open new tab".
  4. The runtime executes the action and takes another screenshot.
  5. Loop until the goal is achieved or the agent gives up.
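
The five steps above can be sketched as a small control loop. This is a minimal illustration, not any vendor's API: `Action`, `run_agent`, and the three callables (`take_screenshot`, `choose_action`, `execute`) are hypothetical names, and the model is replaced by a scripted stand-in so the example runs offline. The `max_steps` budget is an added assumption that covers the "agent gives up" case.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str                      # "click", "type", "scroll", "done", "give_up"
    args: dict = field(default_factory=dict)


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> Tuple[str, List[Action]]:
    """Screenshot -> model -> action loop with a hard step budget."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                       # step 1
        action = choose_action(goal, screenshot, history)    # steps 2-3
        history.append(action)
        if action.kind in ("done", "give_up"):               # step 5
            return action.kind, history
        execute(action)                                      # step 4
    return "give_up", history                                # budget exhausted


# Stand-ins for a real environment and model (assumptions for the demo).
scripted = iter([
    Action("click", {"x": 340, "y": 220}),
    Action("type", {"text": "invoice 2026-04"}),
    Action("done"),
])
executed: List[Action] = []

status, history = run_agent(
    goal="find the April invoice",
    take_screenshot=lambda: b"fake-png-bytes",
    choose_action=lambda goal, shot, hist: next(scripted),
    execute=executed.append,
)
print(status, len(executed))  # done 2
```

In a real deployment `choose_action` would wrap a call to a vision-capable model and `execute` would drive the sandboxed desktop or browser; the loop shape stays the same.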

The 2026 use cases that have stabilised:

  • Workflow automation in legacy systems — controlling old enterprise software that has no API.
  • Form filling — government, vendor, healthcare, expense reports.
  • Personal task automation — book travel, manage subscriptions, schedule appointments.
  • End-to-end testing — natural-language test scenarios driven through a real browser.
  • Customer support escalation — agents that can navigate internal tools the way a human support rep would.
  • Research and data collection — sites that block traditional scraping but allow browser sessions.

The hard problems:

  • Reliability — visual reasoning is fragile. UI changes, modals, loading states and pop-ups all break naive agents. Production-grade Computer Use needs robust retry, recovery and human handoff.
  • Speed — every action requires a vision LLM call; minute-scale workflows are typical. Background execution is more common than real-time.
  • Cost — vision tokens are expensive; long sessions of dozens of screenshots add up to dollars.
  • Security — an agent that can click anywhere can also click "delete" anywhere. Sandboxing and permission boundaries are non-negotiable.
  • Prompt injection — malicious content visible on screen can hijack the agent's behaviour, just as text-based prompt injection does in chat.
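
As one illustration of the retry, recovery and human handoff mentioned under reliability, the sketch below retries an action until a verification check passes and escalates to a person once a retry budget is spent. The names `act_with_retry` and `NeedsHuman` are hypothetical, and the flaky UI is simulated; a real `verify` would typically inspect a fresh screenshot.

```python
import time
from typing import Callable


class NeedsHuman(Exception):
    """Raised when automated recovery is exhausted and a person should take over."""


def act_with_retry(
    attempt: Callable[[], None],
    verify: Callable[[], bool],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> int:
    """Run attempt() until verify() passes; return the number of attempts used."""
    for n in range(1, max_attempts + 1):
        attempt()
        if verify():
            return n
        if n < max_attempts:
            time.sleep(base_delay * 2 ** (n - 1))   # exponential backoff
    raise NeedsHuman(f"still failing after {max_attempts} attempts")


# Simulated flaky UI (assumption): a modal swallows the first two clicks.
ui = {"clicks": 0}


def click_save() -> None:
    ui["clicks"] += 1


def save_succeeded() -> bool:
    return ui["clicks"] >= 3


attempts_used = act_with_retry(click_save, save_succeeded)
print(attempts_used)  # 3

# An action that never succeeds ends in a handoff instead of looping forever.
try:
    act_with_retry(lambda: None, lambda: False, max_attempts=2)
    handed_off = False
except NeedsHuman:
    handed_off = True
print(handed_off)  # True
```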

For a US team in 2026, Computer Use is appropriate when there is no better alternative and the task is consequential enough to justify the cost. APIs beat browser automation; browser automation beats Computer Use. But for the tasks where APIs and DOM-based automation both fail, Computer Use has earned its place as a real production capability rather than a research curiosity. The mature deployments treat the agent as a sandboxed worker with a checkpoint-and-approval gate before any irreversible action.
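
A checkpoint-and-approval gate of the kind described above might look like the following sketch. Everything here is an assumption for illustration: the `IRREVERSIBLE` set, the action names, and the `approve` callback, which in practice would pause the run and ask a human reviewer.

```python
from typing import Callable, FrozenSet, List

# Hypothetical action names that must never run without sign-off.
IRREVERSIBLE: FrozenSet[str] = frozenset(
    {"delete_record", "submit_payment", "send_email"}
)


def gated_execute(
    action: str,
    execute: Callable[[str], None],
    approve: Callable[[str], bool],
    irreversible: FrozenSet[str] = IRREVERSIBLE,
) -> bool:
    """Execute reversible actions directly; irreversible ones only if approved."""
    if action in irreversible and not approve(action):
        return False                # blocked at the checkpoint
    execute(action)
    return True


# Demo: the reviewer approves the payment but refuses the delete.
log: List[str] = []
reviewer = lambda action: action != "delete_record"

results = [
    gated_execute(a, log.append, reviewer)
    for a in ["open_form", "submit_payment", "delete_record"]
]
print(results)  # [True, True, False]
print(log)      # ['open_form', 'submit_payment']
```

Keeping the gate outside the model means that a hijacked or confused agent still cannot complete an irreversible action on its own.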
