API commands for Claude Sonnet 3.5 enable Anthropic’s large language model to operate desktop apps much like humans do. Be cautious, though: It’s a work in progress.
What’s new: Anthropic launched API commands for computer use. The new commands enable Claude Sonnet 3.5 to translate natural-language instructions into actions that open applications, fetch data from local files, complete forms, and the like. (In addition, Anthropic improved Claude Sonnet 3.5 to achieve a state-of-the-art score on the SWE-bench Verified coding benchmark and announced the faster, cheaper Claude Haiku 3.5, which likewise shows exceptional performance on coding tasks.)
How it works: The commands for computer use don’t cost extra on a per-token basis, but they may require up to 1,200 additional tokens per request, and the model runs repeatedly until the task at hand is accomplished, consuming more input tokens with each step. They’re available via Anthropic’s API, Amazon Bedrock, and Google Cloud’s Vertex AI.
- Claude Sonnet 3.5 can call three new tools: Computer (which defines a computer’s screen resolution and offers access to its keyboard, mouse, and applications), Text Editor, and Bash (a shell that runs command-line programs in various languages). For example, the model can compose Python scripts in the text editor, run them in Bash, and store outputs in a spreadsheet.
- The model tracks a computer’s state by taking screenshots. This enables it to see, for example, the contents of a spreadsheet and respond to changes such as the arrival of an email. It estimates pixel coordinates to move the cursor, click, and enter text accordingly. An agentic loop prompts it to execute actions, observe the results, and change or correct its own behavior until it completes the task at hand (see the sketch that follows this list).
- On OSWorld, a benchmark that evaluates AI models' abilities to use computers, Claude Sonnet 3.5 succeeded at about 15 percent of tasks when given 15 attempts. Cradle, the next-best system, achieved about 8 percent, and GPT-4V achieved about 7.5 percent. Human users typically complete about 72 percent.
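To make the loop concrete, here’s a minimal sketch using Anthropic’s Python SDK, based on the tool versions and beta flag documented at launch. The `execute_action` helper is a hypothetical stand-in for the glue code that actually moves the mouse, runs commands, and captures screenshots inside a sandbox.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Tool definitions as documented for the October 2024 computer-use beta.
TOOLS = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,  # tell the model the screen resolution
        "display_height_px": 768,
    },
    {"type": "text_editor_20241022", "name": "str_replace_editor"},
    {"type": "bash_20241022", "name": "bash"},
]


def execute_action(name: str, tool_input: dict) -> list:
    """Hypothetical: perform the requested action in a sandboxed VM and return
    content blocks, e.g. an image block holding a fresh screenshot."""
    raise NotImplementedError


messages = [{"role": "user", "content": "Open sales.csv and total column B."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
        betas=["computer-use-2024-10-22"],
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # no more actions requested; the model considers the task done

    # Execute each requested action in the sandbox and report the results
    # so the model can observe the outcome and self-correct on the next turn.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_action(block.name, block.input),
            })
    messages.append({"role": "user", "content": results})
```

Returning a fresh screenshot in each tool result is what lets the model observe the effect of its last action and adjust its next one.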
Yes, but: The current version of computer use is experimental, and Anthropic acknowledges various limitations. The company strongly recommends using these commands only in a sandboxed environment, such as a Docker container, with limited access to the computer’s hard drive and the web to protect sensitive data and core system files. Anthropic restricts the ability to create online accounts or post to social media or other sites (but says it may lift this restriction in the future).
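Beyond a sandbox, developers can add their own checks between the model and the machine. As one illustration (not an official Anthropic mechanism), a hypothetical wrapper like the one below would refuse any bash command the model requests that isn’t on an allowlist:

```python
# Hypothetical guardrail: screen the model's bash requests before execution.
ALLOWED_PREFIXES = ("ls", "cat", "python3")  # illustrative allowlist


def screened_bash(command: str) -> str:
    """Run an allowlisted command in the sandbox; refuse everything else."""
    if not command.strip().startswith(ALLOWED_PREFIXES):
        return f"Refused: '{command}' is not on the allowlist."
    return run_in_sandbox(command)  # hypothetical sandboxed executor
```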
Behind the news: Several companies have been racing to build models that can control desktop applications. Microsoft researchers recently released OmniParser, a tool that works with GPT-4V to identify user-interface elements like windows and buttons within screenshots, potentially making it easier for agentic workflows to navigate computers. In July, Amazon hired staff and leaders from Adept, a startup that trained models to operate computer applications. (Disclosure: Andrew Ng sits on Amazon’s board of directors.) Open Interpreter is an open-source project that likewise uses a large language model to control local applications like image editors and web browsers.
Why it matters: Large multimodal models already use external tools like search engines, web browsers, calculators, calendars, databases, and email. Giving them control over a computer’s visual user interface may enable them to automate a wider range of tasks we use computers to perform, such as creating lesson plans and — more worrisome — taking academic tests.
We’re thinking: Controlling computers remains hard. For instance, using AI to read a screenshot and pick the right action to take next is very challenging. However, we’re confident that this capability will be a growth area for agentic workflows in coming years.