Training for Computer Use

UI-TARS shows strong computer use capabilities in benchmarks

Flowchart illustrating the automation of opening, editing, and saving a Word document using PyAutoGUI.

As Anthropic, Google, OpenAI, and others roll out agents that are capable of computer use, new work shows how underlying models can be trained to do this.

What’s new: Yujia Qin and colleagues at ByteDance and Tsinghua University introduced UI-TARS, a fine-tuned version of the vision-language model Qwen2-VL that uses lines of reasoning to decide which mouse clicks, keyboard presses, and other actions to take in desktop and mobile apps. The model’s weights are licensed freely for commercial and noncommercial uses via Apache 2.0 and are available for download.

  • The authors added chains of thought (CoTs) to their training set of screenshots and actions by prompting an unspecified vision-language model to explain the current action given previous screenshots, actions, and generated CoTs. Because this process sometimes produced bad explanations, they also generated multiple CoTs and actions for a given screenshot and kept only a CoT that led to the correct action (a minimal sketch of this rejection sampling appears after this list).
  • They fine-tuned UI-TARS to generate a CoT and action from an instruction (such as “Open the document and add the text ‘hello’”) plus the screenshots, CoTs, and actions so far.
  • They ran UI-TARS within a virtual PC, generating a large number of screenshots, CoTs, and actions. They filtered out erroneous CoTs and actions by applying rules (such as removing outputs that included redundant actions), scoring outputs automatically and removing low scorers, and reviewing the remainder manually. They fine-tuned the model on the outputs that survived, then repeatedly generated, filtered, and fine-tuned (the second sketch below outlines this loop).
  • They also fine-tuned the model on corrected examples of erroneous CoTs and actions. Human annotators revised each erroneous CoT and action in two ways: (a) to avoid the error in the first place and (b) to fix the error after it occurred.
  • Finally, they fine-tuned the model using Direct Preference Optimization (DPO) to prefer generating the corrected examples over the erroneous examples from the previous step (the third sketch below shows the standard DPO objective).
  • At inference, given an instruction, a screenshot, and a list of potential actions (as is typical of open computer use models; the authors provide a handy list in a sample prompt), UI-TARS generated a CoT and an action to take. After executing that action via PyAutoGUI, a Python module that controls mouse and keyboard, the model received a new screenshot and generated the next CoT and action, at each step taking into account the instruction and all screenshots, CoTs, and actions so far (a bare-bones version of this loop closes the sketches below).
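
The rejection sampling in the first bullet can be sketched in a few lines of Python. This is a schematic of the idea, not the authors’ code: annotate stands in for the unspecified annotator model, and the action format is an assumption.

```python
# Hypothetical sketch of the CoT annotation step: sample several candidate
# (thought, action) pairs from an annotator model and keep a thought whose
# predicted action matches the ground-truth action recorded in the trace.

def best_cot(annotate, history, true_action, n_samples=8):
    """Return a CoT that leads to the correct action, or None to discard."""
    for _ in range(n_samples):
        cot, predicted = annotate(history)  # annotator proposes (thought, action)
        if actions_match(predicted, true_action):
            return cot  # keep only thoughts that justify the right action
    return None

def actions_match(a, b, tol=5):
    """Loose equality: same action type; click coordinates within tol pixels."""
    if a["type"] != b["type"]:
        return False
    if a["type"] == "click":
        return abs(a["x"] - b["x"]) <= tol and abs(a["y"] - b["y"]) <= tol
    return a.get("value") == b.get("value")
```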
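
The generate-filter-fine-tune cycle in the third bullet follows the same pattern as other self-improvement loops. In this schematic, every helper (roll_out, passes_rules, score, fine_tune) is a hypothetical stand-in for machinery the paper describes only at a high level.

```python
# Schematic self-improvement loop: roll out the current model in a virtual
# PC, keep trajectories that pass rule-based and score-based filters, and
# fine-tune on the survivors. The manual review the authors also performed
# is omitted here.

def self_improvement_loop(model, tasks, roll_out, passes_rules, score,
                          fine_tune, rounds=3, threshold=0.8):
    for _ in range(rounds):
        trajectories = [roll_out(model, task) for task in tasks]
        kept = [
            t for t in trajectories
            if passes_rules(t)           # e.g., no redundant repeated actions
            and score(t) >= threshold    # drop low-scoring outputs
        ]
        model = fine_tune(model, kept)   # train on filtered traces, then repeat
    return model
```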
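
The DPO step uses the standard preference objective: push up the likelihood of corrected (CoT, action) sequences relative to erroneous ones, measured against a frozen reference model. Below is the usual loss in PyTorch; beta=0.1 is a conventional default, not a value reported by the authors.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over batches of sequence log-probabilities.

    "Chosen" sequences are the human-corrected CoTs and actions;
    "rejected" sequences are the original erroneous ones.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref, corrected
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref, erroneous
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```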
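
The inference loop maps directly onto PyAutoGUI calls. In this bare-bones version, query_model is a hypothetical stand-in for UI-TARS, and the action schema is simplified; the model’s real action space is the one listed in its sample prompt.

```python
import pyautogui

def run_agent(query_model, instruction, max_steps=15):
    """Repeatedly screenshot, ask the model for a (CoT, action), execute it."""
    history = []  # all (screenshot, CoT, action) triples so far
    for _ in range(max_steps):
        shot = pyautogui.screenshot()  # capture the current screen
        cot, action = query_model(instruction, history, shot)
        history.append((shot, cot, action))
        if action["type"] == "finished":
            break
        elif action["type"] == "click":
            pyautogui.click(action["x"], action["y"])  # mouse click
        elif action["type"] == "type":
            pyautogui.typewrite(action["text"])        # keyboard input
        elif action["type"] == "hotkey":
            pyautogui.hotkey(*action["keys"])          # e.g., ("ctrl", "s")
    return history
```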

Behind the news: Adept touted computer use in early 2022, and projects such as OmniParser and Aguvis have since provided practical implementations. In October 2024, Anthropic set off the current wave of model/app interaction with its announcement of computer use for Claude 3.5 Sonnet. OpenAI recently responded with Operator, its own foray into using vision and language models to control computers.

Results: UI-TARS matched or outperformed Claude 3.5 Sonnet with computer use, GPT-4o with various computer use frameworks, and the Aguvis framework with its native model across 11 benchmarks. On OSWorld, which asks models to perform tasks using a variety of real-world applications and operating systems, UI-TARS successfully completed 22.7 percent of tasks within a 15-step budget, whereas Claude 3.5 Sonnet with computer use completed 14.9 percent, GPT-4o with Aguvis 17.0 percent, and Aguvis with its native model 10.3 percent.

Why it matters: Training a model to take good actions enables it to perform well. Training it to correct its mistakes after making them enables it to recover from unexpected issues that may occur in the real world.

We’re thinking: Since computer use can be simulated in a virtual machine, it’s possible to generate massive amounts of training data automatically. This is bound to spur rapid progress in computer use by large language models.
