Browsing the web to achieve a specific goal can be challenging for agents based on large language models and even for vision-language models that can process onscreen images of a browser. While some approaches address this difficulty in training the underlying model, the agent architecture can also make a difference.
What’s new: Jing Yu Koh and colleagues at Carnegie Mellon University introduced tree search for language model agents, a method that allows agents to treat web interactions like tree searches. In this way, agents can explore possible chains of actions and avoid repeating mistakes.
Key insight: Some web tasks, for instance finding a price of a particular item, require a chain of intermediate actions: navigating to the right page, scrolling to find the item, matching an image of the item to the image on the page, and so on. If an agent clicks the wrong link during this process, it might lose its way. The ability to evaluate possible actions and remember previous states of web pages can help an agent correct its mistakes and choose a chain of actions that achieves its goal.
How it works: An agent based on GPT-4o attempted 200 tasks using website mockups that mimicked an online retail store, Reddit-like forum, and directory of classified ads. The tasks included ordering an item to be delivered to a given address, finding specific images on the forum, and posting an ad. The authors annotated each web page using the method called Set of Mark, which identifies every visual element capable of interaction with a bounding box and a numerical ID.
- The agent started with a web page and an instruction such as, “Tell me the number of reviews our store received that mention the term ‘not useful.’” It passed an image of the page to the LLM, which predicted five actions that could make progress toward completing the task such as scrolling up or down, hovering over an element, clicking, typing in a text field, or opening a new URL.
- The agent executed the five actions. After each one, the LLM assessed the current state of the page using the previous states as context. The assessment assigned a value between 0 and 1 (meaning the task was complete). The agent kept a list of page states and their values.
- The agent selected the web page state with the highest value after executing the five actions, and repeated the process, making a new set of five predictions based on the highest-value state.
- This process is a search: The agent executed a chain of actions until the value of the new states dropped below the values of other states. If all new states had lower values, the agent backtracked to a previous state with a higher value and asked the LLM for five more actions. The search stopped when the agent had completed the task or explored 20 possible states.
Results: The authors compared two agents, one that followed their search method and another that started at the same page and received the same instruction but took one action per state and never backtracked. The agents attempted 100 shopping tasks, 50 forum tasks, and 50 classified-ads tasks. The one equipped to search successfully completed 26.4 percent of the tasks, while the other agent completed 18.9 percent of the tasks.
Why it matters: Search joins reflection, planning, tool use, and multi-agent collaboration as an emerging agentic design pattern. Following many branching paths of actions enables an agent to determine the most effective set of actions to accomplish a task.
We’re thinking: Agentic design patterns are progressing quickly! In combination with computer use, this sort of search method may enable agents to execute a wide variety of desktop tasks.