Tool use and planning are key behaviors in agentic workflows that enable large language models (LLMs) to execute complex sequences of steps. New benchmarks measure these capabilities in common workplace tasks.
What’s new: Recent benchmarks gauge an LLM’s ability to use external tools to manipulate corporate databases and to plan events such as travel and meetings.
Tool use: Olly Styles, Sam Miller, and colleagues at MindsDB, University of Warwick, and University of Glasgow proposed WorkBench, which tests an LLM’s ability to use 26 software tools to operate on five simulated workplace databases: email, calendar, web analytics, projects, and customer relationship management. Tools include deleting emails, looking up calendar events, creating graphs, and looking up tasks in a to-do list.
- The benchmark includes 690 problems that require using between zero and 12 tools to solve. It evaluates each example by whether the databases changed as expected after the final tool call (rather than by whether particular tools were used, as in earlier work). This way, a model can use tools in any order, revise choices that prove unproductive, and still receive credit for a correct response.
- Upon receiving a problem, models are given a list of all tools and an example of how to use each one. Following the ReAct prompting strategy, they’re asked first to reason about the problem and then call a tool. After receiving the tool’s output (typically either information or an error message), they reason again and choose another tool. The cycle of reasoning, tool selection, and observing output repeats until the model decides it doesn’t need another tool (a minimal sketch of this loop follows the list).
- The authors evaluated GPT-4, GPT-3.5, Claude 2, Llama 2 70B, and Mixtral 8x7B. GPT-4 performed best by a large margin: It modified the databases correctly 43 percent of the time. The closest competitor, Claude 2, did so 26 percent of the time.
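To make the setup concrete, here is a minimal sketch of a ReAct-style loop with outcome-based scoring. The helper names (call_llm, the tool registry, and the JSON action format) are assumptions for illustration, not the WorkBench authors’ actual harness.

```python
import json
from typing import Callable, Dict

def run_agent(task: str,
              call_llm: Callable[[str], str],
              tools: Dict[str, Callable[..., str]],
              max_steps: int = 20) -> list:
    """ReAct-style loop: reason, call a tool, observe, repeat until the model stops."""
    transcript = [f"Task: {task}", "Tools: " + ", ".join(tools)]
    for _ in range(max_steps):
        reply = call_llm("\n".join(transcript) + "\nThought, then Action (JSON) or FINISH:")
        transcript.append(reply)
        if "FINISH" in reply:                        # model decides no further tool is needed
            break
        # Assumed reply format: ... Action: {"tool": "delete_email", "args": {...}}
        action = json.loads(reply.split("Action:", 1)[1])
        observation = tools[action["tool"]](**action["args"])  # returns data or an error message
        transcript.append(f"Observation: {observation}")
    return transcript

def outcome_correct(final_databases: dict, expected_databases: dict) -> bool:
    """WorkBench-style scoring: compare the final database state, not the sequence of calls."""
    return final_databases == expected_databases
```

Because scoring looks only at the end state, any sequence of tool calls that leaves the databases as expected counts as correct.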
Planning: Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, and colleagues at Google published Natural Plan, a benchmark that evaluates an LLM’s ability to (i) plan trips, (ii) arrange a series of meeting times and locations, and (iii) schedule a group meeting. Each example has only one solution.
- The benchmark includes 1,600 prompts that ask the model to plan a trip based on an itinerary of cities, time to be spent in each city, total duration of the trip, days when other people are available to meet, and available flights between cities. (A sketch of checking a candidate plan against such constraints follows the list.)
- 1,000 prompts ask the model to plan a schedule to meet as many people as possible. The prompts include places, times when people will be in each place, and how long it takes to drive from one place to another.
- 1,000 prompts ask the model, given the existing schedules of a number of people, to find a good time for them to meet.
- The authors tested GPT-3.5, GPT-4, GPT-4o, Gemini 1.5 Flash, and Gemini 1.5 Pro using five-shot prompts (that is, providing five examples for context). Gemini 1.5 Pro achieved the highest scores on planning trips (34.8 percent) and scheduling group meetings (48.9 percent). GPT-4 ranked second for planning trips (31.1 percent), and GPT-4o ranked second for scheduling group meetings (43.7 percent). GPT-4 led in arranging meetings (47 percent), followed by GPT-4o (45.2 percent).
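Because each example has a single correct answer, a proposed plan can be checked deterministically against the prompt’s constraints. Below is a minimal sketch for the trip-planning task; the data model (Visit, required_days, direct_flights) is an assumption for illustration and simplifies the real benchmark (for example, it omits meeting-availability constraints and doesn’t count travel days).

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class Visit:
    city: str
    days: int  # days to spend in this city

def plan_is_valid(plan: List[Visit],
                  required_days: Dict[str, int],           # city -> days that must be spent there
                  total_days: int,
                  direct_flights: Set[Tuple[str, str]]) -> bool:
    """Check an itinerary against the prompt's hard constraints."""
    # Every required city gets exactly the requested number of days.
    if {v.city: v.days for v in plan} != required_days:
        return False
    # The whole trip fits the stated duration.
    if sum(v.days for v in plan) != total_days:
        return False
    # Each leg uses an available direct flight.
    return all((a.city, b.city) in direct_flights
               for a, b in zip(plan, plan[1:]))

# Example: a three-city itinerary checked against its constraints.
plan = [Visit("Paris", 3), Visit("Rome", 2), Visit("Athens", 2)]
print(plan_is_valid(plan,
                    required_days={"Paris": 3, "Rome": 2, "Athens": 2},
                    total_days=7,
                    direct_flights={("Paris", "Rome"), ("Rome", "Athens")}))
```

The same pattern, enumerating the hard constraints and checking each one, extends to the meeting-arrangement and group-scheduling tasks.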
Why it matters: When building agentic workflows, developers must choose among LLMs, prompting strategies, sequences of steps, tool designs, single- versus multi-agent architectures, and so on. Good benchmarks can reveal which approaches work best.
We’re thinking: These tests have unambiguous right answers, so agent outputs can be evaluated automatically as correct or incorrect. We look forward to further work on evaluating agents that generate free-text output.