Coding Agents Are Evolving From Novelties to Widely Useful Tools: Three research papers offer outstanding ways to use large language models to build coding agents that perform software development tasks automatically.

Published: Jun 19, 2024
Reading time: 3 min read

Dear friends,

On Father’s Day last weekend, I sat with my daughter to help her practice solving arithmetic problems. To generate the problems, I used OpenDevin, an open-source agentic coding framework, to write a Python script that produced questions she enjoyed answering at her own pace. OpenDevin wrote the code much faster than I could have and genuinely made the day better for both of us.

Six months ago, coding agents were a novelty. They still frequently fail to deliver, but I find that they’re now working well enough that they might be genuinely useful to more and more people! 

Given a coding problem specified in a prompt, the workflow for a coding agent typically goes something like this: use a large language model (LLM) to analyze the problem and, if helpful, break it into smaller steps to write code for; generate the code; test it; and iteratively feed any errors discovered back to the coding agent so it can refine its answer. But within this broad framework, a huge design space and numerous innovations are available to experiment with. I’d like to highlight a few papers that I find notable:
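
To make that loop concrete, here is a minimal sketch in Python. It assumes a hypothetical call_llm helper that sends a prompt to whatever model you use; the function names and prompts are my own illustration, not taken from any of the papers below.

```python
# Minimal sketch of the generate-test-refine loop described above.
# ASSUMPTION: call_llm is a placeholder for whatever LLM API you use;
# it is not a real library function.
import subprocess
import sys
import tempfile


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM provider here.")


def run_code(code: str, stdin: str = "") -> subprocess.CompletedProcess:
    """Execute candidate code in a subprocess and capture stdout/stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, path],
        input=stdin, capture_output=True, text=True, timeout=30,
    )


def coding_agent(problem: str, max_iterations: int = 5) -> str:
    """Generate code, run it, and feed any errors back to the model."""
    code = call_llm(f"Write a Python program that solves:\n{problem}")
    for _ in range(max_iterations):
        result = run_code(code)
        if result.returncode == 0:
            return code  # ran cleanly; a fuller agent would also run tests
        code = call_llm(
            f"The program below failed with this error:\n{result.stderr}\n\n"
            f"Program:\n{code}\n\nReturn a corrected version, code only."
        )
    return code
```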

How can we test the code without requiring the user to write test cases? In a multi-agent system, each “agent” is an LLM prompted to play a particular role. An interesting result from AgentCoder shows that having separate agents for writing code and generating tests results in better performance than letting a single agent do both tasks. This is presumably because, if the agent writing the code is also responsible for writing the tests, the tests might be influenced by the code and fail to consider corner cases that the code does not cover. 
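
Here’s a rough sketch of that separation, reusing the hypothetical call_llm helper from the sketch above. The prompts are mine for illustration, not the ones used in AgentCoder.

```python
# Sketch of separating code generation from test generation.
# The test-designer agent sees only the problem statement, never the
# generated code, so its tests aren't biased toward the implementation.
def programmer_agent(problem: str) -> str:
    return call_llm(
        "Implement a Python function that solves the following problem. "
        f"Return only code.\n\n{problem}"
    )


def test_designer_agent(problem: str) -> str:
    # Deliberately not shown the candidate code.
    return call_llm(
        "Write pytest test cases for a correct solution to the following "
        f"problem. Include corner cases. Return only code.\n\n{problem}"
    )


def generate_code_and_tests(problem: str) -> tuple[str, str]:
    code = programmer_agent(problem)
    tests = test_designer_agent(problem)
    return code, tests
```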

When people think of testing code, many initially think of output testing, in which we check whether the code produces the correct outputs for a specific set of test inputs. If the code fails a test, an LLM can be prompted to reflect on why the code failed and then try to fix it. In addition to output testing, the LDB method is helpful: it steps through the code and presents the values of the variables at intermediate steps of execution to the LLM, to see if the LLM can spot exactly where the error is. This mimics how a human developer might step through the code to see where one of the computational steps went wrong, and so pinpoint and fix the problem.
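
Below is a rough illustration of this kind of debugger-style feedback: run the code under Python’s tracing hook, record variable values line by line, and hand the trace to the model along with the failing test. It is only meant to convey the idea, not to reproduce LDB’s actual implementation, and it again assumes the hypothetical call_llm helper.

```python
import sys

# Illustration of debugger-style feedback: capture intermediate variable
# values while the candidate code runs, then show them to the LLM.

def trace_execution(code: str) -> list[str]:
    """Run `code` and record its local variables after each executed line."""
    snapshots: list[str] = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<candidate>":
            values = {k: v for k, v in frame.f_locals.items()
                      if not k.startswith("__")}
            snapshots.append(f"line {frame.f_lineno}: {values}")
        return tracer

    compiled = compile(code, "<candidate>", "exec")
    sys.settrace(tracer)
    try:
        exec(compiled, {})
    except Exception as e:  # keep the trace even if the code crashes
        snapshots.append(f"raised {type(e).__name__}: {e}")
    finally:
        sys.settrace(None)
    return snapshots


def debug_with_llm(code: str, failing_test: str) -> str:
    """Ask the LLM to locate the first incorrect step, given the trace."""
    trace = "\n".join(trace_execution(code))
    return call_llm(
        "This code fails the test below. Variable values at each step:\n"
        f"{trace}\n\nFailing test:\n{failing_test}\n\nCode:\n{code}\n\n"
        "Identify the first incorrect step and return fixed code."
    )
```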

A lot of agentic workflows mimic human workflows. As in other areas of machine learning, if humans can do a task, then trying to mimic humans makes development much easier compared to inventing a new process. However, the authors of SWE-agent noticed that many tools humans use for coding are very inefficient for agents. For example, giving an agent access to a bash shell and having it find a piece of code by executing numerous cd, ls, and cat commands is inefficient, even though humans can do this rapidly. Similarly, code editors like VS Code, Emacs, and Vim are easy for humans to use but hard for LLMs (or large multimodal models, LMMs) to navigate. Because agents interact with computers differently than humans do, the authors found that building special-purpose tools (functions) that let an agent search, view, and edit codebases resulted in better performance.
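
As a sketch of what such purpose-built tools might look like, here are three small functions that return compact, structured text an agent can read directly, instead of raw shell output. The signatures and behavior are my own illustration, not SWE-agent’s actual interface.

```python
from pathlib import Path

# Illustrative agent-facing tools: search, view, and edit a codebase,
# returning short strings that are easy for an LLM to consume.

def search_repo(root: str, query: str, max_hits: int = 20) -> str:
    """Return 'path:line: text' matches for `query` across Python files."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), start=1):
            if query in line:
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return "\n".join(hits)
    return "\n".join(hits) or "No matches."


def view_file(path: str, start: int, end: int) -> str:
    """Show a numbered window of a file so the agent can cite line numbers."""
    lines = Path(path).read_text().splitlines()
    end = min(end, len(lines))
    return "\n".join(f"{i}: {lines[i - 1]}" for i in range(start, end + 1))


def edit_file(path: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-indexed, inclusive) with `replacement`."""
    lines = Path(path).read_text().splitlines()
    lines[start - 1:end] = replacement.splitlines()
    Path(path).write_text("\n".join(lines) + "\n")
    return f"Edited {path}: replaced lines {start}-{end}."
```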

One reason research into coding agents is making rapid progress is that their performance can be evaluated automatically and reliably. With benchmarks like HumanEval, MBPP, and SWE-bench, researchers can try out an idea and automatically measure how often it generates correct code. In contrast, even though there’s considerable activity on AI research agents that search the web and synthesize an article (I’ve enjoyed using the open-source STORM system by Stanford's Yijia Shao et al.), their output is hard to evaluate automatically, and this makes progress harder.
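
The evaluation loop itself is simple, which is part of why it scales so well. Here’s a toy version that scores candidate solutions by executing them against reference tests; the problem format is made up for illustration (real benchmarks like HumanEval and SWE-bench define their own formats and sandboxing), and run_code is the helper from the first sketch above.

```python
# Toy benchmark harness: a candidate passes if the reference tests run
# cleanly against it. ASSUMPTION: the {"code": ..., "tests": ...} format
# is invented here for illustration.

def passes(candidate_code: str, test_code: str) -> bool:
    program = candidate_code + "\n\n" + test_code  # tests call the candidate
    return run_code(program).returncode == 0


def pass_rate(problems: list[dict]) -> float:
    """Fraction of problems whose generated code passes its tests."""
    solved = sum(passes(p["code"], p["tests"]) for p in problems)
    return solved / len(problems)
```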

GitHub Copilot was released in 2021, and since then many developers have been getting coding help by prompting LLMs. The evolution from that to more sophisticated coding agents is expanding how computers can help us with coding tasks, and the pace of progress is rapid. With these tools, I expect programming to become even more fun and more productive.

Keep coding!

Andrew 
