Coding agents are improving, but can they tackle machine learning tasks?
What’s new: Chan Jun Shern and colleagues at OpenAI introduced MLE-bench, a benchmark designed to test how well AI coding agents do in competitions hosted by the Kaggle machine learning contest platform. The benchmark is available here.
Agentic framework basics: An agentic framework or scaffold consists of a large language model (LLM) and code to prompt the model to follow a certain procedure. It may also contain tools the LLM can use, such as a Python console or web browser. For example, given a problem to solve, a framework might prompt the model to generate code, run the code in the Python console, generate evaluation code, run evaluation code, change the solution based on the console’s output, and repeat until the problem is solved.
How it works: MLE-bench is an offline competition environment that contains 75 Kaggle competitions selected manually by the authors, such as contests to identify toxic comments and predict volcanic eruptions. Each competition includes a description, training and testing datasets, code to grade submissions, a leaderboard of human contestants for comparison with an agent’s performance, and a “complexity” rating (produced by OpenAI): low (takes an experienced human less than two hours to code a solution, not including training time), medium (between two and 10 hours), or high (more than 10 hours). Given a competition, an agent must produce a submission by (i) generating code to train a machine learning model and (ii) running the model on the test set. Users grade the submission to evaluate the agent’s performance.
- The authors ran their benchmark on three open source agentic frameworks using GPT-4o as the LLM. The frameworks were AIDE, ResearchAgent, and CodeActAgent. AIDE earned the highest score.
- They ran their benchmark again on AIDE, this time using four different LLMs: o1-preview, GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B.
- To make sure the agents didn’t find the solution in a web search or use a successful solution that was included in the LLM’s training data, the authors performed two checks: (i) GPT-4o checked the agent’s logs for calls to an external API or downloads of restricted resources and (ii) the Dolos anti-plagiarism tool compared the agent’s submission with the top 50 human submissions.
Results: The authors evaluated agent performance according to Kaggle’s standards for awarding medals to human contestants (described in the final bullet below).
- The pairing of AIDE/o1-preview performed best, winning medals in 16.9 percent of competitions.
- AIDE/GPT-4o was a distant second place with medals in 8.7 percent of competitions.
- AIDE/Claude 3.5 Sonnet won medals in 7.6 percent of competitions.
- AIDE/Llama 3.1 won medals in 3 percent of competitions.
- Kaggle does not award medals for certain types of competition. However, for competitions in which it does award medals, it uses the following formula: For competitions in which less than 250 human teams participated, contestants win a medal if they score within the top 40 percent. For competitions in which 250 to 999 teams participated, they win a medal if they score in the top 100. For competitions that included 1,000 teams or more, they win a medal if they score within the top 10 percent.
Yes, but: The percentage of medals won by agents in this study is not comparable to percentages of medals won by humans on Kaggle. The authors awarded medals for excellent performance in all competitions included in the benchmark, but Kaggle does not. The authors didn’t tally the agents’ win rate for only competitions in which Kaggle awarded medals.
Why it matters: It’s important to evaluate the abilities of coding agents to solve all kinds of programming problems. Machine learning tasks are especially valuable as they bear on the ability of software to analyze unstructured data and adapt to changing conditions.
We’re thinking: We’re glad to see machine learning catching on among humans and machines alike!