How do agents based on large language models compare to human experts when it comes to proposing machine learning research? Pretty well, according to one study.
What’s new: Chenglei Si, Diyi Yang, and Tatsunori Hashimoto at Stanford gathered ideas for machine learning research from both Anthropic’s Claude 3.5 Sonnet and human researchers, then evaluated the resulting proposals using both manual and automated methods. Claude 3.5 Sonnet generated competitive proposals, but it proved a less reliable judge of proposal quality.
How it works: Each proposal included a problem statement, motivation, step-by-step plan, backup plan, and examples of baseline outcomes versus expected experimental outcomes.
- Automated proposal generation: Given one of seven topics (bias, coding, safety, multilinguality, factuality, math, or uncertainty) and 10 related papers retrieved via the Semantic Scholar search engine, Claude 3.5 Sonnet generated 4,000 research ideas. The authors embedded the ideas using all-MiniLM-L6-v2 and removed near-duplicates based on the cosine similarity of their embeddings (see the deduplication sketch after this list), leaving roughly 200 AI-generated ideas per topic. For each remaining idea, the model generated a proposal.
- Automated ranking: Claude 3.5 Sonnet ranked the proposals in a five-round tournament that awarded points for superior quality in pairwise comparisons and pitted the highest-scoring proposals against one another (see the tournament sketch after this list). In addition, one of the authors manually ranked the generated proposals.
- Human proposal generation: The authors paid 49 machine learning engineers to write proposals of their own. To obscure authorship, they prompted an unidentified large language model to edit the human-written proposals according to a style guide, then manually checked the rewritten proposals to ensure that the editing didn’t change their content significantly.
- Human ranking: A group of 79 machine learning engineers reviewed the 49 human-written proposals, the top 49 AI-generated proposals ranked by humans, and the top 49 AI-generated proposals ranked by AI (resulting in two to four reviews per proposal). They scored the proposals between 1 and 10 on five factors: novelty, feasibility, expected effectiveness, how exciting they were, and overall quality.
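The deduplication step can be pictured with a brief sketch. The code below uses the sentence-transformers library to embed ideas with all-MiniLM-L6-v2 and greedily drops any idea whose cosine similarity to an already-kept idea exceeds a threshold; the 0.8 threshold and keep-first policy are illustrative assumptions, not the authors’ exact settings.

```python
# Sketch of embedding-based deduplication with all-MiniLM-L6-v2.
# The 0.8 similarity threshold and greedy keep-first policy are assumptions
# for illustration; the paper's exact settings may differ.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def deduplicate(ideas: list[str], threshold: float = 0.8) -> list[str]:
    embeddings = model.encode(ideas, normalize_embeddings=True)
    kept_texts, kept_embs = [], []
    for text, emb in zip(ideas, embeddings):
        # Keep an idea only if it isn't too similar to any idea kept so far.
        if all(cos_sim(emb, prev).item() < threshold for prev in kept_embs):
            kept_texts.append(text)
            kept_embs.append(emb)
    return kept_texts
```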
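The automated ranking can be sketched in a similar spirit. Here, llm_prefers_first is a hypothetical stand-in for a pairwise Claude 3.5 Sonnet judgment; the pairing rule (proposals with similar running scores meet each round) and one point per win are assumptions meant only to illustrate the tournament’s shape.

```python
# Sketch of a five-round, score-based pairwise tournament.
import random

def llm_prefers_first(proposal_a: str, proposal_b: str) -> bool:
    """Hypothetical stand-in for a pairwise LLM judgment of which proposal is stronger."""
    return random.random() < 0.5  # Replace with a real model call.

def tournament_rank(proposals: list[str], rounds: int = 5) -> list[str]:
    scores = [0] * len(proposals)
    for _ in range(rounds):
        # Pair proposals with similar running scores against one another.
        order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
        for i, j in zip(order[::2], order[1::2]):
            if llm_prefers_first(proposals[i], proposals[j]):
                scores[i] += 1  # One point per pairwise win.
            else:
                scores[j] += 1
    ranked = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    return [proposals[i] for i in ranked]
```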
Results: Human judges deemed proposals generated by Claude 3.5 Sonnet as good as or better than those produced by humans. However, large language models proved less effective than human reviewers at judging the proposals’ quality.
- On average, humans scored the AI-generated and human-written proposals roughly equally in feasibility, expected effectiveness, how exciting they were, and overall quality, but they deemed the AI-generated proposals significantly more novel. The top AI-generated proposals as ranked by humans achieved an average novelty score of 5.78, the top AI-generated proposals as ranked by AI achieved 5.62, and the human-written proposals achieved 4.86.
- The authors found that LLMs don’t yet match human reviewers when it comes to judging research proposals and papers. They compared rates of agreement for five LLMs that evaluated proposals in their experiment, for the human judges of those proposals, and for human reviewers of papers submitted to the NeurIPS and ICLR conferences (one way to compute such pairwise consistency is sketched below). The most consistent LLM, Claude 3.5 Sonnet, was 53.3 percent consistent with the average human judgment, the human judges were 56.1 percent consistent, and reviewers for NeurIPS and ICLR were 66 percent and 71.9 percent consistent, respectively. Random chance would yield 50 percent.
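As a rough illustration of how a pairwise consistency rate like those above could be computed (the paper’s exact protocol may differ), the sketch below compares one judge’s scores against consensus scores over every pair of proposals and counts how often the preferences match, skipping ties. A judge who flipped a coin on each pair would land near 50 percent.

```python
# Illustrative pairwise consistency: the fraction of proposal pairs on which a
# judge's preference matches the preference implied by consensus scores.
# This is an assumed reconstruction, not necessarily the paper's exact metric.
from itertools import combinations

def consistency_rate(judge_scores: list[float], consensus_scores: list[float]) -> float:
    agree = total = 0
    for i, j in combinations(range(len(judge_scores)), 2):
        judge_diff = judge_scores[i] - judge_scores[j]
        consensus_diff = consensus_scores[i] - consensus_scores[j]
        if judge_diff == 0 or consensus_diff == 0:
            continue  # Ties carry no preference, so skip them.
        total += 1
        if (judge_diff > 0) == (consensus_diff > 0):
            agree += 1
    return agree / total if total else 0.0

# Example with made-up scores: this judge agrees with the consensus on 4 of 6 pairs.
print(consistency_rate([7, 5, 6, 3], [8, 4, 5, 6]))  # ~0.667
```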
Why it matters: AI models play a growing role in scientific discovery. This work shows they can set directions for research — in machine learning, at least — that rival those set by humans. However, human evaluation remains the gold standard for comparing performance on complex problems like generating text.
We’re thinking: Coming up with good research ideas is hard! That a large language model can do it with some competency has exciting implications for the future of both AI and science.