While some observers argue that large language models can’t produce truly original output, new work prompted them to generate novel scientific research.
What’s new: Researchers proposed The AI Scientist, an agentic workflow that directs large language models to generate ideas for AI research, produce code to test them, and document the inquiry. You can see examples of its output and download the code to generate your own papers here. The team included Chris Lu, Cong Lu, Robert Tjarko Lange, and colleagues at Tokyo-based startup Sakana AI, University of Oxford, University of British Columbia, Vector Institute, and the Canadian Institute for Advanced Research.
How it works: The authors used Claude 3.5 Sonnet, GPT-4o, DeepSeek Coder, and Llama 3.1 405B to generate papers in three categories: diffusion image modeling, transformer-based language modeling, and “grokking,” which the authors define as generalization and speed of learning in deep neural networks.
- The authors prompted a given large language model (LLM) to generate “the next creative and impactful idea for research” in one of the three categories. Then they provided an API for searching papers and asked the LLM to judge whether its idea was novel. If the LLM deemed the idea novel, the process moved to the next step; if it couldn’t decide, it generated a search query for related works, and the authors asked again in light of the results. They repeated this loop until the LLM reached a decision (see the first sketch after this list).
- Once they had a novel idea, they prompted the LLM to generate a list of experiments and run them using the Aider Python library. Then they prompted it to take notes on the results and to generate figures by altering an existing Python script (second sketch below).
- They prompted the LLM to generate a paper one section at a time, given the notes, figures, sections generated so far, and tips on how to write a paper drawn from an existing guide. Then they prompted it to search for related works and add relevant citations. Finally, they asked it to remove redundancy, reduce verbosity, and finalize the document’s format (third sketch below).
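
The idea-vetting step lends itself to a short sketch. Below is a minimal Python illustration, assuming a hypothetical `llm.ask` chat helper and a `search_papers` function wrapping a paper-search API; none of these names come from the authors’ code.

```python
# Minimal sketch of the idea-generation and novelty-check loop.
# `llm.ask` and `search_papers` are hypothetical stand-ins for an LLM
# chat helper and a paper-search API; they are not the authors' code.

def generate_vetted_idea(llm, search_papers, topic, max_rounds=5):
    idea = llm.ask(
        f"Generate the next creative and impactful idea for research in {topic}."
    )
    related_works = ""
    for _ in range(max_rounds):
        verdict = llm.ask(
            "Decide whether this idea is NOVEL or NOT NOVEL given the related "
            "works below, or reply 'QUERY: <terms>' if you cannot decide.\n"
            f"Idea: {idea}\nRelated works: {related_works}"
        )
        if verdict.startswith("NOVEL"):
            return idea   # proceed to the experiment stage
        if verdict.startswith("NOT NOVEL"):
            return None   # discard; the workflow generates a fresh idea
        # The model asked for more evidence: run its search query.
        related_works = search_papers(verdict.removeprefix("QUERY:").strip())
    return None  # undecided after max_rounds
```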
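The experiment stage can be scripted through Aider’s Python interface. This second sketch uses Aider’s documented `Coder.create` / `coder.run` scripting API, but the prompts and file names are illustrative, and executing the edited script and feeding errors back to the model is omitted.

```python
# Sketch of the experiment stage driven through Aider's Python scripting
# interface (Coder.create / coder.run). Prompts and file names are
# illustrative; running the edited script and handling failures is left out.
from aider.coders import Coder
from aider.models import Model

def implement_experiment(idea, experiment_file="experiment.py"):
    coder = Coder.create(
        main_model=Model("gpt-4o"),           # any Aider-supported model
        fnames=[experiment_file, "plot.py"],  # files Aider is allowed to edit
    )
    # Ask the model to edit experiment.py to implement the planned experiment.
    coder.run(f"Edit {experiment_file} to implement this experiment: {idea}")
    # Later, have it adapt the existing plotting script to the results.
    coder.run("Modify plot.py to produce figures from the saved results.")
```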
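Drafting the paper section by section is a straightforward loop. This third sketch again assumes the hypothetical `llm.ask` helper; the section list and prompts are illustrations, not the authors’ templates.

```python
# Sketch of section-by-section paper drafting. The section list, prompts,
# and `llm.ask` helper are hypothetical, not the authors' templates.
SECTIONS = ["Introduction", "Background", "Method",
            "Experimental Setup", "Results", "Conclusion"]

def write_paper(llm, notes, figures):
    draft = ""
    for section in SECTIONS:
        draft += llm.ask(
            f"Write the {section} section of the paper.\n"
            f"Notes: {notes}\nFigures: {figures}\n"
            f"Sections so far: {draft}\n"
            "Follow standard advice on writing research papers."
        )
    draft = llm.ask(f"Add citations to related work in this draft:\n{draft}")
    return llm.ask(f"Remove redundancy and reduce verbosity:\n{draft}")
```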
Results: The team used GPT-4o to evaluate the generated papers according to the guidelines for papers presented at the Neural Information Processing Systems (NeurIPS) conference. The guidelines include an overall score between 1 (very strongly reject) and 10 (award-quality: flawless and groundbreaking) and a decision to accept or reject the paper (a sketch of such an automated reviewer follows this list).
- Of the four LLMs, Claude 3.5 Sonnet performed best. Its highest-scoring papers achieved 6 (weak accept). Of one of its papers, the authors wrote, “The AI Scientist correctly identifies an interesting and well-motivated direction in diffusion modeling research . . . It proposes a comprehensive experimental plan to investigate its idea, and successfully implements it all, achieving good results.” The authors provide an archive of Claude’s output here.
- GPT-4o ranked second. Its highest-scoring paper achieved 5 (borderline accept).
- The generated papers achieved an average score of 4.05 or less (4 is borderline reject) across all models and categories of experiment. The experiments generally involved small networks that were trained and tested on generated data. The authors note that the system often failed to implement its ideas, sometimes fabricated results, and sometimes failed to cite the most relevant papers, among other issues.
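
The automated review step amounts to a single structured-output call. This sketch uses the OpenAI Python SDK’s JSON mode; the prompt is a stand-in for the full NeurIPS reviewer guidelines the team supplied, and the `review_paper` helper is ours.

```python
# Sketch of the automated review step: GPT-4o scores a generated paper.
# The prompt is a stand-in for the full NeurIPS reviewer guidelines.
import json
from openai import OpenAI

client = OpenAI()

def review_paper(paper_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # request JSON output
        messages=[{
            "role": "user",
            "content": (
                "Review the paper below using the NeurIPS reviewer guidelines. "
                "Return JSON with an integer 'overall' score from 1 (very "
                "strongly reject) to 10 (award-quality) and a 'decision' of "
                "'accept' or 'reject'.\n\n" + paper_text
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```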
Why it matters: Agentic workflows are a rising theme in AI research, from simpler design patterns like reflection to complex workflows like literary translation. These workflows make it possible to break complex problems into more manageable subtasks. By dividing the work of AI research into stages (generating ideas, testing them, and writing a paper), an LLM with access to the right tools can produce novel research papers complete with experimental results.
We’re thinking: Rather than merely synthesizing existing knowledge, this work points in a fascinating direction: using AI to generate new knowledge! Right now, an LLM can suggest starting points for human researchers along with experiments that back up its suggestions.