Agentic Coding Strides Forward
Genie coding assistant tops SWE-bench, solving over 30 percent of its problems

The SWE-bench Full leaderboard shows Cosine Genie outperforming its competitors.

An agentic coding assistant pushed the state of the art on an important benchmark past 30 percent of problems solved, roughly 10 percentage points ahead of the previous best.

What’s new: Cosine, a startup based in London, unveiled Genie, a coding assistant that achieves top performance on SWE-bench, which tests a model’s ability to solve GitHub issues. The company has yet to announce pricing and availability but has opened a waitlist.

How it works: Genie is a fine-tuned version of GPT-4o with a larger context window of undisclosed size. It works similarly to agentic coding tools like Devin, Q, OpenDevin, and SWE-agent. Its agentic workflow loops through four processes: retrieving information, planning, writing code, and running it. It was trained on a proprietary training set that captures software engineers’ processes for reasoning, gathering information, and making decisions. It edits lines of code in place rather than rewriting entire sections or files from scratch. 

  • Cosine initially fine-tuned Genie roughly equally on six software engineering tasks: developing features, fixing bugs, refactoring, making minor changes, writing tests, and writing documentation. The fine-tuning set included 15 programming languages, mostly JavaScript and Python (21 percent each), followed by TypeScript and TSX (14 percent each).
  • Subsequent fine-tuning focused on finishing incomplete code and fixing imperfect code, tasks that were underrepresented in the initial dataset. This round of training used incorrect examples generated by Genie itself: by pairing Genie’s initial incorrect output with correct examples, the model improved its ability to recognize and fix mistakes (a data-construction sketch follows this list).
  • At inference, given a natural-language prompt, a ticket that outlines a programming task, or a GitHub issue, the model retrieves relevant files and documentation, makes a plan for fixing the issue, and writes new code. It then runs verification tests. If the tests fail, it loops between planning and coding until they succeed (see the loop sketch after this list).
  • Genie can also create and monitor pull requests on GitHub. It responds to human comments on its own pull requests just like it acts upon GitHub issues.
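
How the self-correction training data described above might be assembled is shown in the minimal Python sketch below. The function name, record fields, and prompt wording are illustrative assumptions, not Cosine’s actual data schema or pipeline.

```python
# Hypothetical sketch: pair the model's earlier incorrect output with the
# known-correct code so fine-tuning teaches it to recognize and fix mistakes.
# The record format and prompt wording are assumptions, not Cosine's schema.

def make_correction_example(task: str, incorrect_attempt: str, correct_solution: str) -> dict:
    """Build one supervised fine-tuning record from a failed attempt."""
    return {
        "input": (
            f"{task}\n\n"
            "Previous attempt (contains a mistake):\n"
            f"{incorrect_attempt}\n\n"
            "Revise the attempt so that it is correct."
        ),
        "target": correct_solution,  # label: the known-correct code
    }


# Toy usage:
record = make_correction_example(
    task="Sum the integers from 1 to n inclusive.",
    incorrect_attempt="total = sum(range(1, n))",
    correct_solution="total = sum(range(1, n + 1))",
)
```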
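
The inference-time loop maps onto a simple control structure. The sketch below is a minimal illustration rather than Cosine’s actual implementation: the step functions (retrieve, plan, write_patch, run_tests) are hypothetical callables standing in for Genie’s undisclosed components.

```python
# Minimal sketch of the retrieve -> plan -> write code -> run tests loop.
# All step functions are hypothetical placeholders, not Cosine's actual API.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TestResult:
    passed: bool
    log: str  # test output, fed back into planning when tests fail


def solve_issue(
    issue: str,                              # natural-language prompt, ticket, or GitHub issue
    retrieve: Callable[[str], str],          # gather relevant files and documentation
    plan: Callable[[str, str, str], str],    # (issue, context, failure_log) -> plan
    write_patch: Callable[[str, str], str],  # (plan, context) -> in-place code edit
    run_tests: Callable[[str], TestResult],  # apply the edit and run verification tests
    max_rounds: int = 5,
) -> Optional[str]:
    """Loop between planning and coding until the verification tests pass."""
    context = retrieve(issue)
    failure_log = ""
    for _ in range(max_rounds):
        step_plan = plan(issue, context, failure_log)
        patch = write_patch(step_plan, context)
        result = run_tests(patch)
        if result.passed:
            return patch          # tests succeed: return the accepted edit
        failure_log = result.log  # tests fail: re-plan using the failure output
    return None                   # give up after max_rounds attempts
```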

Results: Tested on SWE-bench Full (2,294 issue-commit pairs across 12 Python repositories), Genie solved 30.1 percent of problems, far ahead of the next closest competitor, Amazon Q, at 19.75 percent. On SWE-bench Lite (winnowed to 300 issue-commit pairs to save computation), Genie solved 50.7 percent, beating CodeStory Aide plus other models at 43 percent. (Genie’s results don’t appear on the official SWE-bench leaderboard, which requires that models document their workings; Cosine declined in order to avoid revealing proprietary information but released Genie’s solution sets so its performance can be verified.)

Behind the news: SWE-bench’s creators recently collaborated with OpenAI to produce a new version, SWE-bench Verified. They eliminated extremely difficult and poorly configured problems, leaving 500 human-verified issue-commit pairs. Cosine has yet to publish Genie’s performance on SWE-bench Verified. As of this writing, Amazon Q ranks in first place with 38.8 percent. 

Why it matters: Some developers of AI coding assistants train models to follow human-style procedures, while others build AI-native methods. Genie takes a distinct step forward by mimicking software engineers. Competition between the two approaches, along with longer context windows, faster inference, and increasingly sophisticated agentic workflows, is driving rapid improvement in coding assistants.

We’re thinking: We’re glad this Genie escaped the bottle!
