OpenAI Forges Chains of Thought OpenAI’s o1 models excel in reasoning, outperform GPT-4o in math and coding

Published
Reading time
3 min read
OpenAI's model scores on the GPQA Diamond tests in biology, chemistry, and physics, along with their overall score.

Preliminary versions of OpenAI’s new model family were trained explicitly to think step-by-step, yielding outstanding marks in math, science, and coding — but users can’t see their reasoning steps.

What’s new: OpenAI launched beta versions of o1-preview and o1-mini, language models that were trained via reinforcement learning to use chains of thought. The models are available to paid ChatGPT users as well as API customers who have been onboard for more than 30 days and spent $1,000. o1-preview costs $15/$60 per million input/output tokens, significantly higher than GPT-4o’s price of $5/$15. o1-mini costs $3/$12 per million input/output tokens. OpenAI didn’t announce a release date for a finished o1 model.

How it works: o1-preview is a preliminary release, and o1-mini is a faster preliminary version that’s particularly effective at coding. OpenAI published an o1 system card but hasn’t disclosed details about the new models’ size, architecture, or training. Both models have an input context window of 128,000 tokens. They accept only text tokens, but OpenAI plans to support other media types in future versions.

  • o1-preview and o1-mini were trained on data scraped from the web, open-source databases, and proprietary data supplied by partners and OpenAI. The reinforcement learning process rewarded the models for generating desired reasoning steps and for their alignment with human values, goals, and expectations. 
  • The beta models process “reasoning tokens” that the company charges for as though they were output tokens although they’re invisible to users. The use of reasoning tokens makes the models slower and costlier to produce output than GPT-4o, but they deliver superior performance in tasks that benefit from step-by-step reasoning. OpenAI provides an example in which o1-preview deciphered enciphered text in which each letter is replaced by two letters that, according to alphabetical order, are equidistant from the intended letter. In other examples, it calculates the pH of a solution of ammonium fluoride and suggests a medical diagnosis based on symptoms that are present and absent.
  • o1-preview’s output is limited to around 32,768 tokens, including reasoning tokens, while o1-mini’s is capped at roughly 65,536. OpenAI recommends budgeting 25,000 tokens for reasoning.
  • OpenAI keeps the chain of thought hidden to avoid exposing information that wasn’t requested. In addition, it doesn’t want users to try to control the model’s reasoning, and it doesn’t want competitors to see what’s going on behind the scenes. (Nonetheless, ChatGPT users can see a summary of steps that led to a given response)
  • OpenAI and third parties conducted safety evaluations, including testing for inappropriate outputs, race, gender, and age biases, and harmful chains of thought. o1-preview and o1-mini returned fewer hallucinations and showed more resistance to jailbreaking attacks than GPT-4o and GPT-4o mini. Both models show a higher risk than previous OpenAI models of helping to produce biological threats, but the risk is within the bounds of its safety policy.

Results: The actual o1 model — which remains unavailable — generally outperforms o1-preview, while both vastly outperform GPT-4o on math, science, and coding benchmarks. 

  • o1: The forthcoming model outperformed GPT-4o on 54 out of 57 MMLU subcategories that test knowledge in fields like elementary mathematics, U.S. history, and law. It achieved an Elo score of 1,673 on coding contests drawn from the website Codeforces (in which it was allowed 10 submissions for any given problem), putting it in the 89th percentile (human expert level). On the GPQA Diamond tests of graduate-level knowledge in biology, chemistry, and physics, it scored higher than PhD-level experts recruited by OpenAI. It correctly answered 74 percent of questions from the 2024 USA Math Olympiad qualifier. 
  • o1-preview: The preview version ranked in the 62nd percentile on Codeforces. Human evaluators preferred its output to that of GPT-4o in response to prompts that tested coding, data analysis, and math. (They preferred GPT-4o’s responses to prompts that requested “personal writing.”)

Behind the news: In recent months, Anthropic has been using the tag <antThinking> to generate thinking tokens that are hidden from users. However, OpenAI’s implementation in the o1 models takes this capability much further.

Why it matters: The o1 models show that the combination of reinforcement learning and chain-of-thought reasoning can solve problems that large language models generally find challenging. They’re substantially more accurate in domains such as coding, math, and science that have low tolerance for error. However, the fact that the models hide their reasoning from users makes them less transparent and explainable than their predecessors and may make their outstanding performance less valuable in some applications.

We’re thinking: Agentic workflows can significantly improve a system’s ability to reflect, reason, and iterate on its output. Training a model to take such steps directly in response to even general-purpose questions opens an exciting alternative path to better reasoning beyond simply scaling up model size.

Share

Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox