o1 Family Benchmarks comparing pass rates across AIME, Codeforces, and GPQA.

OpenAI launched not only its highly anticipated o1 model but also an operating mode that enables the model to deliver higher performance — at a hefty price.

What’s new: Kicking off a 12-day holiday blitz, OpenAI launched o1 (previously available in preview and mini versions) and introduced o1 pro mode, which processes more tokens at inference to produce more accurate output. Both options accept text and image inputs to generate text outputs. They’re available exclusively through a new ChatGPT Pro subscription for $200 monthly. API access is not yet available.

How it works: According to an updated system card, o1 models were trained on a mix of public, licensed, and proprietary text, code, and images, with a focus on technical, academic, and structured datasets. They respond to prompts by breaking them down into intermediate steps, each of which consumes a number of hidden “reasoning tokens.” The models don’t reveal these steps, but ChatGPT presents a natural-language summary of the reasoning process. The new o1 and o1 pro mode perform better than o1-preview and o1-mini, but their additional reasoning requires more processing, which translates into higher costs and slower responses.

  • o1 consistently outperforms o1-preview in one-shot benchmarks that measure accuracy in advanced math problems (AIME 2024), coding challenges (Codeforces), and graduate-level science questions (GPQA Diamond).
  • o1 pro mode performs only slightly better than o1 on one-shot tests, but its advantage widens when a model must answer the same input correctly four times in a row. For example, given a problem from the American Invitational Mathematics Examination (AIME), o1 solves it correctly 78 percent of the time and o1 pro mode 86 percent of the time. Asked the same problem four times, o1 answers correctly in all four tries 67 percent of the time, while o1 pro mode does so 80 percent of the time.
  • o1 and o1 pro mode are less prone to generating false or irrelevant information than o1-preview, as measured by OpenAI’s SimpleQA, which tests the ability to recall facts about science, geography, history, and the like, and PersonQA, which tests the ability to recall facts about people.
  • ChatGPT Pro provides chatbot access to o1, o1 pro mode, and other OpenAI models. Subscribers get unlimited use of o1. OpenAI has not clarified whether o1 pro mode is subject to usage limits or other constraints.
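The four-in-a-row numbers above reveal something the one-shot rates alone don't: the models' attempts are not independent. A quick back-of-the-envelope check (assuming the reported percentages are exact) shows that independent coin-flip attempts at the one-shot rate would predict far lower four-for-four rates than those observed:

```python
# If each attempt were an independent trial at the one-shot pass rate,
# solving the same problem in all four tries would occur with probability
# rate ** 4. The observed rates are much higher, implying the models'
# successes are strongly correlated across attempts: a model tends either
# to "get" a given problem or not.

for name, one_shot, observed_four in [
    ("o1", 0.78, 0.67),
    ("o1 pro mode", 0.86, 0.80),
]:
    independent_prediction = one_shot ** 4
    print(f"{name}: independence predicts {independent_prediction:.2f}, "
          f"observed {observed_four:.2f}")
```

Independence would predict roughly 37 percent for o1 and 55 percent for o1 pro mode, well below the observed 67 and 80 percent, which is why the four-attempt test separates the two models more sharply than a single attempt does.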

Behind the news: Since September, when OpenAI introduced o1-preview and o1-mini, other model providers have implemented similar reasoning capabilities. DeepSeek’s R1 displays reasoning steps that o1 models keep hidden. Alibaba’s QwQ 32B excels at visual reasoning but is slower and has a smaller context window. Amazon’s Nova Premier, which is billed as a model for “complex reasoning tasks,” is expected in early 2025, but Amazon has not yet described its performance, architecture, or other details.

Why it matters: o1 and o1 pro mode highlight a dramatic shift in model development and pricing. Giving models more processing power at inference yields more accurate output, and it’s a key ingredient of agentic workflows. Test-time compute also continues to boost performance even as scaling laws that predict better performance from more training data and compute may be reaching their limits. However, it also raises OpenAI’s costs, and at $200 a month, the price of access to o1 and o1 pro mode is steep. It’s a premium choice for developers who require exceptional accuracy or extensive reasoning.

We’re thinking: Discovering scaling laws for using more processing at inference, or test-time compute, is an unsolved problem. Although OpenAI hasn’t disclosed the algorithm behind o1 pro mode, recent work at Google allocated tokens dynamically at inference based on a prompt’s difficulty. This approach boosted compute efficiency fourfold and enabled a model that had shown “nontrivial success rates” to outperform one 14 times its size.
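OpenAI hasn’t said how o1 pro mode spends its extra inference compute. One common form of test-time compute is best-of-n sampling with difficulty-adaptive budgets, which the Google work gestures at. Here is a minimal sketch under that assumption; `generate` and `score` are toy stand-ins, not any real model or API:

```python
import random

# Hypothetical sketch: harder prompts get more samples, and a scorer
# (in practice a verifier or reward model) picks the best candidate.

def generate(prompt: str, rng: random.Random) -> str:
    """Stand-in for sampling one candidate answer from a model."""
    return f"answer-{rng.randint(0, 9)}"

def score(prompt: str, answer: str) -> float:
    """Toy scorer: in a real system, a learned verifier would rank answers."""
    return float(answer.rsplit("-", 1)[1])

def best_of_n(prompt: str, difficulty: float, seed: int = 0) -> str:
    """Sample a difficulty-scaled number of candidates and keep the best."""
    rng = random.Random(seed)
    n = 1 + int(difficulty * 15)  # budget grows with estimated difficulty
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))
```

With a larger sample budget, the expected score of the winning candidate can only rise, which is the intuition behind spending more tokens on harder prompts. Real systems must also estimate difficulty, whether from the prompt itself or from early samples, and that estimator is where much of the open research lies.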
