An up-and-coming Hangzhou AI lab unveiled a model that implements run-time reasoning similar to OpenAI o1 and delivers competitive performance. Unlike o1, it displays its reasoning steps.
What’s new: DeepSeek announced DeepSeek-R1, a model family that processes prompts by breaking them down into steps. A free preview version is available on the web, limited to 50 messages daily; API pricing is not yet announced. R1-lite-preview performs comparably to o1-preview on several math and problem-solving benchmarks. DeepSeek said it would release R1 as open source but didn't announce licensing terms or a release date.
How it works: DeepSeek-R1-lite-preview uses a smaller base model than DeepSeek-V2.5, which comprises 236 billion parameters. Like o1-preview, most of its performance gains come from an approach known as test-time compute: the model is trained to think at length in response to prompts, spending more compute at inference to generate deeper answers. Unlike o1-preview, which hides its reasoning, DeepSeek-R1-lite-preview displays its reasoning steps at inference. This makes the model more transparent, but it may also make it more vulnerable to jailbreaks and other manipulation.
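DeepSeek hasn't detailed its training recipe, but self-consistency (sampling several reasoning paths and majority-voting the final answer) is one simple, well-known form of test-time compute that illustrates the trade-off; it isn't necessarily DeepSeek's method. The sketch below simulates it with a stub: `solve_once` stands in for one model generation, and its 60 percent success rate is an arbitrary assumption.

```python
import collections
import random

def solve_once(problem: str, rng: random.Random) -> str:
    # Stand-in for one sampled chain-of-thought generation.
    # Here: a noisy solver that returns the right answer ("42")
    # 60 percent of the time (an arbitrary, illustrative rate).
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 99))

def solve_with_budget(problem: str, n_samples: int, seed: int = 0) -> str:
    # Self-consistency: spend more inference-time compute by sampling
    # several reasoning paths, then return the most common final answer.
    rng = random.Random(seed)
    votes = collections.Counter(solve_once(problem, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

for budget in (1, 5, 25):
    print(f"{budget:>2} samples -> answer {solve_with_budget('6 * 7?', budget)}")
```

With a single sample, the answer is wrong 40 percent of the time; with 25 samples, the majority vote is almost always correct. Spending more compute per prompt buys accuracy.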
- According to DeepSeek, R1-lite-preview, using an unspecified number of reasoning tokens, outperforms OpenAI o1-preview, OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Alibaba Qwen 2.5 72B, and DeepSeek-V2.5 on three out of six reasoning-intensive benchmarks.
- It substantially outperforms o1-preview on AIME (advanced high school math problems, 52.5 percent accuracy versus 44.6 percent), MATH (high school competition-level math, 91.6 percent accuracy versus 85.5 percent), and Codeforces (competitive programming challenges, 1,450 versus 1,428 rating). It falls behind o1-preview on GPQA Diamond (graduate-level science problems), LiveCodeBench (real-world coding tasks), and ZebraLogic (logical reasoning problems).
- DeepSeek reports that the model’s accuracy improves dramatically when it uses more tokens at inference to reason about a prompt (though the web user interface doesn’t let users control this). On AIME math problems, accuracy rises from 21 percent when the model uses fewer than 1,000 reasoning tokens to 66.7 percent when it uses more than 100,000, surpassing o1-preview. The additional performance comes at the cost of slower and more expensive output (a hypothetical harness for measuring such a budget sweep appears below).
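No public API exposes a reasoning-token budget yet, so the harness below is purely hypothetical: `query_model` is a toy stand-in whose success probability grows with the log of the budget, with constants tuned only to echo the shape of the 21 percent to 66.7 percent curve DeepSeek reports, not measured from any model.

```python
import math
import random

def query_model(problem: str, max_reasoning_tokens: int, rng: random.Random) -> bool:
    # Toy stand-in for a real model call. Success probability grows with
    # the log of the reasoning-token budget; the constants below are
    # chosen only to mimic the reported AIME curve, not measured.
    p = min(0.9, 0.07 * (math.log2(max_reasoning_tokens) - 7))
    return rng.random() < p

def accuracy_at_budget(n_problems: int, budget: int, seed: int = 0) -> float:
    # Run every problem at a fixed reasoning-token budget and
    # report the fraction answered correctly.
    rng = random.Random(seed)
    hits = sum(query_model("aime-item", budget, rng) for _ in range(n_problems))
    return hits / n_problems

for budget in (1_000, 10_000, 100_000):
    print(f"{budget:>7} reasoning tokens -> {accuracy_at_budget(500, budget):.1%}")
```

Swapping the stub for a real client, once one exists, would turn this into an actual accuracy-versus-budget measurement.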
Behind the news: DeepSeek-R1 follows OpenAI in implementing this approach at a time when scaling laws that predict higher performance from bigger models and/or more training data are being questioned.
Why it matters: DeepSeek is challenging OpenAI with a competitive large language model. It’s part of an important movement, after years of scaling models by raising parameter counts and amassing larger datasets, toward achieving high performance by spending more compute on generating output.
We’re thinking: Models that do and don’t take advantage of additional test-time compute are complementary. Those that do perform well on math and science problems, but they’re slow and costly. Those that don’t do well on language tasks at higher speed and lower cost. Applications that require facility in both math and language may benefit from switching between the two, as in the routing sketch below.
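As one way to exploit that complementarity, a product could route each prompt to whichever model suits it. A crude keyword router (hypothetical model names, not real DeepSeek or OpenAI endpoints) might look like this:

```python
import re

# Very rough heuristic: digits or math/science keywords suggest the
# prompt needs deliberate reasoning. A production router would use a
# trained classifier rather than a regex.
MATH_HINTS = re.compile(r"\d|\bprove\b|\bintegral\b|\bequation\b|\btheorem\b", re.IGNORECASE)

def route(prompt: str) -> str:
    # Placeholder model names; swap in real endpoints as needed.
    return "slow-reasoning-model" if MATH_HINTS.search(prompt) else "fast-general-model"

print(route("Solve x^2 - 5x + 6 = 0"))               # slow-reasoning-model
print(route("Draft a friendly reply to this email"))  # fast-general-model
```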