DeepSeek Sharpens Its Reasoning
DeepSeek-R1, an affordable rival to OpenAI’s o1

Bar chart comparing accuracy and percentile scores of DeepSeek models and OpenAI models across benchmarks.

A new open model rivals OpenAI’s o1, and it’s free to use or modify.

What’s new: DeepSeek released DeepSeek-R1, a large language model that executes long lines of reasoning before producing output. The code and weights are licensed freely for commercial and personal use, including training new models on R1 outputs. The paper provides an up-close look at the training of a high-performance model that implements a chain of thought without explicit prompting. (DeepSeek-R1-lite-preview came out in November with fewer parameters and a different base model.)

Mixture of experts (MoE) basics: The MoE architecture uses different subsets of its parameters to process different inputs. Each MoE layer contains a group of neural networks, or experts, preceded by a gating module that learns to choose which one(s) to use based on the input. In this way, different experts learn to specialize in different types of examples. Because not all parameters are used to produce any given output, the network uses less energy and runs faster than models of similar size that use all parameters to process every input.
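For readers who want to see the routing in code, below is a minimal PyTorch sketch of an MoE layer with top-2 gating. The layer sizes, expert count, and routing rule are illustrative placeholders, not DeepSeek-V3’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a gate picks the top-k experts for each token."""
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)  # learns which experts fit which inputs
        self.k = k

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.gate(x)                  # (batch, seq, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e)   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```

Each token activates only k of the experts, which is why an MoE model can hold far more total parameters than it uses to process any single input.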

How it works: DeepSeek-R1 is a version of DeepSeek-V3-Base that was fine-tuned over four stages to enhance its ability to process a chain of thought (CoT). It’s a mixture-of-experts transformer with 671 billion total parameters, 37 billion of which are active at any given time, and it processes 128,000 tokens of input context. Access to the model via DeepSeek’s API costs $0.55 per million input tokens ($0.14 for cached inputs) and $2.19 per million output tokens. (In comparison, o1 costs $15 per million input tokens, $7.50 for cached inputs, and $60 per million output tokens.)
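To put those prices in perspective, here’s a back-of-the-envelope comparison using the per-million-token rates quoted above. The workload (1 million input tokens, 2 million output tokens, no cache hits) is an arbitrary illustration.

```python
# Per-million-token prices quoted above (USD); cached-input rates omitted for simplicity.
prices = {
    "DeepSeek-R1": {"input": 0.55, "output": 2.19},
    "OpenAI o1": {"input": 15.00, "output": 60.00},
}

input_millions, output_millions = 1.0, 2.0  # hypothetical workload
for model, p in prices.items():
    cost = input_millions * p["input"] + output_millions * p["output"]
    print(f"{model}: ${cost:,.2f}")
# DeepSeek-R1: $4.93
# OpenAI o1: $135.00
```

At these rates, the same workload costs roughly 27 times more on o1.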

  • The team fine-tuned DeepSeek-V3-Base on a synthetic dataset of thousands of long-form CoT examples that were generated using multiple techniques. For instance, they prompted DeepSeek-V3-Base in few-shot fashion with long CoTs as examples, prompted the model to generate detailed answers while evaluating and double-checking its own CoT steps, and hired human annotators to refine and process the results.
  • They used group relative policy optimization (GRPO), a reinforcement learning algorithm, to improve the model’s ability to solve challenging problems. For example, for math problems, they built rule-based systems that rewarded the model for producing a correct final answer in a checkable format (an accuracy reward) and for enclosing its CoT steps within <think> tags (a format reward). (A minimal sketch of such reward rules appears after this list.)
  • For further fine-tuning, they used in-progress versions of R1 to generate around 600,000 responses to reasoning prompts, retaining only those that were correct. They mixed in another 200,000 non-reasoning examples (such as language translation pairs) either generated by DeepSeek-V3-Base or drawn from its training dataset.
  • They fine-tuned the model using a final round of reinforcement learning. This step encouraged the model to further boost its accuracy on reasoning problems while generally improving its helpfulness and harmlessness.
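Here’s a minimal sketch of the kind of rule-based rewards and group-relative scoring described above. The regular expression, the <answer> tag used to extract the final answer, and the toy completions are assumptions for illustration; the paper’s actual reward rules and the full GRPO objective are more involved.

```python
import re
import statistics

# Assumed output convention: reasoning inside <think> tags, final answer inside <answer> tags.
TEMPLATE = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward completions that follow the expected tag structure."""
    return 1.0 if TEMPLATE.search(completion) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Reward completions whose extracted final answer matches the reference exactly."""
    match = TEMPLATE.search(completion)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sampled completion against its group's mean and standard
    deviation rather than training a separate value network."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one math prompt whose reference answer is "42".
completions = [
    "<think>6 * 7 = 42</think> <answer>42</answer>",       # correct and well formatted
    "<think>6 * 7 = 48</think> <answer>48</answer>",        # well formatted but wrong
    "The answer is 42.",                                    # right answer, but no tags, so both rewards are 0
    "<think>half of 84 is 42</think> <answer>42</answer>",
]
rewards = [accuracy_reward(c, "42") + format_reward(c) for c in completions]
print(rewards)                             # [2.0, 1.0, 0.0, 2.0]
print(group_relative_advantages(rewards))  # higher-reward samples get positive advantages
```

Completions with above-average rewards within their group receive positive advantages and are reinforced; below-average ones are pushed down.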

Other models: DeepSeek researchers also released seven related models.

  • DeepSeek-R1-Zero is similar to DeepSeek-R1, but fine-tuned entirely using reinforcement learning. The researchers note that DeepSeek-R1-Zero was able to develop problem-solving strategies simply by being given incentives to do so. However, it was more likely to mix languages and produce unreadable outputs.
  • DeepSeek also released six dense models (with parameter counts of 1.5 billion, 7 billion, 8 billion, 14 billion, 32 billion, and 70 billion), four of them based on versions of Qwen, and two based on versions of Llama.

Results: In DeepSeek’s tests, DeepSeek-R1 went toe-to-toe with o1, outperforming that model on 5 of the 11 benchmarks tested. Some of the other new models showed competitive performance, too.

  • DeepSeek-R1 topped o1 on AIME 2024, MATH-500, and SWE-Bench Verified, while turning in competitive performance on Codeforces, GPQA Diamond, and MMLU. On LiveCodeBench, which includes coding problems that are frequently updated, it solved 65.9 percent of problems, compared to o1’s 63.4 percent.
  • It also outperformed two top models that don’t produce a chain of thought unless explicitly prompted, besting Anthropic’s Claude 3.5 Sonnet on 19 of 21 benchmarks and OpenAI’s GPT-4o on 20 of 21.
  • In DeepSeek’s tests, DeepSeek-R1-Distill-Qwen-32B outperformed OpenAI’s o1-mini across all benchmarks tested, including AIME 2024 and GPQA Diamond, while DeepSeek-R1-Distill-Llama-70B beat o1-mini on all benchmarks tested except Codeforces.

Why it matters: Late last year, OpenAI’s o1 kicked off a trend toward so-called reasoning models that implement a CoT without explicit prompting. But o1 and o3, its not-yet-widely-available successor, hide their reasoning steps. In contrast, DeepSeek-R1 bares all, allowing users to see the steps the model took to arrive at a particular answer. DeepSeek’s own experiments with distillation show how powerful such models can be as teachers to train smaller student models. Moreover, they appear to pass along some of the benefits of their reasoning skills, making their students more accurate.
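The distillation recipe here is essentially supervised fine-tuning of a smaller base model on the teacher-generated samples described above. The sketch below shows that pattern with Hugging Face Transformers; the student checkpoint, the toy in-memory dataset, and the hyperparameters are placeholders rather than DeepSeek’s actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student; DeepSeek's distilled models start from Qwen and Llama checkpoints.
STUDENT = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
model = AutoModelForCausalLM.from_pretrained(STUDENT)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy stand-in for the ~800,000 teacher-generated samples: each record pairs a prompt
# with a (hypothetical) reasoning trace produced by the teacher model.
samples = [
    {"prompt": "What is 6 * 7?",
     "response": "<think>6 * 7 = 42</think> The answer is 42."},
]

model.train()
for record in samples:
    text = record["prompt"] + "\n" + record["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Plain next-token cross-entropy on the teacher's text: the student learns to imitate
    # the reasoning trace, not just the final answer. (In practice, prompt tokens are
    # often masked out of the loss.)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```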

We’re thinking: DeepSeek is rapidly emerging as a strong builder of open models. Not only are these models great performers, but their license permits use of their outputs for distillation, potentially pushing forward the state of the art for language models (and multimodal models) of all sizes.
