Most models that have learned to reason via reinforcement learning have been huge. A much smaller model now competes with them.
What’s new: Alibaba introduced QwQ-32B, a large language model that rivals the reasoning prowess of DeepSeek-R1 despite its relatively modest size.
- Input/output: Text in (up to 131,072 tokens), text out
- Architecture: Transformer, 32.5 billion total parameters
- Performance: Outperforms OpenAI o1-mini and DeepSeek-R1 on some benchmarks
- Features: Chain-of-thought reasoning, function calling, multilingual in 29 languages
- Undisclosed: Output size, training data
- Availability/price: Free via Qwen Chat. Weights are free to download for noncommercial and commercial uses under an Apache 2.0 license.
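Because the weights are openly available, the model can be run locally. The snippet below is a minimal sketch using Hugging Face Transformers; the repository ID "Qwen/QwQ-32B" and the generation settings are assumptions rather than official instructions, so consult the model card before relying on them.

```python
# Minimal sketch of running the open weights locally with Hugging Face Transformers.
# The repo id "Qwen/QwQ-32B" is assumed; check the official model card before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many positive divisors does 360 have?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model emits a long chain of thought before its final answer,
# so allow plenty of new tokens.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```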
How it works: QwQ-32B is a version of Qwen2.5-32B that was fine-tuned to generate chains of thought using reinforcement learning (RL). Fine-tuning proceeded in two stages.
- The first stage of RL fine-tuning focused on math and coding tasks. The model earned rewards for correct final outcomes (no partial credit for intermediate steps). An accuracy verifier checked its math solutions, while a code-execution server verified generated code against predefined test cases (sketched in code after this list).
- The second stage encouraged the model to follow instructions, use tools, and align its values with human preferences while maintaining math and coding performance, again rewarding final outcomes. In this stage, the model earned rewards from an unspecified reward model and some rule-based verifiers.
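Alibaba hasn't released its training code, but the outcome-only reward scheme from the first stage can be illustrated. The sketch below is a simplified, hypothetical version: the answer-matching heuristic and test-case format are our assumptions, and the actual verifiers are surely more robust.

```python
# Minimal sketch of outcome-only rewards (not Alibaba's code): score only the final
# result, 1.0 for a verified answer or fully passing code, 0.0 otherwise.
# Intermediate reasoning steps earn no partial credit.
import subprocess
import tempfile


def math_reward(model_output: str, reference_answer: str) -> float:
    """Accuracy verifier: compare the model's final line to the reference answer."""
    final_answer = model_output.strip().splitlines()[-1]  # crude final-answer extraction
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def code_reward(generated_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Code-execution check: run the generated program on each predefined test case."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    for stdin_text, expected_stdout in test_cases:
        result = subprocess.run(
            ["python", path], input=stdin_text,
            capture_output=True, text=True, timeout=10,
        )
        if result.stdout.strip() != expected_stdout.strip():
            return 0.0  # any failing test case means no reward
    return 1.0  # all test cases passed
```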
Performance: On several benchmarks for math, coding, and general problem solving, QwQ-32B outperforms OpenAI o1-mini (parameter count undisclosed) and achieves performance roughly comparable to DeepSeek-R1 (671 billion parameters, 37 billion active at any moment).
- On AIME24 (high-school competition math problems), QwQ-32B achieved 79.5 percent accuracy, well ahead of o1-mini (63.6 percent) but slightly behind DeepSeek-R1 (79.8 percent).
- On LiveCodeBench (code generation, repair, and testing), QwQ-32B achieved 63.4 percent, outperforming o1-mini (53.8 percent) but trailing DeepSeek-R1 (65.9 percent).
- On LiveBench (problem-solving in math, coding, reasoning, and data analysis), QwQ-32B reached 73.1 percent, ahead of o1-mini (59.1 percent) and DeepSeek-R1 (71.6 percent).
- On IFEval (following instructions), QwQ-32B achieved 83.9 percent, outperforming DeepSeek-R1 (83.8 percent) but behind o1-mini (84.8 percent).
- On BFCL (function calling), QwQ-32B achieved 66.4 percent, better than both DeepSeek-R1 (60.3 percent) and o1-mini (62.8 percent).
Behind the news: DeepSeek’s initial model, DeepSeek-R1-Zero, similarly applied RL to a pretrained model. That effort produced strong reasoning but poor readability (for example, math solutions with correct steps but jumbled explanations). To address this shortcoming, the team fine-tuned DeepSeek-R1 on long chain-of-thought examples before applying RL. In contrast, QwQ-32B skipped preliminary fine-tuning and applied RL in two stages, first optimizing for correct responses and then for readability.
Why it matters: RL can dramatically boost LLMs’ reasoning abilities, but the order in which different behaviors are rewarded matters. Using RL in stages enabled the team to build a 32-billion-parameter model (small enough to run locally on a consumer GPU) that rivals a much bigger mixture-of-experts model, bringing powerful reasoning models within reach for more developers. The Qwen team plans to scale its RL approach to larger models, which could further improve reasoning abilities while adding greater knowledge.
We’re thinking: How far we’ve come since “Let’s think step by step”!