Reinforcement learning is emerging as an avenue for building large language models with advanced reasoning capabilities.
What’s new: Two recent high-performance models, DeepSeek-R1 (and its variants including DeepSeek-R1-Zero) and Kimi k1.5, learned to improve their generated lines of reasoning via reinforcement learning. o1 pioneered this approach last year.
Reinforcement learning (RL) basics: RL rewards or punishes a model for performing particular actions or achieving certain objectives. Unlike supervised learning, which compares the model's output to a known ground truth, RL doesn’t explicitly tell a model what it should output. Instead, the model starts out behaving randomly and discovers desired behaviors by earning rewards for its actions. This makes RL especially popular for training machine learning models that play games or control robots.
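As a toy illustration of that trial-and-error loop (not how any of the labs above train their models), here is a minimal epsilon-greedy bandit in Python. The actions and reward values are invented for the example; the point is only that the agent never sees a labeled answer, just rewards.

```python
import random

# Toy environment: three actions with hidden expected rewards.
# The agent is never told which action is "correct"; it only observes rewards.
TRUE_REWARDS = [0.2, 0.5, 0.8]  # hypothetical values for illustration

def pull(action):
    """Return a noisy reward for the chosen action."""
    return TRUE_REWARDS[action] + random.gauss(0, 0.1)

estimates = [0.0, 0.0, 0.0]   # the agent's running value estimates
counts = [0, 0, 0]
epsilon = 0.1                 # fraction of the time to explore at random

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore
    else:
        action = estimates.index(max(estimates))  # exploit the best estimate so far
    reward = pull(action)
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print("Learned value estimates:", [round(e, 2) for e in estimates])
# Starting from random behavior, the agent converges on action 2,
# the highest-reward behavior, without ever seeing a ground-truth label.
```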
How it works: To improve the chain of thought (CoT) generated by a large language model (LLM), reinforcement learning encourages the model to generate correct solutions to math, coding, science, and other problems that have known solutions. In typical LLM training, the model generates the next token of its output and receives feedback token by token. This method instead rewards the model for generating a sequence of reasoning steps that leads to an accurate conclusion, even if doing so requires generating many intermediate tokens between the prompt and the response (to plan an outline, check the conclusion, or reflect on the approach), and without explicit training on which reasoning steps to take.
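In outline, the reward depends only on whether the final answer checks out, not on the individual reasoning tokens. The sketch below shows one way such an outcome-based reward could look; the "Answer:" convention and function name are assumptions for illustration, not the labs' actual grading code, which uses more elaborate verifiers (exact-match checkers for math, unit tests for code, and so on).

```python
import re

def outcome_reward(generated_text: str, reference_answer: str) -> float:
    """Score a whole chain of thought based only on its final answer.

    Intermediate reasoning tokens are not scored individually; the model is
    free to plan, check, or backtrack as long as the conclusion is correct.
    """
    # Hypothetical convention: the model ends its response with "Answer: ...".
    match = re.search(r"Answer:\s*(.+)", generated_text)
    if match is None:
        return 0.0  # no parsable answer, no reward
    prediction = match.group(1).strip()
    return 1.0 if prediction == reference_answer.strip() else 0.0

# Example: a long chain of thought earns full reward if its conclusion is right.
cot = (
    "Let me outline a plan, compute 6 * 7, and double-check the result.\n"
    "6 * 7 = 42, and dividing back gives 42 / 7 = 6, so that checks out.\n"
    "Answer: 42"
)
print(outcome_reward(cot, "42"))  # 1.0
```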
- The DeepSeek team found that fine-tuning via reinforcement learning alone (after pretraining) was sufficient for DeepSeek-R1-Zero to learn problem-solving strategies like double-checking its answer. However, the model also showed quirky behaviors such as mixing different languages in its output. The team overcame these issues in DeepSeek-R1 by supervised fine-tuning on a small number of long CoT examples prior to reinforcement learning.
- Similarly, the Kimi k1.5 team found that fine-tuning the model on long CoTs prior to reinforcement learning enabled it to devise its own problem-solving strategies. The resulting long responses proved to be more accurate but also more expensive to generate, so the team added a second round of reinforcement learning that encouraged the model to produce shorter responses (a simplified reward of this kind is sketched after this list). On the AIME 2024 benchmark of advanced math problems, this process reduced the average number of tokens in the response by around 20 percent, and on MATH-500, it cut the average number of output tokens by roughly 10 percent.
- OpenAI has disclosed limited information about how it trained o1, but team members have said they used reinforcement learning to improve the model’s chain of thought.
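One simple way to encourage shorter responses, as in the Kimi k1.5 step described above, is to subtract a length penalty from the correctness reward. The sketch below is a simplified illustration with made-up budget and weight values, not the paper's actual formulation.

```python
def length_penalized_reward(correct: bool, num_tokens: int,
                            budget: int = 2048, penalty_weight: float = 0.5) -> float:
    """Combine an outcome reward with a penalty for overly long responses.

    `budget` and `penalty_weight` are illustrative knobs, not values from the paper.
    """
    base = 1.0 if correct else 0.0
    # Penalize only the portion of the response that exceeds the token budget.
    overflow = max(0, num_tokens - budget) / budget
    return base - penalty_weight * min(overflow, 1.0)

# A correct 3,000-token answer scores lower than a correct 1,500-token one,
# nudging the policy toward shorter chains of thought that stay accurate.
print(length_penalized_reward(correct=True, num_tokens=1500))  # 1.0
print(length_penalized_reward(correct=True, num_tokens=3000))  # ~0.77
```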
Behind the news: While RL has been a staple technique for training models to play games and control robots, its role in developing LLMs had largely been confined to aligning them with human preferences. Prior to the development of direct preference optimization, the primary methods were reinforcement learning from human feedback (RLHF), which matches the judgments of humans, and reinforcement learning from AI feedback (RLAIF), which matches the judgments of an AI model and is used in Constitutional AI.
Why it matters: Reinforcement learning has surprising utility in training large language models to reason. As researchers press models into service for more complex tasks — math, coding, animated graphics, and beyond — reinforcement learning is emerging as an important path to progress.
We’re thinking: Less than three years ago, reinforcement learning looked too finicky to be worth the trouble. Now it’s a key direction in language modeling. Machine learning continues to be full of surprising twists!