Reinforcement Learning Heats Up: How DeepSeek-R1 and Kimi k1.5 use reinforcement learning to improve reasoning
Reinforcement learning is emerging as an avenue for building large language models with advanced reasoning capabilities.