Feb 28, 2024
Human Feedback Without Reinforcement Learning: Direct Preference Optimization (DPO) fine-tunes pretrained large language models on human preferences without the cumbersome step of reinforcement learning.
Reinforcement learning from human feedback (RLHF) is widely used to fine-tune pretrained models to deliver outputs that align with human preferences. New work achieves the same alignment while skipping the reinforcement learning step entirely.
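At its core, DPO replaces the RL step with a simple classification-style loss computed directly on pairs of preferred and dispreferred responses, using a frozen copy of the initial model as a reference. Below is a minimal PyTorch sketch of that objective under stated assumptions: the inputs are per-response log-probabilities already summed over tokens, and the tensor names and the beta value are illustrative, not part of any particular library's API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective for a batch of preference pairs.

    Each argument is a 1-D tensor with one entry per (prompt, response) pair:
    the trainable policy's and the frozen reference model's summed
    log-probabilities for the human-preferred ("chosen") and the
    dispreferred ("rejected") responses.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Widen the margin between chosen and rejected log-ratios,
    # scaled by beta, via a logistic (sigmoid) loss.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In effect, this trains the model to assign relatively more probability to responses humans preferred, while the reference-model log-ratios keep it from drifting too far from its pretrained behavior, which is the role the KL penalty and reward model play in standard RLHF.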