Direct Preference Optimization (DPO)

4 Posts


Phi-4 Beats Models Five Times Its Size: Microsoft’s Phi-4 blends synthetic and organic data to surpass larger models in math and reasoning benchmarks

Microsoft updated its smallest model family with a single, surprisingly high-performance model.

Breaking Jailbreaks: New E-DPO method strengthens defenses against jailbreak prompts

Jailbreak prompts can prod a large language model (LLM) to overstep built-in boundaries, for example by answering queries it was trained to refuse. Researchers devised a way to further boost the probability that LLMs will respect such limits.

More Factual LLMs: FactTune, a method to fine-tune LLMs for factual accuracy without human feedback

Large language models sometimes generate false statements. New work makes them more likely to produce factual output.

Human Feedback Without Reinforcement Learning: Direct Preference Optimization (DPO) fine-tunes pretrained large language models on human preferences without the cumbersome step of reinforcement learning.

Reinforcement learning from human feedback (RLHF) is widely used to fine-tune pretrained models to deliver outputs that align with human preferences. New work aligns pretrained models without the cumbersome step of reinforcement learning.
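
The core of DPO is a loss computed directly on preference pairs, with no reward model or RL loop. The sketch below is a minimal illustration under assumptions, not code from the paper or this newsletter: the function name dpo_loss, the dummy log-probabilities, and the choice of beta = 0.1 are made up for the example.

```python
# Minimal sketch of the DPO objective: given per-sequence log-probabilities of the
# chosen (preferred) and rejected responses under the policy being fine-tuned and
# under a frozen reference model, the loss increases the policy's relative
# preference for the chosen response.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (policy log-ratio - reference log-ratio)), averaged over the batch."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Dummy log-probabilities for a batch of two preference pairs (illustrative values only).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -10.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.2]))
print(loss.item())
```

In practice, the per-sequence log-probabilities would come from summing token log-probabilities of each response under the fine-tuned model and the frozen reference model; beta controls how far the policy is allowed to drift from the reference.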
