Microsoft updated its smallest model family with a single, surprisingly high-performance model.
What’s new: Marah Abdin and a team at Microsoft released Phi-4, a 14-billion-parameter large language model that outperforms the much larger Llama 3.3 70B and Qwen 2.5 72B on math and reasoning benchmarks. The model is available on Azure AI Foundry under a license that permits non-commercial uses, and the weights will be released via Hugging Face next week.
How it works: Phi-4 is a transformer that processes up to 16,000 tokens of input context. The way the authors constructed the pretraining and fine-tuning datasets accounts for most of its performance advantage over other models.
- Much of the pretraining set came from known high-quality datasets and repositories of high-quality web data such as books and research papers. The authors also filtered websites using classifiers they had trained to recognize high-quality text (the first sketch after this list illustrates this kind of filter).
- The rest of the pretraining data was generated or rewritten by GPT-4o. Given snippets of text from web pages, code, scientific papers, and books, GPT-4o rewrote them as exercises, discussions, question-and-answer pairs, and structured reasoning tasks. GPT-4o then improved its accuracy via a feedback loop in which it critiqued its own outputs and generated new ones (the second sketch below illustrates such a loop).
- The authors fine-tuned Phi-4 on existing and newly generated data acquired in similar ways.
- They further fine-tuned it on two rounds of generated data using Direct Preference Optimization (DPO), which trains a model to be more likely to generate a preferred example and less likely to generate a not-preferred one. In the first round, the authors built preferred/not-preferred pairs by identifying pivotal tokens in generated responses: a token counted as pivotal if, after the model generated it (as part of a partial response), the probability that the model ultimately would produce a correct output significantly rose or fell. They estimated this probability by generating multiple completions of a given prompt and measuring the percentage of times the model produced the correct answer after generating a given token. Each pair took the tokens generated before the pivotal token as the input, a token that raised the probability of a correct answer as the preferred token, and a token that lowered it as the not-preferred token (the third sketch below walks through this procedure).
- In the second round of generating preferred/not-preferred pairs and fine-tuning via DPO, the authors generated responses from GPT-4o, GPT-4 Turbo, and Phi-4, then used GPT-4o to rate them. Highly rated responses served as preferred examples, and lower-rated responses served as not-preferred examples (see the final sketch below).
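The paper doesn’t specify the architecture of Microsoft’s quality classifiers, so the following is only a minimal sketch of the general technique, assuming a logistic-regression classifier over TF-IDF features and a small set of hand-labeled seed documents (all data here is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labeled seed data: 1 = high quality, 0 = low quality.
seed_docs = [
    "We prove the convergence of the proposed estimator under mild assumptions.",
    "CLICK HERE to WIN a FREE iPhone!!! Limited time offer!!!",
]
seed_labels = [1, 0]

vectorizer = TfidfVectorizer(max_features=50_000)
classifier = LogisticRegression()
classifier.fit(vectorizer.fit_transform(seed_docs), seed_labels)

def keep(document: str, threshold: float = 0.9) -> bool:
    """Keep a scraped page only if the classifier is confident it's high quality."""
    prob_high = classifier.predict_proba(vectorizer.transform([document]))[0, 1]
    return prob_high >= threshold

scraped_pages = ["..."]  # placeholder for a web crawl
pretraining_candidates = [page for page in scraped_pages if keep(page)]
```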
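The paper describes the rewrite-and-critique loop only at a high level. Here is a minimal sketch using the OpenAI Python client; the prompts and the number of critique rounds are assumptions, not the team’s actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def rewrite_with_critique(snippet: str, rounds: int = 2) -> str:
    # First pass: turn raw source text into structured training material.
    draft = ask(f"Rewrite this text as a question-and-answer exercise:\n\n{snippet}")
    for _ in range(rounds):
        # The model critiques its own output, then revises accordingly.
        critique = ask(f"List factual and pedagogical flaws in this exercise:\n\n{draft}")
        draft = ask(
            f"Revise the exercise to fix the listed flaws.\n\n"
            f"Exercise:\n{draft}\n\nFlaws:\n{critique}"
        )
    return draft
```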
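To make the first DPO round concrete, here is a minimal sketch of pair construction from pivotal tokens. The functions `sample_completions` and `is_correct` are hypothetical stand-ins for sampling Phi-4 and checking its final answer against ground truth, and the threshold is an assumed value, not one from the paper:

```python
def sample_completions(prefix: list[str], n: int) -> list[str]:
    raise NotImplementedError("stand-in: sample n completions from the model")

def is_correct(completion: str) -> bool:
    raise NotImplementedError("stand-in: compare the final answer to ground truth")

def success_rate(prefix: list[str], n: int = 32) -> float:
    """Estimate P(correct final answer | tokens generated so far) by sampling."""
    completions = sample_completions(prefix, n)
    return sum(is_correct(c) for c in completions) / n

def pivotal_dpo_pairs(prompt_tokens, response_tokens, alternatives, threshold=0.2):
    """Build (input, preferred token, not-preferred token) triples.

    alternatives[i] is a candidate token the model might have generated
    instead at position i (hypothetical; the paper derives alternatives
    from the model's own sampling).
    """
    pairs, prefix = [], list(prompt_tokens)
    p_before = success_rate(prefix)
    for i, token in enumerate(response_tokens):
        p_after = success_rate(prefix + [token])
        if abs(p_after - p_before) >= threshold:  # this token is pivotal
            rival = alternatives[i]
            p_rival = success_rate(prefix + [rival])
            # The token that raises the odds of a correct answer is preferred;
            # the one that lowers them is not preferred. Both share the prefix.
            if p_after >= p_rival:
                pairs.append((prefix.copy(), token, rival))
            else:
                pairs.append((prefix.copy(), rival, token))
        prefix.append(token)
        p_before = p_after
    return pairs
```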
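Finally, a minimal sketch of the second round’s judging step, again via the OpenAI client; the rating prompt and the 1-to-10 scale are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def rate(prompt: str, response: str) -> float:
    """Ask GPT-4o to score a response; higher is better."""
    judgment = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rate the response from 1 (worst) to 10 (best). "
                f"Reply with a number only.\n\nPrompt: {prompt}\n\nResponse: {response}"
            ),
        }],
    )
    return float(judgment.choices[0].message.content.strip())

def judge_pair(prompt: str, candidates: list[str]) -> tuple[str, str]:
    """Return (preferred, not_preferred) responses for a DPO pair."""
    ranked = sorted(candidates, key=lambda r: rate(prompt, r))
    return ranked[-1], ranked[0]  # highest-rated preferred, lowest-rated rejected
```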
Results: Across 13 benchmarks, Phi-4 outperforms Llama 3.3 70B (its most recent open-weights competitor) on six and Qwen 2.5 72B on five.
- Phi-4 outperforms Llama 3.3 70B, Qwen 2.5 72B, and GPT-4o on GPQA (graduate-level questions and answers) and MATH (competition-level math problems).
- However, Llama 3.3 70B wins on DROP (reading comprehension) and SimpleQA (answering questions about basic facts), and it performs significantly better on IFEval (instruction following).
Why it matters: Phi-4 shows that there’s still room to improve the performance of small models by curating training data, following the age-old adage that better data makes a better model.
We’re thinking: Some researchers found that earlier versions of Phi showed signs of overfitting to certain benchmarks. In their paper, the Microsoft team stressed that they had improved the data decontamination process for Phi-4 and added an appendix on their method. We trust that independent tests will show that Phi-4 is as impressive as its benchmark scores suggest.