The State of the Art Is Open: Meta’s Llama 3.1 outperforms GPT-4 in key areas

Meta raised the bar for large language models with open weights and published details about how it built one that outperforms GPT-4o and Claude 3.5 Sonnet by some measures.

What's new: Llama 3.1 405B delivers state-of-the-art performance on a handful of public benchmarks and has a context window of 128,000 input tokens while allowing a range of commercial uses. In addition to the 405-billion-parameter model, Meta released new versions of the earlier Llama 3 70B (70 billion parameters) and 8B (8 billion parameters). Model weights are available for download.

Key insight: Fine-tuning on generated data can improve a model’s performance, but incorrect or lower-quality examples degrade it. The Llama team undertook an extensive effort to fix or remove bad examples using the model itself, auxiliary models, and off-the-shelf tools.

How it works: Llama 3.1 models are transformers that have been pretrained to predict the next token in a sequence. Meta provided more information about the development of Llama 3.1 405B than about the smaller versions. Its pretraining dataset comprised 16.4 trillion tokens of text, “much” of it scraped from the web. The pretrained model was fine-tuned to perform seven tasks, including coding and reasoning, via supervised learning and direct preference optimization (DPO). Most of the fine-tuning data was generated by the model itself and curated using a variety of methods, including agentic workflows. For instance,

  • To generate good code to learn from, the team followed this loop (sketched in the first code example after this list): (1) Generated programming problems from random code snippets. (2) Generated a solution to each problem, prompting the model to follow good programming practices and explain its thought process in comments. (3) Ran the generated code through a parser and linter to check for issues like syntax errors, style issues, and uninitialized variables. (4) Generated unit tests. (5) Ran the code against the unit tests. (6) If there were any issues, regenerated the code, giving the model the original question, code, and feedback. (7) If the code passed all tests, added it to the dataset. (8) Fine-tuned the model. (9) Repeated this process several times.
  • To generate fine-tuning data that represented good lines of reasoning, the team followed this process (sketched in the second code example after this list): (1) Generated math questions and answers from existing math problems. (2) Manually identified the types of problems the model struggled with. (3) Asked humans to write questions for those problem types. (4) Generated step-by-step answers for those questions. (5) Removed examples that ended with the wrong answer. (6) Asked the model to determine whether the reasoning was correct. (7) Removed examples that the model identified as having incorrect reasoning. (8) Trained separate models to determine whether the reasoning was correct. (9) Used those models to filter out incorrect examples.
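The paper describes these loops in prose rather than code. As a rough illustration of the coding pipeline, here is a hypothetical Python sketch; `llm`, `lint`, and `run_tests` are stand-ins for the model and the parser/linter/test tooling, not names from the paper:

```python
from typing import Callable, Optional

def build_code_example(
    llm: Callable[[str], str],      # stand-in for the model being fine-tuned
    lint: Callable[[str], str],     # stand-in for the parser/linter; "" if clean
    run_tests: Callable[[str, str], str],  # stand-in for the test runner; "" if passing
    snippet: str,
    max_rounds: int = 3,
) -> Optional[str]:
    # Steps 1-2: turn a random code snippet into a problem, then solve it,
    # asking the model to follow good practices and explain its reasoning.
    problem = llm(f"Write a programming problem inspired by:\n{snippet}")
    code = llm(
        "Solve the following problem. Follow good programming practices "
        f"and explain your thought process in comments.\n{problem}"
    )
    # Step 4: generate unit tests for the solution.
    tests = llm(f"Write unit tests for this code:\n{code}")
    for _ in range(max_rounds):
        # Steps 3 and 5: static checks (parser/linter) plus unit tests.
        feedback = lint(code) + run_tests(code, tests)
        if not feedback:
            return code  # Step 7: passed every check; keep for fine-tuning.
        # Step 6: regenerate, giving the model the question, code, and feedback.
        code = llm(
            f"Problem:\n{problem}\nPrevious attempt:\n{code}\n"
            f"Fix these issues:\n{feedback}"
        )
    return None  # Never passed; drop this example.
```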
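Similarly, the reasoning-data filters might look like the following hypothetical sketch. Here `judge` stands in for the model critiquing its own reasoning and `verifier` for the separately trained correctness models; all names are illustrative:

```python
from typing import Callable, Iterable

def filter_reasoning_data(
    candidates: Iterable[tuple[str, str, str, str]],  # (question, steps, answer, gold)
    judge: Callable[[str], str],            # model-as-judge on its own reasoning
    verifier: Callable[[str, str], float],  # trained model scoring the reasoning
    threshold: float = 0.5,
) -> list[tuple[str, str, str]]:
    kept = []
    for question, steps, answer, gold in candidates:
        if answer != gold:
            continue  # Step 5: final answer is wrong; discard.
        verdict = judge(
            f"Is this reasoning correct? Answer yes or no.\nQ: {question}\n{steps}"
        )
        if not verdict.strip().lower().startswith("yes"):
            continue  # Steps 6-7: the model flagged its own reasoning; discard.
        if verifier(question, steps) < threshold:
            continue  # Steps 8-9: the trained verifier scores it as incorrect; discard.
        kept.append((question, steps, answer))
    return kept
```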

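Both pipelines feed the supervised fine-tuning and DPO stages mentioned above. For reference, the standard DPO objective is compact enough to sketch in a few lines of PyTorch (variable names are illustrative, not from Meta's code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct preference optimization: increase the margin by which the
    policy prefers the chosen response over the rejected one, relative
    to a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Loss = -log sigmoid(beta * (policy margin - reference margin)).
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```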
Results: The authors compared Llama 3.1 405B to Claude 3.5 Sonnet, GPT-4, GPT-4o, and Nemotron 4 340B on 16 public benchmarks. It either outperformed or tied the other models on seven of the 16 (although two of those, GSM8K and MMLU zero-shot chain-of-thought, are not directly comparable due to differences in prompting methods). For instance, Llama 3.1 405B set a new state of the art on IFEval (general knowledge), ARC Challenge (reasoning), and Nexus (tool use). The smaller versions also outperformed other models in their general size classes. Llama 3.1 70B set a new state of the art on every benchmark for general knowledge, coding, math, and reasoning. Llama 3.1 8B dominated the general-knowledge, coding, and math benchmarks.

License: Llama 3.1 models are released under a custom license that allows both commercial use (by companies with up to 700 million monthly active users in the month prior to Llama 3.1’s release) and training other models on generated data. This enables most companies to use the models as they like while potentially requiring Meta’s largest competitors to negotiate a commercial license.

The French connection: Separately, Mistral announced its next-generation LLM, Mistral Large 2, which allows noncommercial use but requires a special license for commercial use. The 123-billion-parameter model boasts performance similar to that of Llama 3.1 405B on a number of benchmarks despite being less than one-third the size.

Why it matters: The Llama 3.1 family continues Meta’s contributions to open models and extends them to some commercial uses. The upgraded 8B and 70B models perform better than their predecessors, while the 405B version rivals top proprietary models and enables researchers to generate high-quality synthetic data for training further models. The team provides extensive detail about how it generated fine-tuning data. For each task, the paper describes the pipeline used to create the data along with notes about what worked and what didn’t, helpful information for researchers who aim to build next-generation LLMs.

We're thinking: Data-centric AI, the discipline of systematically engineering data to build a successful AI system, is critical for machine learning. The Llama 3.1 paper makes clear that systematically engineering the training data was also a key to training what is, as far as we know, the first open weights model to achieve better performance than the best proprietary models on multiple benchmarks. The potential of open weights is looking better every day!
