The route to improving transformer-based language models like GPT-3 and Gopher, which are trained on immense quantities of text scraped from the web, has been to increase their size. But research into the relationship between dataset size and parameter count shows that, given a processing budget, bigger doesn’t necessarily mean better.
What’s new: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and colleagues at DeepMind determined the optimal data-to-parameter ratio for a range of processing budgets. They used this knowledge to train Chinchilla, a smaller but higher-performance version of Gopher.
Key insight: Pumping up dataset and architecture sizes can improve the performance of language models (with diminishing returns as they grow). But past studies didn’t account for the impact of the number of training tokens (the number of training steps multiplied by the number of tokens per step) or the learning rate schedule. A systematic study of these variables makes it possible to estimate the optimal model and data size for a particular processing budget.
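To make that estimation concrete, here’s a minimal sketch of the idea (not the paper’s code or data): for each compute budget, scan candidate parameter counts, use the common approximation that training costs about 6 FLOPs per parameter per token to find how many tokens the budget allows, pick the split with the lowest measured loss, then fit a power law to the winners. The loss surface, constants, and grids below are placeholder assumptions for illustration only.

```python
# Minimal sketch of the estimation idea, not the paper's code or data.
# placeholder_loss is a made-up stand-in for measured training losses; only the
# procedure (find the best split per budget, then fit a power law) is the point.
import numpy as np

def placeholder_loss(params, tokens):
    # Hypothetical loss surface: improves with more parameters and more tokens.
    return 1.7 + 400.0 / params**0.34 + 400.0 / tokens**0.28

budgets = np.logspace(18, 21, 7)          # FLOP budgets to probe
best_params = []
for C in budgets:
    params = np.logspace(7, 11, 400)      # candidate parameter counts
    tokens = C / (6.0 * params)           # tokens affordable under C ~ 6 * params * tokens
    best_params.append(params[np.argmin(placeholder_loss(params, tokens))])

# Fit params_opt ~ C^a in log-log space; tokens_opt then follows from the budget.
a, _ = np.polyfit(np.log10(budgets), np.log10(best_params), 1)
print(f"estimated exponent a in params_opt ~ C^a: {a:.2f}")   # ~0.45 for this placeholder surface
```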
How it works: The authors trained and tested hundreds of transformer-based language models using various combinations of parameter count, dataset size, training token count, and learning rate. They trained the models to predict the next word in text drawn from 2.35 billion documents scraped from the web.
- The authors experimented with a range of processing budgets (between 10^18 and 10^21 floating point operations, or FLOPs) by varying the number of model parameters (from 70 million to 10 billion) and training tokens (from 1 billion to 1 trillion). For each model, the authors also searched for the learning rate that resulted in the smallest loss at the end of training.
- The authors measured model performance by the loss value at the end of training. They determined the combinations of training token and parameter counts that led to the lowest loss value for each processing budget.
- They applied this information to the architecture and training procedure used to build Gopher, yielding Chinchilla. Both models were trained with a processing budget of 5.76 x 10^23 FLOPs. Gopher used 280 billion parameters and 300 billion training tokens, while Chinchilla used 70 billion parameters and 1.4 trillion training tokens.
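As a quick back-of-envelope check on those numbers, the common approximation that training costs about 6 FLOPs per parameter per token puts both configurations near the stated budget; the sketch below is just that arithmetic.

```python
# Rough check assuming the common approximation: training FLOPs ~ 6 * parameters * tokens.
def approx_train_flops(params, tokens):
    return 6 * params * tokens

print(f"Gopher:     {approx_train_flops(280e9, 300e9):.2e} FLOPs")   # ~5.0e23
print(f"Chinchilla: {approx_train_flops(70e9, 1.4e12):.2e} FLOPs")   # ~5.9e23, both near 5.76e23
```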
Results: Parameter count and training tokens should be scaled in equal proportion: doubling a compute-optimal model’s parameters calls for doubling its training tokens, which quadruples the processing budget. Given Gopher’s processing budget, Chinchilla outperformed its predecessor on several benchmarks with a quarter of its parameters. On BIG-bench, for example, Chinchilla’s average accuracy was 65.1 percent compared to Gopher’s 54.4 percent. On LAMBADA, which tests comprehension by asking the model to predict the final word of a passage, Chinchilla attained 77.4 percent accuracy, while Gopher achieved 74.5 percent and Megatron-Turing NLG, with a whopping 530 billion parameters, achieved 76.6 percent.
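Here’s a minimal sketch of that scaling rule, assuming the same 6-FLOPs-per-parameter-per-token approximation and the roughly 20-tokens-per-parameter ratio implied by Chinchilla’s 70 billion parameters and 1.4 trillion tokens (both rough assumptions, not the paper’s fitted values). Parameters and tokens each grow with the square root of the budget, so quadrupling compute doubles both.

```python
import math

TOKENS_PER_PARAM = 20        # ratio implied by Chinchilla (1.4e12 / 70e9); an approximation
FLOPS_PER_PARAM_TOKEN = 6    # common estimate of training FLOPs per parameter per token

def compute_optimal_split(flop_budget):
    """Split a FLOP budget into parameters and tokens, holding the ratio fixed."""
    params = math.sqrt(flop_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

n1, d1 = compute_optimal_split(5.76e23)       # Gopher/Chinchilla budget
n2, d2 = compute_optimal_split(4 * 5.76e23)   # quadruple the budget
print(f"budget C:  {n1:.1e} params, {d1:.1e} tokens")   # ~6.9e10 params, ~1.4e12 tokens
print(f"budget 4C: {n2:.1e} params, {d2:.1e} tokens")   # both roughly double
```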
Why it matters: Large models like Gopher aren’t reaching their full potential. Smaller models trained on more tokens run faster at inference and can achieve better performance.
We’re thinking: In light of this work, a monster model like Megatron-Turing NLG 530B would need to train on roughly 11 trillion tokens to reach its compute-optimal potential. All the text on the web encompasses only a couple trillion tokens!
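That 11-trillion figure is in the same ballpark as a quick check using the approximate 20-tokens-per-parameter ratio implied by Chinchilla (our back-of-envelope assumption, not a value taken from the paper):

```python
params = 530e9            # Megatron-Turing NLG 530B
tokens = 20 * params      # assumed ~20 tokens per parameter, the ratio implied by Chinchilla
print(f"{tokens:.2e} training tokens")   # ~1.06e13, i.e. roughly 11 trillion
```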