Last week we reported on a formula to determine model width and dataset size for optimal performance. A new paper contributes equations that optimize several more training parameters.
What’s new: Jared Kaplan and Sam McCandlish led researchers at Johns Hopkins and OpenAI in deriving equations that describe the effects of parameter count, training corpus size, batch size, and training time on language model performance, along with their own method for finding the best model and dataset sizes.
Key insight: The researchers devised a set of equations, each of which approximates performance as a function of a different pair of variables while holding the others fixed. It’s easier to reason about 2D graphs than to visualize an n-dimensional surface.
Findings: Three findings stand out. First, as many researchers have suspected, transformers outperform LSTMs when trained to convergence. Second, where data and compute are limited, it’s more efficient to train a large model in fewer training steps than to train a smaller model to convergence. Third, some researchers have conjectured that exceeding a so-called critical batch size brings diminishing returns. The researchers offer a way to estimate the optimal batch size.
How it works: The researchers trained models of many shapes and sizes on various subsets of a proprietary dataset of Reddit posts and linked articles. They measured the performance of every combination throughout training to track how each design choice affected the loss (a toy version of the underlying curve fitting appears in the first sketch after the list below).
- They derived a formula for loss as a function of model and dataset size that differs slightly from the one in recent MIT research, but they found a similar relationship: Increasing either parameter count or dataset size improves performance up to a point, after which the gains level off (see the second sketch after this list). They established a similar relationship between model size and number of training steps.
- The equation that evaluates loss for a given parameter count and number of training steps revealed a lower bound on the training step at which early stopping should occur to prevent overfitting. As you might expect, the number of training steps before overfitting sets in rises with dataset size.
- Optimal batch size depends on the model’s loss, they found, not on parameter count or dataset size. Optimal batch size rises as the loss decreases (see the third sketch after this list).
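To make the method concrete, here’s a minimal sketch of the kind of curve fitting involved, using an invented set of loss measurements and a simple power-law form like those in the paper. The data points below are hypothetical; the paper fits similar functional forms to losses measured across many real training runs.

```python
import numpy as np

# Hypothetical (parameter count, converged loss) measurements. These
# numbers are invented for illustration; the paper fits curves of this
# form to losses from many real training runs.
n_params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses = np.array([5.2, 4.4, 3.7, 3.1, 2.6])

# A power law L(N) = (N_c / N) ** alpha is a straight line in log-log
# space: log L = alpha * log N_c - alpha * log N. Fit it linearly.
slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"fitted alpha ≈ {alpha:.3f}, N_c ≈ {n_c:.2e}")
```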
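The paper’s combined equation for loss as a function of model size N and dataset size D takes the form L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^(α_D). Here’s a sketch of how the leveling-off plays out, using constants close to the paper’s reported fits; treat them as approximate rather than exact.

```python
# Loss as a function of model size N (parameters) and dataset size D
# (tokens). The functional form follows the paper; the constants are
# approximations of its fitted values, so treat outputs as illustrative.
ALPHA_N = 0.076
ALPHA_D = 0.095
N_C = 8.8e13  # parameters
D_C = 5.4e13  # tokens

def loss(n: float, d: float) -> float:
    """L(N, D) = [(N_c / N) ** (alpha_N / alpha_D) + D_c / D] ** alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# For a fixed 100M-parameter model, each 10x increase in data helps less:
for d in (1e9, 1e10, 1e11):
    print(f"D = {d:.0e} tokens -> loss ≈ {loss(1e8, d):.2f}")
```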
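Likewise, the critical batch size is reported to follow a power law in the loss alone, B_crit(L) = B_*/L^(1/α_B). A sketch with approximate constants from the paper, again illustrative rather than exact:

```python
# Critical batch size as a function of the current loss L, measured in
# tokens: B_crit(L) = B_star / L ** (1 / alpha_B). The constants are
# approximations of the paper's fitted values.
B_STAR = 2e8  # tokens
ALPHA_B = 0.21

def critical_batch_size(current_loss: float) -> float:
    """Optimal batch size depends only on the loss and rises as loss falls."""
    return B_STAR / current_loss ** (1.0 / ALPHA_B)

for l in (4.0, 3.0, 2.5):
    print(f"loss {l}: critical batch ≈ {critical_batch_size(l):.2e} tokens")
```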
Why it matters: Most machine learning practitioners don’t have the seemingly infinite computational resources that some large companies do. These insights should help them use resources more effectively.
We’re thinking: Natural language processing is notoriously compute-hungry. The ability to balance processing power against performance could not only save money but also reduce environmental impact.