How much data do we want? More! How large should the model be? Bigger! How much more and how much bigger? New research estimates the impact of dataset and model sizes on neural network performance.
What’s new: Jonathan Rosenfeld and colleagues at MIT, York University, Harvard, Neural Magic, and Tel Aviv University introduced an equation — a so-called error landscape — that predicts how much data and how large a model it takes to generalize well.
Key insight: The researchers started from a few assumptions; for instance, an untrained model should perform no better than random guessing. They combined these assumptions with experimental observations to create an equation that works for a variety of architectures, model sizes, data types, and dataset sizes.
How it works: The researchers trained various state-of-the-art vision and language models on a number of benchmark datasets. Considering 30 combinations of architecture and dataset, they observed three effects when varying data and model size:
- For a fixed amount of data, increasing model size initially boosted performance on novel data, though the effect leveled off. The researchers observed a similar trend as they increased the amount of training data. The effect of boosting both model and dataset size was approximately the same as the combined impact of changing each one individually.
- An equation captures these observations. It describes the error as a function of model and data size, forming a 3D surface or error landscape.
- The equation contains task-dependent variables. Natural language processing, for instance, often requires more data than image processing. A simple regression can determine their values for a target dataset.
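The shape of such an error landscape can be sketched with a toy example. The functional form and all constants below are illustrative assumptions, not the paper’s exact equation: test error is modeled as a sum of power laws in dataset size and model size plus an irreducible floor, and a simple log-space regression recovers the data-scaling exponent.

```python
import numpy as np

# Hypothetical error landscape (illustrative, not the paper's equation):
# error falls as a power law in dataset size n and model size m,
# leveling off at an irreducible floor c.
def error_landscape(n, m, a=5.0, alpha=0.5, b=8.0, beta=0.6, c=0.02):
    return a * n ** -alpha + b * m ** -beta + c

# Synthetic "observations": vary dataset size while the model is so large
# that the model-size term is negligible and the data term dominates.
n = np.logspace(3, 6, 10)              # 1e3 .. 1e6 training examples
err = error_landscape(n, m=1e9)

# Subtracting the (here, known) floor makes the curve linear in log space,
# so a simple linear regression recovers the data-scaling exponent alpha.
slope, _ = np.polyfit(np.log(n), np.log(err - 0.02), 1)
alpha_hat = -slope                     # close to the true alpha of 0.5
```

In practice the exponents and floor are unknown and must all be fit jointly to measured errors; the point of the sketch is only that the landscape’s parameters are recoverable by ordinary regression.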
Results: After fitting dataset-specific variables to the validation dataset, the researchers compared the predicted model error against the true error on the test set. The predictions were within 1 percent of the true error, on average.
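The fit-then-predict procedure can be mimicked at toy scale: fit the power-law parameters on small datasets, then check the extrapolated prediction at a much larger size. All values here are synthetic and noiseless, so the agreement is exact; the paper’s 1 percent figure reflects real measurement noise.

```python
import numpy as np

# Synthetic "true" error curve: a power law in dataset size n plus a floor.
# All constants are illustrative, not taken from the paper.
def true_err(n):
    return 4.0 * n ** -0.4 + 0.05

# Fit log(error - floor) vs. log(n) on small datasets only...
n_fit = np.array([1e3, 3e3, 1e4, 3e4])
slope, intercept = np.polyfit(np.log(n_fit), np.log(true_err(n_fit) - 0.05), 1)

# ...then extrapolate the fitted curve to a much larger dataset.
n_big = 1e6
predicted = np.exp(intercept) * n_big ** slope + 0.05
actual = true_err(n_big)
rel_error = abs(predicted - actual) / actual   # tiny: the toy data is noiseless
```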
Why it matters: It turns out that the impact on accuracy of model and dataset size is predictable. How nice to have an alternative to trial and error!
Yes, but: When varying network sizes, the researchers focused mainly on scaling width while holding the rest of the architecture constant. Neural network “size” can’t be captured in a single number, and we look forward to future work that considers this nuance.
We’re thinking: Learning theory offers some predictions about how algorithmic performance should scale, but we’re glad to see empirically derived rules of thumb.