Wait a minute — we added training data, and our model’s performance got worse?! New research offers a way to avoid so-called double descent.
What’s new: Double descent occurs when a model’s performance changes in unpredictable ways as the amount of training data or number of parameters crosses a certain threshold. The error falls as expected with additional data or parameters, but then rises, drops again, and may take further turns. Preetum Nakkiran and collaborators at Harvard, Stanford, and Microsoft found a way to eliminate double descent in some circumstances.
Key insight: The researchers evaluated double descent in terms of a model’s test error. Framing the problem this way led them to conclude that regularization, which discourages a model from having large weights, can prevent it. Where previous research characterized double descent only in the limit of infinitely large models or datasets, the authors’ analysis applies to models and datasets of any size. This enables them to offer a practical approach to managing the problem.
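For concreteness, the kind of regularization at issue in the linear setting below adds a penalty on the squared norm of the weights to the usual squared-error loss. The objective shown here is the standard ridge formulation with a generic penalty strength λ, not notation taken from the paper:

```latex
\min_{w}\;\frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^\top x_i\right)^2 \;+\; \lambda\,\lVert w \rVert_2^2
```

Larger values of λ pull the weights toward zero; the question the paper addresses is how to set λ so that test error falls smoothly as data or parameters are added.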
How it works: The researchers proved that double descent is manageable in linear regression models if the dataset meets certain criteria. They also demonstrated experimental results for a broader class of problems.
- A model’s test error is its average mean squared error over all possible test sets. If that error rises as the model or training set grows, the model may be suffering from double descent.
- The researchers analyzed linear regression models with L2 regularization, also called ridge regression. Selecting the right penalty for a given model and dataset size mitigates double descent when the inputs are Gaussian with zero mean and identity covariance (see the sketch after this list).
- In models that don’t use linear regression, such as simple convolutional neural networks, some regularization penalty values mitigated double descent. However, the researchers found no way to choose the penalty other than trial and error while peeking at the test set.
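Below is a minimal sketch (our own illustration, not the authors’ code) of the ridge-regression setting described in this list: isotropic Gaussian inputs, a training-set size n swept across the interpolation threshold n ≈ d, and a small grid of candidate penalties. With near-zero regularization the test error should spike around n ≈ d; picking the best penalty from the grid should smooth the curve out. The dimension, noise level, and penalty grid are arbitrary choices for the demo.

```python
# Sketch: ridge regression on isotropic Gaussian data, comparing near-zero
# regularization with a tuned penalty as n crosses the interpolation threshold.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
d = 200                                     # input dimension
w_true = rng.normal(size=d) / np.sqrt(d)    # ground-truth weights, norm ~ 1
noise = 0.5                                 # label noise standard deviation

def make_data(n):
    X = rng.normal(size=(n, d))             # zero-mean, identity-covariance inputs
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y

X_test, y_test = make_data(10_000)

for n in [50, 100, 150, 190, 200, 210, 250, 400, 800]:
    X, y = make_data(n)
    errs = {}
    for lam in [1e-6, 1e-2, 1e-1, 1.0, 10.0, 100.0]:   # candidate penalties
        model = Ridge(alpha=lam).fit(X, y)
        errs[lam] = mean_squared_error(y_test, model.predict(X_test))
    # The near-unregularized error should spike around n ≈ d; the best tuned
    # penalty should keep the error decreasing smoothly as n grows.
    print(f"n={n:4d}  near-unregularized={errs[1e-6]:.3f}  best tuned={min(errs.values()):.3f}")
```

In practice the penalty would be chosen on a held-out validation set; the sketch selects it against the test set purely to keep the illustration short.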
Results: The researchers proved that their regularization technique prevents double descent in linear regression models whose datasets meet certain criteria. They also tested linear regression models on datasets that didn’t meet all of those criteria, and in every case they considered, they found a regularization penalty that did the trick.
Yes, but: Although the technique avoided double descent in a variety of circumstances, particularly linear regression models, the authors were unable to prove that it works in every case.
Why it matters: This approach to mitigating double descent may look limited, since it applies only to some linear regression models. But improvements could have broad impact, given that linear regression is ubiquitous in neural network output layers.
We’re thinking: Double descent is sneaky. Researchers can miss it when they run benchmark datasets if they cherry-pick the best-performing models. And engineers can fail to detect it in applications because it isn’t predictable from results on the training set. It may be rare in practice, but we’d rather not have to worry about it.