Why build an ensemble of models when you can average their weights?
What’s new: A model whose weights were the mean of an ensemble of fine-tuned models performed as well as the ensemble and better than its best-performing constituent. Mitchell Wortsman led colleagues at the University of Washington, Tel Aviv University, Columbia University, Google, and Meta to build this so-called model soup.
Key insight: When fine-tuning a given architecture, it’s common to try many combinations of hyperparameters, collect the resulting models into an ensemble, and combine their results by, say, voting or taking an average. However, an ensemble’s computation and memory requirements at inference grow with each model it includes. Averaging the fine-tuned weights might achieve similar performance while requiring only a single model at inference.
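To make the contrast concrete, here is a minimal sketch assuming PyTorch models that share an architecture (the function names are illustrative, not from the paper): an output ensemble must run every fine-tuned model at inference, while a uniform weight average yields one model with the same inference cost as any single constituent.

```python
import copy
import torch

def ensemble_predict(models, x):
    # Output ensembling: run every fine-tuned model and average their predictions.
    # Inference cost grows with the number of models.
    return torch.stack([model(x) for model in models]).mean(dim=0)

def uniform_soup(models):
    # Weight averaging: average the models' parameters into a single model,
    # so inference costs the same as running one model.
    soup = copy.deepcopy(models[0])
    averaged = {
        key: torch.stack([m.state_dict()[key].float() for m in models]).mean(dim=0)
        for key in soup.state_dict()
    }
    soup.load_state_dict(averaged)
    return soup
```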
How it works: The authors built model soups from 72 copies of a pre-trained CLIP model, each fine-tuned on ImageNet with a different hyperparameter configuration.
- The authors fine-tuned the models by varying hyperparameters including data augmentations, learning rates, training durations, label smoothing (which discourages overconfident predictions by softening the one-hot target labels), and weight decay (which helps models generalize by encouraging weights to be closer to zero during training).
- They sorted the fine-tuned models according to their accuracy on the validation set.
- Starting with the best-performing model, they averaged its weights with those of the next-best performer. If validation accuracy improved, they kept the averaged weights; otherwise, they kept the previous weights. They repeated this process for each remaining fine-tuned model (a sketch of this greedy procedure follows the list).
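A rough sketch of that greedy selection loop, assuming PyTorch models with identical architectures, a list `finetuned_models` already sorted by validation accuracy, and a hypothetical `validation_accuracy` helper; here the candidate soup is taken to be the uniform average of every model kept so far plus the new candidate.

```python
import copy
import torch

def average_state_dicts(state_dicts):
    # Uniformly average matching tensors across a list of state dicts.
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

def greedy_soup(finetuned_models, validation_accuracy):
    # finetuned_models is assumed sorted from highest to lowest validation accuracy.
    ingredients = [finetuned_models[0].state_dict()]
    soup = copy.deepcopy(finetuned_models[0])
    best_accuracy = validation_accuracy(soup)

    for candidate in finetuned_models[1:]:
        # Try averaging the next-best model's weights into the soup.
        trial_weights = average_state_dicts(ingredients + [candidate.state_dict()])
        soup.load_state_dict(trial_weights)
        trial_accuracy = validation_accuracy(soup)
        if trial_accuracy > best_accuracy:
            # Validation accuracy improved, so keep the candidate as an ingredient.
            ingredients.append(candidate.state_dict())
            best_accuracy = trial_accuracy
        else:
            # Otherwise revert the soup to the previous averaged weights.
            soup.load_state_dict(average_state_dicts(ingredients))
    return soup
```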
Results: The authors’ model achieved 81.03 percent accuracy on ImageNet, while an ensemble of the 72 fine-tuned models achieved 81.19 percent and the single best-performing model achieved 80.38 percent. Tested for generalization on several distribution-shifted variants of ImageNet, the authors’ model achieved 50.75 percent average accuracy, the ensemble 50.77 percent, and the best individual model 47.83 percent.
Why it matters: When training models, it’s common to discard weaker models or build an ensemble. The model-soup method turns that otherwise wasted effort into better performance without adding computation or memory cost at inference.
We're thinking: Averaging weights from different points in a single training run has improved performance in prior work. It's good to find that this method extends to weights from separate fine-tuning runs.