Better Performance From Merged Models
Localize-and-Stitch improves methods for merging multiple fine-tuned models

Diagram of Localize-and-Stitch merging fine-tuned models by combining critical weights into one model.

Merging multiple fine-tuned models into one is a less expensive alternative to hosting multiple specialized models. But while a merged model can deliver solid average performance across several tasks, it often performs worse on individual tasks than the specialized models trained for them. New work addresses this issue.

What’s new: Yifei He and colleagues at the University of Illinois Urbana-Champaign and the Hong Kong University of Science and Technology proposed a model merging method called Localize-and-Stitch. Where the 2022 “model soups” paper averaged all weights of a number of fine-tuned versions of the same base model, the new method selectively retains only the weights that are most relevant to each task.
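For reference, the model-soups baseline amounts to a uniform average of corresponding weights. Here is a minimal sketch in PyTorch-style Python; the function name and use of state dicts are our own illustration, not code from either paper.

```python
import torch

def average_weights(state_dicts):
    """Uniformly average corresponding weights across fine-tuned
    checkpoints of the same base model (the model-soups recipe)."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged
```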

Key insight: Naively merging fine-tuned models by averaging their corresponding weights can lead to suboptimal performance, because different fine-tuned models may use the same subset of weights to perform different tasks. For instance, one model may have learned to use a particular subset of weights to detect HTML code, while another learned to use the same subset to detect city names. Averaging them would likely result in a merged model that underperformed both fine-tuned models on those tasks. However, research has shown that fine-tuning updates many weights redundantly: a small subset of total parameters (around 1 percent) is enough to maintain a fine-tuned model’s performance on its target task. These subsets are small enough that they’re unlikely to overlap across models, so retaining them, rather than averaging everything, improves the merged model’s performance.
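To see why such small subsets rarely collide, consider a back-of-the-envelope check. This is our illustration only; it assumes the retained subsets behave like independent random 1 percent masks, a simplification of the masks the authors actually learn.

```python
keep_ratio = 0.01  # roughly 1 percent of parameters retained per task
num_tasks = 12     # e.g., the 12 GLUE fine-tunes described below

# Expected fraction of entries that any two given tasks both retain
pairwise_overlap = keep_ratio ** 2

# Chance that a particular entry is retained by two or more tasks
p_none = (1 - keep_ratio) ** num_tasks
p_one = num_tasks * keep_ratio * (1 - keep_ratio) ** (num_tasks - 1)
p_shared = 1 - p_none - p_one

print(f"pairwise overlap: {pairwise_overlap:.2%}")    # 0.01% of entries
print(f"entry claimed by 2+ tasks: {p_shared:.2%}")   # under 1% of entries
```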

How it works: The authors experimented with RoBERTa-base, GPT2-XL, and CLIP. They created 12 variations on the RoBERTa-base language encoder, fine-tuning each on a different task from GLUE such as question answering or sentiment classification. They downloaded three versions of GPT2-XL that had been fine-tuned for instruction following, scientific knowledge, and truthfulness. Finally, they created eight variations on CLIP by fine-tuning each on a different image classification dataset, including handwritten digits, photos of various makes/models/years of cars, and satellite images of forests, pastures, bodies of water, buildings, and the like.

  • The authors identified task-specific weights in each fine-tuned model. To accomplish this, they decomposed the fine-tuned model’s weights into pretrained weights plus differences.
  • They identified the smallest number of differences that maximized performance on the task. They zeroed out the rest.
  • Where the nonzero entries did not overlap, they added the differences to the pretrained weights. In the unlikely case that the nonzero entries overlapped, they averaged the weights of the fine-tuned models. (A code sketch of this procedure appears after this list.)
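The merging procedure can be sketched in a few lines of PyTorch-style Python. The following is our own illustration, not the authors’ code: it uses magnitude-based selection of the largest weight differences as a stand-in for the sparse masks the authors actually optimize, and the function and argument names are hypothetical.

```python
import torch

def localize_and_stitch(pretrained, finetuned, keep_ratio=0.01):
    """Keep each task's largest weight differences from the pretrained model,
    then stitch them onto the pretrained weights, averaging the fine-tuned
    values wherever the retained entries overlap."""
    diff_sum = {k: torch.zeros_like(v, dtype=torch.float32) for k, v in pretrained.items()}
    counts = {k: torch.zeros_like(v, dtype=torch.float32) for k, v in pretrained.items()}

    for ft in finetuned:  # one state dict per fine-tuned model
        for k, base in pretrained.items():
            diff = ft[k].float() - base.float()  # decompose into pretrained weights plus differences
            n_keep = max(1, int(keep_ratio * diff.numel()))
            threshold = diff.abs().flatten().topk(n_keep).values.min()
            mask = (diff.abs() >= threshold).float()  # retain only the largest differences
            diff_sum[k] += mask * diff
            counts[k] += mask

    merged = {}
    for k, base in pretrained.items():
        overlap = counts[k].clamp(min=1.0)  # divide by 2 or more where retained entries overlap
        merged[k] = base.float() + diff_sum[k] / overlap
    return merged
```

Entries retained by no task keep their pretrained values, entries retained by one task take that task’s fine-tuned value, and entries retained by several tasks end up averaged, matching the behavior described in the list above.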

Results: Models merged using Localize-and-Stitch outperformed or nearly matched the same models merged using earlier methods, though they underperformed individual models fine-tuned for each task.

  • Using Localize-and-Stitch to merge the fine-tuned versions of RoBERTa-base, the merged model achieved a 75.9 percent average score on GLUE. The previous best method, RegMean, achieved 73.9 percent. The individual models fine-tuned for each GLUE task achieved an average of 81.1 percent.
  • The fine-tuned versions of GPT2-XL that were merged using Localize-and-Stitch achieved a 36.7 percent average score across MMLU, ARC, and TruthfulQA. The versions merged by averaging corresponding weights achieved 34.4 percent. The individual fine-tuned models achieved an average of 41.1 percent.
  • The fine-tuned versions of CLIP that were merged via Localize-and-Stitch achieved an average score of 79.9 percent across the eight vision tasks. Versions merged using AdaMerging achieved 80.1 percent. The individual fine-tuned models achieved an average of 90.5 percent.

Yes, but: The authors didn’t compare Localize-and-Stitch to a common alternative to model merging, multi-task learning. This approach trains a model on data from multiple datasets simultaneously. Without multi-task baselines, it’s difficult to fully assess the advantages of Localize-and-Stitch in scenarios where multi-task learning is also an option.

Why it matters: Model merging is a computationally efficient way to sharpen a model’s ability to perform certain tasks compared to multi-task learning, which requires training on all tasks. Localize-and-Stitch refines this process to achieve higher performance.

We’re thinking: This recipe adds spice to model soups!
