Model merging combines separate models into a single, more capable model without further training, but doing it well requires expertise and manual effort. Researchers automated the process.
What's new: Takuya Akiba and colleagues at Sakana, a research lab based in Tokyo, devised an automated method for merging models. It combines models trained for distinct tasks to produce models that perform well at the intersection of those tasks.
Key insight: Researchers have demonstrated various approaches to model merging. Earlier work showed that vision models with the same architecture can be combined, with good results, simply by averaging their corresponding weights, although subsequent studies revealed limitations of this approach. (When models have different architectures, averaging can still combine the parts they have in common.) An alternative is to stack layers drawn from different models. These methods can be varied and combined, yielding a vast space of possible merges. An automated process that tries combinations at random, identifies the best performers among the resulting models, and recombines those at random can discover high-performing combinations without relying on intuition and experience.
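To make the two basic operations concrete, here is a minimal sketch in Python using PyTorch-style state dicts. The function names, the fixed interpolation weight, and the toy checkpoints are illustrative assumptions, not taken from the authors' code.

```python
import torch


def average_weights(state_dict_a, state_dict_b, alpha=0.5):
    """Parameter-space merge: interpolate corresponding weights of two
    models that share an architecture."""
    return {
        name: alpha * state_dict_a[name] + (1.0 - alpha) * state_dict_b[name]
        for name in state_dict_a
    }


def stack_layers(layers_a, layers_b, recipe):
    """Layer-space merge: assemble a (possibly deeper) model by drawing whole
    layers from either source, e.g. recipe = [("a", 0), ("b", 0), ("a", 1)]."""
    sources = {"a": layers_a, "b": layers_b}
    return [sources[src][idx] for src, idx in recipe]


# Toy state dicts standing in for two checkpoints with the same architecture.
sd_a = {"layer.weight": torch.ones(2, 2)}
sd_b = {"layer.weight": torch.zeros(2, 2)}
merged = average_weights(sd_a, sd_b, alpha=0.7)  # 0.7 * A + 0.3 * B
```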
How it works: The authors aimed to build a large language model that would solve problems in Japanese. They used the algorithm known as Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to merge the Japanese-language LLM Shisa-Gamma and two math-specific, English-language LLMs: Abel and WizardMath. All three models were fine-tuned from Mistral 7B, which was pretrained on text from the web.
- The authors produced dozens of 10-billion-parameter models by merging the three initial ones. They merged the models by (i) combining weights of two or more layers from each model according to TIES-Merging and DARE and (ii) stacking either the combined layers or the original ones.
- They evaluated the merged models on 1,069 examples translated into Japanese from GSM8K, which contains grade-school word problems.
- They saved the models that performed best and repeated the process more than 100 times, merging the saved models and measuring their performance (a simplified sketch of this search loop appears after this list). The final model was the one with the highest accuracy on the translated GSM8K examples.
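A simplified version of that search loop might look like the following, using the open-source `cma` package for CMA-ES. The merge-and-evaluate step is reduced to a toy placeholder: in the actual pipeline it would build a candidate model via TIES-Merging/DARE and layer stacking, then score it on the translated GSM8K examples. The layer count, step size, and coefficient layout are assumptions for illustration, not the authors' settings.

```python
import cma
import numpy as np

NUM_LAYERS = 32   # transformer blocks in Mistral 7B
NUM_SOURCES = 3   # Shisa-Gamma, Abel, WizardMath


def evaluate_candidate(coeffs):
    """Placeholder fitness. In the real pipeline, `coeffs` would set per-layer
    merging weights (e.g., TIES-Merging/DARE mixing parameters), the merged
    model would be built, and its accuracy on the translated GSM8K set would
    be measured. CMA-ES minimizes, so the accuracy is negated."""
    toy_accuracy = 1.0 - float(np.mean((coeffs - 0.5) ** 2))  # stand-in score
    return -toy_accuracy


# One mixing coefficient per layer per source model, initialized uniformly.
x0 = np.full(NUM_LAYERS * NUM_SOURCES, 1.0 / NUM_SOURCES)
es = cma.CMAEvolutionStrategy(x0, 0.1)  # initial step size 0.1

for _ in range(100):                     # generations of the search
    candidates = es.ask()                # sample new merging recipes
    es.tell(candidates, [evaluate_candidate(c) for c in candidates])

best_recipe = es.result.xbest            # highest-accuracy recipe found
```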
Results: The authors evaluated their model on the Japanese subset of Multilingual Grade School Math (MGSM). The merged model achieved 55.2 percent accuracy. Among the source models, Abel achieved 30.0 percent accuracy, WizardMath 18.4 percent, and Shisa-Gamma 9.6 percent. The merged model's performance fell between that of GPT-3.5 (50.4 percent accuracy) and GPT-4 (78.8 percent accuracy), which presumably are an order of magnitude larger.
Why it matters: Combining existing models offers a way to take advantage of their strengths without further training. It can be especially valuable in building models at the intersection of tasks, such as understanding the Japanese language and solving math problems.
We're thinking: In addition to building new models, how can we make best use of the ones we already have? Merging them may be an efficient option.