Mixture of Experts Pulls Ahead

Hunyuan-Large outshines open competitors with high benchmark scores

Figure: Model performance comparison across English, Chinese, math, and code tasks, with Hunyuan-Large leading.

A new open source large language model outperforms competitors, including the open-weights Llama 3.1 405B, on a variety of benchmarks.

What’s new: Tencent released Hunyuan-Large, a mixture-of-experts model with open code and open weights. It comes in base and instruction-tuned versions, both of which accept a relatively long input context of up to 256,000 tokens. It’s free for developers outside the European Union who have fewer than 100 million monthly users. You can experiment with it here.

Mixture of experts (MoE) basics: The MoE architecture uses different subsets of its parameters to process different inputs. Each MoE layer contains a group of neural networks, or experts, preceded by a gating module that learns to choose which one(s) to use based on the input. In this way, different experts learn to specialize in different types of examples. Because not all parameters are used to produce any given output, the network uses less energy and runs faster than models of similar size that use all parameters to process every input.
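
To make the routing idea concrete, here is a minimal, hypothetical sketch of an MoE layer in PyTorch. The class name, dimensions, and expert design are illustrative assumptions, not Tencent's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a gating module picks the top-k experts per token."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 16, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # gating module that scores experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)          # routing probabilities per token
        weights, indices = scores.topk(self.top_k, dim=-1)  # choose top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

With top_k=1 and 16 experts, each token passes through the gate plus a single expert, which is why only a fraction of the layer's parameters are active for any given input.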

How it works: Hunyuan-Large comprises 389 billion parameters but activates only 52 billion of them to process any given input. The team pretrained the model on 7 trillion tokens, primarily English and Chinese text: 5.5 trillion tokens from unspecified sources plus 1.5 trillion synthetic tokens generated by unspecified large language models that were “specialized” to provide expert-level responses in various domains. The team fine-tuned Hunyuan-Large on unspecified datasets of instructions and human feedback.

  • MoE models typically select which expert(s) to use based on the input. Hunyuan-Large chooses one of 16 experts, but it also uses a shared expert — an expert that processes every input.
  • Recent research suggests a formula for the optimal learning rate as a function of batch size (the number of examples a model sees during one training step). Because the shared expert and the specialized experts see different amounts of data in each training step, the team used that formula to adjust the learning rate for the specialized experts (a hedged sketch follows this list).
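
The article doesn't specify the formula, so the following is only an illustration. Assuming the common rule that the optimal learning rate scales with the square root of the batch size, and that each of 16 specialized experts sees roughly 1/16 as many tokens as the shared expert, the specialized experts' learning rate would be scaled down as in this PyTorch sketch (all values hypothetical):

```python
import math
import torch
import torch.nn as nn

NUM_SPECIALIZED = 16
d_model = 32

# Toy stand-ins for the experts in one MoE layer (not Tencent's architecture).
shared_expert = nn.Linear(d_model, d_model)  # processes every token
specialized_experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(NUM_SPECIALIZED))

base_lr = 3e-4                                             # learning rate tuned for the shared expert
specialized_lr = base_lr * math.sqrt(1 / NUM_SPECIALIZED)  # assumed sqrt(batch-size) scaling rule

# Separate optimizer parameter groups give each kind of expert its own learning rate.
optimizer = torch.optim.AdamW([
    {"params": shared_expert.parameters(), "lr": base_lr},
    {"params": specialized_experts.parameters(), "lr": specialized_lr},
])
```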

Results: The team compared the Hunyuan-Large models to four open-weights models and their instruction-tuned versions: Llama 3.1 70B, Llama 3.1 405B, and the MoE models Mixtral-8x22B and DeepSeek-V2.

  • Hunyuan-Large achieved the best performance on 15 of 19 benchmarks that test English, Chinese, math, and coding proficiency. For example, on MMLU (answering multiple choice questions in topics including elementary mathematics, history, computer science, and law), Hunyuan-Large achieved 88.4 percent accuracy. The next-best competitor, Llama 3.1 405B, achieved 85.2 percent.
  • The instruction-tuned version achieved the best performance on 10 of 13 benchmarks including measures of instruction-following ability and alignment with certain human preferences. For instance, Hunyuan-Large-Instruct maintained its dominance on MMLU (89.9 percent accuracy to Llama 3.1 405B Instruct’s 87.3 percent accuracy). On AlpacaEval 2, an instruction-following benchmark, Hunyuan-Large-Instruct achieved 51.8 percent, while the next-best competitor, DeepSeek 2.5 Chat, achieved 50.5 percent.

Why it matters: Hunyuan-Large generally outperforms Llama 3.1 405B, matching the performance of a 405 billion parameter model while activating only 52 billion parameters per input. That’s a significantly lower processing requirement, and the model is free for many purposes.

We’re thinking: Setting aside Switch Transformer — a 1.6 trillion parameter behemoth that was built to test the limits of size rather than performance — Hunyuan-Large is among the largest MoE models we’ve come across. It’s an impressive demonstration of what larger MoE models can accomplish.
