The emerging generation of trillion-parameter language models takes significant computation to train. Activating only a portion of the network for any given input can cut the requirement dramatically and still achieve exceptional results.
What’s new: Researchers at Google led by Nan Du, Yanping Huang, and Andrew M. Dai developed Generalist Language Model (GLaM), a trillion-parameter model for language tasks. Like the company’s earlier Switch Transformer, this work uses mixture-of-experts (MoE) layers to select which subset of the network to use for a given input. It provides a clearer picture of how MoE can save time and electricity in practical language tasks.
Key insight: A neural network’s parameter count entails a compromise between performance (bigger is better) and energy cost (smaller is better). MoE architectures use different subsets of their parameters to learn from different examples. Each MoE layer contains a group of vanilla neural networks, or experts, preceded by a gating module that learns to choose which ones to use based on the input, enabling different experts to specialize in particular types of examples. In this way, the network uses less energy and learns more than the size of any given subset might suggest.
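To make the routing idea concrete, here is a minimal NumPy sketch of a single MoE layer, not the authors’ implementation: a learned gate scores a bank of small feed-forward experts, keeps only the top-scoring two for each input, and mixes their outputs using the gate’s renormalized weights. The layer sizes, expert count, ReLU experts, and single-matrix gate are illustrative assumptions, far smaller and simpler than GLaM’s.

```python
# Minimal sketch of one mixture-of-experts layer (illustrative only; sizes,
# ReLU experts, and the single-matrix gate are assumptions, not GLaM's design).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 16, 32, 8, 2

# Each expert is a small two-layer feed-forward network.
experts = [(rng.normal(size=(d_model, d_hidden)) * 0.1,
            rng.normal(size=(d_hidden, d_model)) * 0.1) for _ in range(n_experts)]
# The gating module scores every expert for a given input representation.
gate_w = rng.normal(size=(d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route one token representation x (shape [d_model]) through its top-k experts."""
    scores = x @ gate_w
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over experts
    chosen = np.argsort(probs)[-top_k:]              # indices of the top-k experts
    weights = probs[chosen] / probs[chosen].sum()    # renormalized gate weights
    # Only the chosen experts run; the remaining experts' parameters stay idle.
    outputs = [np.maximum(x @ w1, 0.0) @ w2 for w1, w2 in (experts[i] for i in chosen)]
    return sum(w * out for w, out in zip(weights, outputs))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,) -- same shape as the input representation
```

The point of the design is that only the chosen experts’ weights participate in the computation for a given input, so the cost per example scales with the number of experts used rather than the total number of experts.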
How it works: The authors trained a transformer equipped with MoE layers (similar to GShard) to generate the next word or part of a word in a text sequence, using a proprietary 1.6-trillion-word corpus of webpages, books, social media conversations, forums, and news articles. They evaluated the model on 29 natural language tasks in seven categories such as question answering and logical reasoning.
- During training, each input token (a word or part of a word) passed through a stack of alternating self-attention and MoE layers.
- Each MoE layer started with a gating module. Given a representation from the previous attention layer, the gate selected two of the layer’s 64 experts and passed the representation to them. The two experts refined the representation separately, and a weighted average of their outputs went to the next self-attention layer (see the sketch after this list).
- After the last attention layer, a fully connected layer computed the word most likely to follow the input. Since two out of 64 experts were active in any given MoE layer, the network used only 8 percent of its parameters to render each output token.
- At inference, the authors evaluated their approach on zero- and one-shot tasks. In zero-shot tasks, given a prompt, the model generated an output (for example, an answer to an unseen question). In one-shot tasks, it received a randomly selected example of a completed task from a training set along with an input, and generated an output. (For instance, the model received a paragraph, a question about it, and the correct answer, and then answered a new question about a different paragraph.)
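The sketch below strings these pieces together, again as an illustrative toy rather than GLaM’s code: token embeddings flow through alternating self-attention and MoE blocks with top-2 routing over 64 experts, and a final linear layer scores the vocabulary for the most likely next token. Single-head attention, residual connections, the tiny dimensions, and the absence of layer normalization are simplifying assumptions.

```python
# Toy end-to-end version of the forward pass described above (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model, n_experts, top_k, n_blocks = 100, 16, 64, 2, 2

embed = rng.normal(size=(vocab, d_model)) * 0.1
unembed = rng.normal(size=(d_model, vocab)) * 0.1

def self_attention(h, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence h of shape [T, d_model]."""
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(h.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def moe(h, gate_w, experts):
    """Send each position to its top-2 experts and average their outputs by gate weight."""
    out = np.zeros_like(h)
    for t, x in enumerate(h):
        scores = x @ gate_w
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        chosen = np.argsort(probs)[-top_k:]
        w = probs[chosen] / probs[chosen].sum()
        out[t] = sum(wi * (np.maximum(x @ e1, 0.0) @ e2)
                     for wi, (e1, e2) in zip(w, (experts[i] for i in chosen)))
    return out

blocks = [dict(
    attn=[rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3)],
    gate=rng.normal(size=(d_model, n_experts)) * 0.1,
    experts=[(rng.normal(size=(d_model, 4 * d_model)) * 0.1,
              rng.normal(size=(4 * d_model, d_model)) * 0.1) for _ in range(n_experts)],
) for _ in range(n_blocks)]

def next_token_logits(token_ids):
    h = embed[token_ids]                        # [T, d_model]
    for b in blocks:
        h = h + self_attention(h, *b["attn"])   # self-attention sublayer
        h = h + moe(h, b["gate"], b["experts"]) # MoE sublayer: only 2 of 64 experts run per token
    return h[-1] @ unembed                      # scores for the word most likely to follow

print(next_token_logits(np.array([3, 14, 15])).argmax())
```

Per token, only the two selected experts in each MoE block, plus the shared attention and embedding weights, are exercised, which is why the active parameter count is a small fraction of the total.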
Results: Training the 1.2 trillion-parameter GLaM required 456 megawatt hours, roughly a third of the 1,287 megawatt hours needed to train the 175 billion-parameter GPT-3. Moreover, GLaM outperformed GPT-3 in six categories of zero-shot tasks and five categories of one-shot tasks. For example, answering trivia questions in one-shot TriviaQA, it achieved 75 percent accuracy, a state-of-the-art result, compared to GPT-3’s 68 percent.
Why it matters: Increased computational efficiency means lower energy costs, presumably making it easier for everyday engineers to train state-of-the-art models. It also means reduced CO2 emissions, sparing the planet some of the environmental impact incurred by AI.
We’re thinking: MoE models are attracting a lot of attention amid the public-relations race to claim ever higher parameter counts. Yes, building a mixture of 64 experts boosts the parameter count by 64 times, but it also means building 64 models instead of one. While this can work better than building a single model, it also diverts attention from other architectures that may yield insights deeper than “bigger is better.”