Large transformer networks work wonders with natural language, but they require enormous amounts of computation. New research slashes processor cycles without compromising performance.
What’s new: Swetha Mandava and a team at Nvidia reduced the number of self-attention layers in transformer-based language models. Their Pay Attention When Required (Par) approach achieves results comparable to those of Transformer-XL and Bert in substantially less time.
Key insight: Self-attention layers are among the most computationally expensive parts of a transformer network. Some of them can be replaced by cheaper feed-forward layers without hurting performance.
How it works: The authors used differentiable neural architecture search (DNAS), following earlier work that optimizes both error and processing latency. For each layer in a network, DNAS starts with a user-defined set of candidate blocks and learns the likelihood that each block is the best choice for that layer (a minimal sketch follows the list below). The authors searched for optimal networks of 32 and 24 layers, matching the layer counts of Transformer-XL and Bert respectively.
- The candidate blocks for each layer comprised three types: feed-forward, self-attention, and identity (which passes its input through unchanged).
- The authors trained their 32-layer network on Transformer-XL’s training corpus, WikiText-103. They trained their 24-layer model on Bert’s training corpus, Wikipedia+Books.
- The search yielded the most likely block for each layer. The authors built optimized networks accordingly and dubbed them Par-Transformer and Par-Bert.
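Below is a minimal PyTorch sketch of this kind of per-layer block search (illustrative names and hyperparameters, not Nvidia's implementation; the latency term in the search objective is omitted). Each layer mixes the outputs of self-attention, feed-forward, and identity blocks using softmax weights over learnable architecture parameters, and the highest-weight block is kept after training.

```python
import torch
import torch.nn as nn

class SearchableLayer(nn.Module):
    """One layer of the search space: a weighted mix of candidate blocks."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # One learnable architecture weight per candidate block:
        # [self-attention, feed-forward, identity]
        self.alpha = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)          # block probabilities
        attn_out, _ = self.attn(x, x, x)              # self-attention candidate
        return w[0] * attn_out + w[1] * self.ff(x) + w[2] * x

    def chosen_block(self):
        """After the search, keep only the most probable block for this layer."""
        return ("self-attention", "feed-forward", "identity")[int(self.alpha.argmax())]

# Example: a four-layer search space. Training its weights and alphas jointly on
# a language-modeling loss drives each layer toward its most effective block.
layers = nn.ModuleList(SearchableLayer() for _ in range(4))
x = torch.randn(2, 16, 512)                           # (batch, sequence, features)
for layer in layers:
    x = layer(x)
print([layer.chosen_block() for layer in layers])
```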
Results: Par-Transformer matched Transformer-XL in perplexity (a measure of a language model's predictive accuracy). It used roughly one-third as many self-attention blocks and ran in roughly one-third less time, performing inference in 9.9 milliseconds versus 15.2 milliseconds on Nvidia A100 GPUs. Par-Bert similarly matched Bert's perplexity with a slimmer model while cutting latency to 5.7 milliseconds from 8.6 milliseconds.
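As a refresher, perplexity is the exponential of the average negative log-likelihood a model assigns to the correct tokens, so lower is better. A quick sketch (the probabilities are made up for illustration, not numbers from the paper):

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of the correct tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Illustrative log-probabilities a model might assign to three correct tokens.
print(perplexity([math.log(0.2), math.log(0.05), math.log(0.5)]))  # ≈ 5.85
```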
Why it matters: Improving the runtime performance of transformer architectures could encourage their use in novel tasks.
We’re thinking: Transformer networks have come a long way in a short time and continue to improve rapidly. What an exciting time for deep learning!