Large transformer networks work wonders with natural language, but they require enormous amounts of computation. New research slashes processor cycles without compromising performance.
What’s new: Swetha Mandava and a team at Nvidia reduced the number of self-attention layers in transformer-based language models. Their Pay Attention When Required (Par) approach achieves results comparable to those of Transformer-XL and Bert in substantially less time.
Key insight: Self-attention layers are among the most computationally expensive parts of a transformer network. Some of them can be replaced by cheaper feed-forward layers without hurting performance.
How it works: The authors used differentiable neural architecture search (DNAS), following earlier work that optimizes both error and processing latency. For each layer in a network, DNAS starts with a user-defined set of candidate blocks and learns the likelihood that each block is the best choice for that layer (a minimal sketch follows the list below). The authors searched for optimal networks of 32 and 24 layers, matching the layer counts of Transformer-XL and Bert respectively.
- The candidate blocks for each layer comprised three types: feed-forward, self-attention, and identity (which passes its input through unchanged).
- The authors trained their 32-layer network on Transformer-XL’s training corpus, WikiText-103. They trained their 24-layer model on Bert’s training corpus, Wikipedia+Books.
- The search yielded the most likely block for each layer. The authors built optimized networks accordingly and dubbed them Par-Transformer and Par-Bert.
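Below is a minimal PyTorch sketch of this kind of per-layer block search (illustrative names and hyperparameters, not Nvidia's implementation; the latency term in the search objective is omitted). Each layer mixes the outputs of self-attention, feed-forward, and identity blocks using softmax weights over learnable architecture parameters, and the highest-weight block is kept after training.

```python
import torch
import torch.nn as nn

class SearchableLayer(nn.Module):
    """One layer of the search space: a weighted mix of candidate blocks."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # One learnable architecture weight per candidate block:
        # [self-attention, feed-forward, identity]
        self.alpha = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)          # block probabilities
        attn_out, _ = self.attn(x, x, x)              # self-attention candidate
        return w[0] * attn_out + w[1] * self.ff(x) + w[2] * x

    def chosen_block(self):
        """After the search, keep only the most probable block for this layer."""
        return ("self-attention", "feed-forward", "identity")[int(self.alpha.argmax())]

# Example: a four-layer search space. Training its weights and alphas jointly on
# a language-modeling loss drives each layer toward its most effective block.
layers = nn.ModuleList(SearchableLayer() for _ in range(4))
x = torch.randn(2, 16, 512)                           # (batch, sequence, features)
for layer in layers:
    x = layer(x)
print([layer.chosen_block() for layer in layers])
```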
Results: Par-Transformer matched Transformer-XL in perplexity (a measure of a language model's predictive accuracy). It used roughly one-third as many self-attention blocks and ran in roughly one-third less time, performing inference in 9.9 milliseconds versus 15.2 milliseconds on Nvidia A100 GPUs. Par-Bert similarly matched Bert's perplexity with a slimmer model while cutting latency to 5.7 milliseconds from 8.6 milliseconds.
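As a refresher, perplexity is the exponential of the average negative log-likelihood a model assigns to the correct tokens, so lower is better. A quick sketch (the probabilities are made up for illustration, not numbers from the paper):

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of the correct tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Illustrative log-probabilities a model might assign to three correct tokens.
print(perplexity([math.log(0.2), math.log(0.05), math.log(0.5)]))  # ≈ 5.85
```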
Why it matters: Improving the runtime performance of transformer architectures could encourage their use in novel tasks.
We’re thinking: Transformer networks have come a long way in a short time and continue to improve rapidly. What an exciting time for deep learning!