Train a multilingual language translator to translate between Spanish and English and between English and German, and it may be able to translate directly between Spanish and German as well. New work proposes a simple path to better machine translation between languages that weren’t explicitly paired during training.
What’s new: Danni Liu and researchers at Maastricht University and Facebook found that a small adjustment in the design of transformer networks improved zero-shot translations rendered by multilingual translators that are based on that architecture.
Key insight: Residual connections, which add a layer’s input to its output to prevent vanishing gradients, impose a one-to-one correspondence between positions in the two representations they connect. Transformers use residual connections throughout, so this correspondence runs all the way from the network’s input to its output. That correspondence can preserve a language’s word order in the representations the network extracts (for example, remembering that adjectives precede the nouns they describe), which causes problems for zero-shot translation when the output language orders adjectives and nouns differently. Removing the residual connection in a single layer should break the correspondence while preserving the benefits of residual connections elsewhere.
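To see the mechanics, here’s a minimal PyTorch sketch (our illustration, not the authors’ code) of a generic sublayer with and without a residual connection; `sublayer` is a stand-in for any transformation such as self-attention, and the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

d_model = 8
sublayer = nn.Linear(d_model, d_model)  # stand-in for self-attention
x = torch.randn(3, d_model)             # 3 token positions

# With a residual connection, position i of the output always carries
# a copy of position i of the input, hard-wiring a positional
# correspondence between the layer's input and output.
with_residual = x + sublayer(x)

# Without it, the output is free to rearrange positional information.
without_residual = sublayer(x)
```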
How it works: The authors used a transformer and removed the residual connections from its encoder’s middle layer (see the sketch after the list below).
- They trained the model on Europarl v7, IWSLT 2017, and PMIndia, which include texts in various languages paired with human translations into other languages.
- The model learned to translate between 18 language pairs that always included English. Given an input sentence and a target output language, the model minimized a loss based on how well each token (generally a word) it produced matched the corresponding token in a reference translation.
- The authors tested the model on pairs of the training languages that excluded English, yielding 134 zero-shot translation tasks.
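Here’s a minimal PyTorch sketch of the architectural change, assuming the residual connection is dropped around the self-attention sublayer of one middle encoder layer; the class name, layer count, dimensions, and the choice of layer 3 are illustrative, not the authors’ implementation:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Simplified transformer encoder layer with an optional residual
    connection around the self-attention sublayer."""

    def __init__(self, d_model=512, n_heads=8, use_residual=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.use_residual = use_residual

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # The modified layer skips adding x back, breaking the one-to-one
        # positional correspondence; all other layers keep the addition.
        x = self.norm1(attn_out + x if self.use_residual else attn_out)
        x = self.norm2(x + self.ffn(x))  # FFN residual kept everywhere
        return x

# Hypothetical 6-layer encoder: drop the residual only in the middle.
layers = nn.ModuleList(
    [EncoderLayer(use_residual=(i != 3)) for i in range(6)]
)
```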
Results: The authors compared their model’s zero-shot translations with those of an unmodified transformer using BLEU, a measure of how well a machine translation matches a reference translation (higher is better). On Europarl, removing residual connections boosted the average BLEU score from 8.2 to 26.7. On IWSLT, it raised the average from 10.8 to 17.7. On PMIndia, which includes low-resource languages, it lifted scores from 0.8 to 2.3.
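For reference, BLEU scores like these are typically computed with a library such as sacrebleu; a toy example with made-up sentences (not data from the paper):

```python
import sacrebleu

# Hypothetical system outputs and aligned reference translations.
hypotheses = ["das haus ist klein", "ich mag katzen"]
references = [["das Haus ist klein", "ich mag Katzen"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # higher is better
```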
Why it matters: The zero-shot approach opens doors in language translation. Many language pairs lack sufficient training data to train a translator via supervised learning. But if you have enough data for N languages, zero-shot translation extends coverage to roughly N² language pairs.
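To make that scaling concrete, a quick back-of-the-envelope count (illustrative numbers, not figures from the paper):

```python
# Supervised coverage grows linearly with the number of languages N
# when each is paired only with English, while the directions a
# zero-shot system can attempt grow quadratically.
for n in (5, 10, 20):
    supervised = 2 * (n - 1)  # X -> English and English -> X
    all_pairs = n * (n - 1)   # every ordered pair, English included
    print(f"{n} languages: {supervised} supervised vs {all_pairs} total")
```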
We’re thinking: Residual connections are all you don’t need!