Recent work showed that models for multilingual machine translation can increase the number of languages they translate by scraping the web for pairs of equivalent sentences in different languages. A new study radically expanded the language repertoire through training on untranslated web text.
What’s new: Ankur Bapna, Isaac Caswell, and colleagues at Google collected a dataset of untranslated text that spans over 1,000 languages. Combining it with existing multilingual examples, they trained a model to translate many languages that are underrepresented in typical machine translation corpora.
Key insight: Neural networks typically learn to translate from multilingual sentence pairs, known as parallel data. Generally this requires millions of examples, which aren’t available for the vast majority of language pairs. However, neural networks can also learn from untranslated text, known as monolingual data, by being trained to fill in missing words in sentences. Combined training on carefully filtered parallel and monolingual data can enable a model to translate among languages that aren’t represented in parallel data.
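To make the fill-in-the-blank objective concrete, here’s a minimal sketch of how a monolingual sentence becomes a self-supervised training example. The masking scheme (a single contiguous span, a generic `<mask>` token, a 30 percent span length) is an illustrative assumption, not the paper’s exact recipe.

```python
import random

def make_fill_in_example(tokens, mask_token="<mask>", span_fraction=0.3):
    """Hide a contiguous span of a monolingual sentence.

    The model's input is the sentence with the span replaced by mask tokens;
    its target is the hidden span. Span length, placement, and the mask token
    are illustrative assumptions, not the paper's settings.
    """
    span_len = max(1, int(len(tokens) * span_fraction))
    start = random.randrange(0, len(tokens) - span_len + 1)
    masked_input = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    target = tokens[start:start + span_len]
    return masked_input, target

# Example: a Spanish sentence yields (input with a gap, span to predict).
make_fill_in_example("el gato duerme en el sofá".split())
```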
How it works: The authors scraped web text, classified its languages, filtered it, and combined what remained with existing monolingual data. Separately, they used an established corpus of parallel data. Then they trained a transformer on the combined monolingual and parallel datasets.
- The authors trained CLD3, a vanilla neural network, on an existing monolingual dataset to classify languages.
- CLD3 identified 1,745 languages in the scraped text (an off-the-shelf stand-in for this step is sketched after this list). The authors removed the languages that proved most difficult to classify and combined the remainder with existing data to produce a monolingual corpus of 1,140 languages.
- They eliminated languages that CLD3 had frequently confused with others. They removed sentences that CLD3 (or a more computationally expensive language classifier) had failed to classify either correctly or as a related dialect. They also discarded sentences in which fewer than 20 percent of the words were among the language’s 800 most frequently used terms, then dropped languages left with fewer than 25,000 sentences (both filters are sketched after this list). Finally, a team of native speakers devised criteria to filter out sentences from closely related languages.
- They trained a transformer to fill in missing parts of sentences in the monolingual data. Simultaneously, they trained it to translate examples in an existing parallel dataset that comprised 25 billion sentence pairs in 102 languages. This enabled the transformer to render a rough English translation from any language in the corpora.
- Continuing to train the model on both monolingual and parallel data, the authors added synthetic parallel data formed by pairing monolingual text with translations generated by the model (a sketch of this step follows the list). In learning to translate noisy, model-generated text into ground-truth text, the model learned to handle faulty grammar and usage. It also learned to translate in the opposite direction, from clean to noisy text. This pushed it to translate among various languages more consistently and helped to avoid drastic, possibly damaging model updates.
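The first two bullets describe standard language identification. The authors trained their own classifier covering far more languages, but the open-source gcld3 package (a Python binding for CLD3, installable with `pip install gcld3`) shows what this step looks like in practice; treat it as a stand-in rather than the authors’ pipeline.

```python
import gcld3  # pip install gcld3; wraps the CLD3 language classifier

# The off-the-shelf model covers roughly 100 languages; the authors trained
# their own, much broader CLD3-style classifier.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

def identify_language(sentence: str):
    """Return (language code, probability, reliability flag) for one sentence."""
    result = detector.FindLanguage(text=sentence)
    return result.language, result.probability, result.is_reliable

identify_language("Dit is een Nederlandse zin.")  # e.g. ('nl', 0.99, True)
```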
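The frequent-term and corpus-size filters in the third bullet translate directly into code. Below is a minimal sketch that assumes whitespace tokenization and lowercasing (simplifications; the thresholds are the ones reported above).

```python
from collections import Counter

def filter_by_frequent_terms(sentences, top_k=800, min_overlap=0.20):
    """Keep sentences in which at least 20 percent of the words are among
    the language's 800 most frequent terms. Whitespace tokenization and
    lowercasing are simplifying assumptions."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    frequent = {word for word, _ in counts.most_common(top_k)}
    kept = []
    for s in sentences:
        words = s.lower().split()
        if words and sum(w in frequent for w in words) / len(words) >= min_overlap:
            kept.append(s)
    return kept

def drop_small_languages(corpora, min_sentences=25_000):
    """Discard languages whose filtered corpus has fewer than 25,000 sentences."""
    return {lang: sents for lang, sents in corpora.items() if len(sents) >= min_sentences}
```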
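The synthetic pairs in the last bullet amount to back-translation-style self-training: the model translates monolingual text, and its noisy output is paired with the original sentence as new training data. The `model.translate` call below is a hypothetical interface used only to show the data flow.

```python
def build_synthetic_pairs(model, monolingual_sentences, src_lang, tgt_lang="en"):
    """Sketch of back-translation-style data augmentation.

    Each monolingual sentence is translated by the current model, and the
    noisy output becomes the source side of a new pair whose target is the
    original, human-written sentence. `model.translate` is a hypothetical
    interface, not the authors' code.
    """
    pairs = []
    for sentence in monolingual_sentences:
        noisy = model.translate(sentence, src=src_lang, tgt=tgt_lang)
        # Training on (noisy -> original) teaches the model to recover clean
        # text from imperfect input; the reverse direction is also used.
        pairs.append((noisy, sentence))
    return pairs
```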
Results: The authors compared their 1,000-language model with a version trained on 200 languages. Given a test set that comprised 38 languages, the 1,000-language model performed better on most of them (including those for which plenty of training data was available), which suggests that greater language diversity was beneficial. When translating all languages into English, the 1,000-language model outperformed the 200-language version by 2.5 points of CHRF, a measure of character n-gram overlap between generated and ground-truth translations. Translating from English to other languages, the 1,000-language version outperformed its 200-language counterpart by an average of 5.3 CHRF points.
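CHRF scores like those above can be computed with the open-source sacrebleu package. The snippet below is a usage sketch with made-up sentences, not the authors’ evaluation setup.

```python
# CHRF scores character n-gram overlap between system output and a reference
# translation (pip install sacrebleu).
from sacrebleu.metrics import CHRF

hypotheses = ["the cat sleeps on the sofa"]          # model output (made up)
references = [["the cat is sleeping on the couch"]]  # one reference set

score = CHRF().corpus_score(hypotheses, references)
print(score.score)  # corpus-level CHRF on a 0-100 scale
```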
Why it matters: Previous research cautioned against using monolingual data to expand a translator’s language repertoire. It was thought that training on languages that were less well represented in the dataset would diminish performance on better-represented ones. Yet this model, trained largely on monolingual data, performed well across a variety of languages. The authors hypothesize that, once a model learns a critical number of languages, additional languages are helpful because they’re likely to share similarities with languages the model already knows.
We’re thinking: The authors went out of their way to filter out less-useful training data. Their results show that scraping the web indiscriminately only gets you so far. Rigorous curation can make a big difference.