The words “big” and “large” have similar meanings, but they aren’t always interchangeable: You wouldn’t refer to an older, male sibling as your “large brother” (unless you meant to be cheeky). Choosing among words with similar meanings is critical in language tasks like translation.
What’s new: Google used a top language model to develop BLEURT, a better way to evaluate translation models.
Background: Machine learning engineers typically evaluate a translation model’s ability to choose the right words by comparing its output against one or more human reference translations. The standard metric, BLEU, scores the n-gram overlap between a translation and its reference on a 0-to-1 scale, but it often misses nuances of meaning. BLEURT does a better job by training a language model to predict the semantic similarity between different sequences of words.
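To make the n-gram comparison concrete, here is a minimal sketch of a BLEU-like score (clipped n-gram precision plus a brevity penalty), written for this article rather than drawn from any official implementation; in practice a library such as sacrebleu handles the details, and the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Smooth zero counts so short toy sentences don't zero out the whole score.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(toy_bleu("the big dog ran home", "the big dog ran home"))
# Identical sentences score 1.0.
print(toy_bleu("the large dog ran home", "the big dog ran home"))
# One synonym swap breaks most n-grams, so the score collapses
# even though the meaning is essentially unchanged.
```

The second example shows exactly the nuance BLEU misses: “big” and “large” count as a mismatch because the metric looks at surface overlap, not meaning.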
Key insight: BERT is a general-purpose, unsupervised language model at the heart of many state-of-the-art systems. Fine-tuned on sentences that humans judge to be similar, it should learn to agree with human notions of similarity.
How it works: BLEURT uses BERT to extract a feature vector from a candidate translation paired with a reference sentence, and a linear layer predicts their similarity (a minimal sketch follows the list below).
- The researchers created a dataset of millions of sentence pairs. Each pair includes a sentence from Wikipedia and a version modified by randomly deleting some words and replacing others with similar ones.
- The researchers used BLEU and other techniques to estimate the similarity between these pairs.
- They pre-trained BLEURT to predict those measures of similarity.
- Then they fine-tuned it on a smaller set of human-annotated data to predict human similarity scores.
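Here is a minimal sketch of that setup using PyTorch and the Hugging Face transformers library; the perturbation function, target scores, and training loop are simplified stand-ins for the paper’s procedure, not the authors’ code.

```python
import random
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
score_head = nn.Linear(encoder.config.hidden_size, 1)  # linear layer on top of the [CLS] vector

def perturb(sentence):
    """Crude stand-in for the paper's perturbations: randomly drop a few words."""
    words = sentence.split()
    return " ".join(w for w in words if random.random() > 0.15) or sentence

def predict_similarity(reference, candidate):
    """Encode the sentence pair with BERT and map the [CLS] feature to a scalar score."""
    inputs = tokenizer(reference, candidate, return_tensors="pt",
                       truncation=True, padding=True)
    cls_vector = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token representation
    return score_head(cls_vector).squeeze(-1)

def train_step(pairs, targets, optimizer):
    """One regression step: predict similarity targets (automatic-metric scores
    during pretraining, human ratings during fine-tuning) with an MSE loss."""
    optimizer.zero_grad()
    preds = torch.cat([predict_similarity(r, c) for r, c in pairs])
    loss = nn.functional.mse_loss(preds, targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: pretrain on synthetic Wikipedia pairs scored by BLEU and similar metrics,
# then fine-tune on human-rated pairs with exactly the same loop.
encoder.train()
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(score_head.parameters()), lr=2e-5)
sentence = "The big dog ran home."
synthetic_pairs = [(sentence, perturb(sentence))]
synthetic_scores = torch.tensor([0.8])  # hypothetical BLEU-style score for the pair
train_step(synthetic_pairs, synthetic_scores, optimizer)
```

The same regression head and loss serve both phases; only the source of the target scores changes, from automatic metrics to human judgments.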
Results: The authors drew sentences from each of several datasets and created variations on them. BLEURT and BLEU each ranked the similarity between every variation and its original, and the authors measured each metric’s agreement with human rankings using Kendall Tau: the percentage of pairs ordered the same way as the humans minus the percentage ordered differently, so perfect agreement scores 1.0. BLEURT achieved a Kendall Tau of 0.338 while BLEU achieved 0.227, a nice bump that still leaves plenty of room for improvement.
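For reference, here is a tiny sketch of how such an agreement number can be computed with SciPy; the rankings below are invented for illustration, not the paper’s data.

```python
from scipy.stats import kendalltau

# Hypothetical quality rankings of five candidate translations (1 = best).
human_rank  = [1, 2, 3, 4, 5]
metric_rank = [1, 3, 2, 4, 5]  # the metric swaps two neighbors

tau, _ = kendalltau(human_rank, metric_rank)
print(tau)  # concordant-pair fraction minus discordant; 1.0 means perfect agreement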
Why it matters: Language models have improved by leaps and bounds in recent years, but they still stumble over context. Better word choices could improve not only automatic translation but the gamut of language tasks including chat, text summarization, sentiment analysis, question answering, and text classification.
We’re thinking: BLEU stands for Bilingual Evaluation Understudy. BERT stands for Bidirectional Encoder Representations from Transformers. Does anyone know what BLEURT stands for?