Language models can’t correct your misspellings or suggest the next word in a text without knowing what language you’re using. For instance, if you type “tac-,” are you aiming for “taco,” a hand-held meal in Spanish, or “taca,” a crown in Turkish? Apple developed a way to head off such cross-lingual confusion.
What’s new: It’s fairly easy to identify a language given a few hundred words, but only we-need-to-discuss-our-relationship texts are that long. Apple built a system that can tell, for example, Italian from Turkish given a snippet no longer than a typical text message.
Key insight: Methods that identify languages in longer passages exploit well-studied statistical patterns among words. Detecting a language from a handful of words requires finding analogous patterns among individual characters.
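As a toy illustration (not the paper's method), even simple character-level statistics such as bigram counts start to separate languages on very short strings; the phrases below are arbitrary examples:

```python
# Toy illustration: character bigrams differ across languages
# even in SMS-length strings. The phrases are arbitrary examples.
from collections import Counter

def char_bigrams(text: str) -> Counter:
    """Count overlapping two-character sequences."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

italian = char_bigrams("gli spaghetti sono pronti")  # bigrams like "gl", "tt"
turkish = char_bigrams("bugün hava çok güzel")       # bigrams like "gü", "ço"
print(italian.most_common(5))
print(turkish.most_common(5))
```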
How it works: The system comprises only a lightweight biLSTM and a softmax layer. This architecture requires half the memory of previous methods.
- A separate model narrows the possibilities by classifying the character set: Do the characters belong to the Latin alphabet? Cyrillic? Hanzi? For instance, Turkish and many European languages use the Latin alphabet, while Japanese and Chinese use Hanzi.
- The biLSTM considers the order of input characters in both directions to squeeze out as much information as possible.
- A softmax layer then predicts the language from the features the biLSTM extracts, as in the sketch below.
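A minimal sketch of the two-stage pipeline, assuming a PyTorch implementation (the layer sizes, the Unicode-range script detector, and the candidate-language list are placeholder assumptions, not Apple's published details):

```python
# A rough sketch (not Apple's code) of the two-stage pipeline:
# 1) narrow candidates by script, 2) classify with a character-level biLSTM.
import torch
import torch.nn as nn

LATIN_LANGS = ["en", "es", "it", "tr"]  # hypothetical candidate set

def detect_script(text: str) -> str:
    """Crude stand-in for the script classifier, using Unicode ranges."""
    for ch in text:
        if "\u0400" <= ch <= "\u04ff":
            return "cyrillic"
        if "\u4e00" <= ch <= "\u9fff":
            return "hanzi"
    return "latin"

class CharBiLSTM(nn.Module):
    """Character embeddings -> biLSTM -> softmax over candidate languages."""
    def __init__(self, vocab_size=256, embed_dim=32, hidden=64,
                 n_langs=len(LATIN_LANGS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_langs)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        x = self.embed(char_ids)
        _, (h, _) = self.lstm(x)                 # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)      # join forward/backward states
        return self.out(h).softmax(dim=-1)       # per-language probabilities

model = CharBiLSTM()
text = "tac"
if detect_script(text) == "latin":
    ids = torch.tensor([[min(ord(c), 255) for c in text]])
    probs = model(ids)  # untrained here; shown only for shape and flow
```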
Results: The system identifies languages from as few as 50 characters as accurately as methods that require far longer text. Compared with Apple’s previous n-gram-based method, it improves average class accuracy on Latin-script languages from 78.6 percent to 85.7 percent.
Why it matters: Mobile devices don’t yet have the horsepower to run a state-of-the-art multilingual language model. Until they do, they’ll need to determine which single-language model to call.
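In practice, the identified language could key a simple dispatch table; the sketch below is a hypothetical pattern with placeholder stand-ins, not Apple's API:

```python
# Hypothetical dispatch: route the prefix to the matching
# single-language model once the classifier picks a language.
from typing import Callable, Dict

# Placeholder stand-ins for per-language completion models.
models: Dict[str, Callable[[str], str]] = {
    "es": lambda prefix: prefix + "o",  # e.g., "tac" -> "taco"
    "tr": lambda prefix: prefix + "a",  # e.g., "tac" -> "taca"
}

def complete(prefix: str, lang: str) -> str:
    """Call the single-language model chosen by the identifier."""
    return models[lang](prefix)

print(complete("tac", "es"))  # taco
```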
We’re thinking: Humans are sending more and more texts that consist entirely of emoji. We hope NLP systems don’t get lost in translation.