A word-embedding model typically learns vector representations from a large, general-purpose corpus like Google News. But to make the resulting vectors useful in a specialized domain, such as veterinary medicine, they must be fine-tuned on a smaller, domain-specific dataset. Researchers from Facebook AI offer a more accurate method.
What’s new: Rather than fine-tuning, Piotr Bojanowski and colleagues developed a model that aligns word vectors learned from general and specialized corpora.
Key insight: The authors drew inspiration from the way multilingual word vectors are learned. They treated the general-purpose and domain-specific corpora as separate languages and used a word-embedding model to learn independent vectors from each. Then they aligned the vectors from one corpus with those from the other.
How it works: To align word vectors from two corpora, the method uses words common to both to find a consistent representation for all words. For example, if one corpus’s vocabulary is {human, cat} and the other’s is {cat, dog}, the model applies a transformation that brings the two cat vectors together while preserving the relative positions of the human, cat, and dog vectors.
- A word-embedding model learns independent word vectors from both corpora.
- For words that appear in both corpora, the alignment model learns a linear mapping from general-purpose vectors to domain-specific vectors. The mapping is found by solving a least-squares problem that minimizes the distance between the mapped general-purpose vectors and their domain-specific counterparts (see the sketch after this list).
- The authors train the mapping with a loss function called RCSLS (relaxed cross-domain similarity local scaling). RCSLS balances two objectives: general-purpose vectors that are close together should remain close together after mapping, while those that are far apart should remain far apart (see the loss sketch below).
- Words that appear in both corpora now have two vectors each: the mapped general-purpose vector and the domain-specific vector. Averaging them produces a single representation.
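Here is a minimal sketch of the alignment and merging steps in Python with numpy. It substitutes the closed-form orthogonal Procrustes (least-squares) solution for the paper’s RCSLS-trained mapping, and the function names and dict-of-arrays data layout are illustrative assumptions rather than the authors’ code.

```python
import numpy as np

def align_embeddings(general_vecs, domain_vecs, shared_words):
    """Learn a linear map from the general-purpose space to the domain space.

    general_vecs, domain_vecs: dicts of word -> 1-D numpy vector.
    shared_words: words present in both vocabularies, used as anchors.
    """
    # Stack the anchor vectors, one row per shared word.
    X = np.stack([general_vecs[w] for w in shared_words])
    Y = np.stack([domain_vecs[w] for w in shared_words])
    # Orthogonal Procrustes: with U, S, Vt = svd(X.T @ Y), the matrix
    # W = U @ Vt minimizes ||X @ W - Y|| over all orthogonal W.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def merge_vocabularies(general_vecs, domain_vecs, W):
    """Combine both vocabularies into one set of vectors."""
    # Map every general-purpose vector into the domain space.
    merged = {w: v @ W for w, v in general_vecs.items()}
    # Words in both corpora now have two vectors; average them.
    for w, v in domain_vecs.items():
        merged[w] = (merged[w] + v) / 2 if w in merged else v
    return merged
```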
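And a rough numpy rendering of the RCSLS idea itself: reward each mapped vector for matching its counterpart while penalizing similarity to its nearest neighbors, which keeps close vectors close and distant ones distant after mapping. The matrix layout, the k=10 default, and restricting neighbor candidates to the paired words are simplifying assumptions; in practice, this kind of loss is minimized over W by gradient descent.

```python
import numpy as np

def rcsls_loss(X, Y, W, k=10):
    """Simplified RCSLS loss for n aligned word pairs.

    X: (n, d) general-purpose vectors; Y: (n, d) domain-specific vectors,
    with row i of X and row i of Y belonging to the same word. Rows are
    assumed L2-normalized, so a dot product equals cosine similarity.
    """
    sims = (X @ W) @ Y.T                 # (n, n) cross-space similarities
    matched = np.diag(sims)              # similarity to the true counterpart
    # Mean similarity to the k nearest neighbors in each direction:
    # high values mean crowding, which the loss penalizes.
    knn_xy = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    knn_yx = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    # Pull matched pairs together (-2 * matched) while pushing each
    # vector away from its local neighborhood hubs.
    return (-2 * matched + knn_xy + knn_yx).mean()
```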
Results: The authors tested this approach on tasks including analogy prediction and text classification on a dataset whose test set uses words somewhat differently than the training set. Models that used word vectors learned via alignment outperformed those that used conventionally fine-tuned word vectors, and the new method’s advantage grew as the domain-specific dataset shrank.
Why it matters: Machine learning engineers need tools that enable existing word representations to capture specialized knowledge. The alignment technique could help in any situation where general-purpose word vectors don’t capture the meanings at play.
We’re thinking: Open-source, pretrained word embeddings have been a boon to NLP systems. It would be great to have freely available word embeddings that captured knowledge from diverse fields like biology, law, and architecture.