Models that learn relationships between images and words are gaining a higher profile. New research shows that adversarial learning, usually a way to make models robust to deliberately misleading inputs, can boost vision-and-language performance.
What’s new: Vision-and-language models based on transformer networks have shown strong performance on tasks such as answering questions about images. Zhe Gan of Microsoft and colleagues at Microsoft and the University of Maryland improved such models via Vision-and-Language Large-scale Adversarial (VILLA) training.
Key insight: Vision-and-language models are often pretrained, for instance to fill in blanks in image captions, and then fine-tuned for a specific task such as answering questions about images. Previous work with language models showed that adversarial fine-tuning (giving the model input designed to fool it and training it not to be fooled) can increase accuracy. The team extended this idea to vision-and-language models in both pretraining and fine-tuning.
How it works: The authors worked with UNITER, which has achieved state-of-the-art performance on several vision-and-language tasks. UNITER embeds images and text separately. Then it feeds the embeddings into a BERT-like model to create a multimodal embedding.
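As a rough illustration, here's a minimal PyTorch-style sketch of that two-step design. The class name, layer sizes, and number of fusion layers are illustrative assumptions rather than the authors' code; the real UNITER also encodes token positions and region locations and initializes its fusion layers from BERT.

```python
import torch
import torch.nn as nn

class TwoStreamThenFuse(nn.Module):
    """Embed each modality separately, then fuse with a BERT-like encoder."""
    def __init__(self, vocab_size=30522, region_feat_dim=2048, hidden=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)      # word embeddings
        self.image_embed = nn.Linear(region_feat_dim, hidden)   # project detector features
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, token_ids, region_feats):
        txt = self.text_embed(token_ids)        # (batch, n_tokens, hidden)
        img = self.image_embed(region_feats)    # (batch, n_regions, hidden)
        # Concatenate the two sequences and let self-attention mix them.
        return self.fusion(torch.cat([txt, img], dim=1))

model = TwoStreamThenFuse()
tokens = torch.randint(0, 30522, (2, 16))       # toy caption tokens
regions = torch.randn(2, 36, 2048)              # toy region features from a detector
print(model(tokens, regions).shape)             # torch.Size([2, 52, 768])
```

The key design choice is the single fused sequence: after concatenation, self-attention can relate any word to any image region.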
- The authors used a variation on FreeLB, an adversarial training technique. FreeLB learns a small perturbation vector that, when added to the embeddings, is likely to fool the network, then trains the model to answer correctly regardless.
- The authors perturbed both image and text embeddings, but never both at the same time. The model’s objective was threefold: predict the correct answer using unperturbed embeddings, predict the correct answer using perturbed embeddings, and keep those predictions, and the model’s confidence in them, close to one another (a minimal code sketch of this objective appears after this list).
- They pretrained UNITER to perform masked language modeling (guessing which words are missing from a text passage, based not only on the surrounding words but, in this case, also on an accompanying image) and image-text matching (guessing whether a given text and image are paired). Pretraining drew on four large image-and-caption datasets. (Both losses are sketched in code after the list.)
- They fine-tuned and tested on several vision-and-language tasks. For instance, visual question answering required answering questions about images such as “what color are her eyes?” Visual commonsense reasoning required answering multiple-choice questions such as “why is [person4] pointing at [person1]?” and then justifying the answer by completing “I think so because…”
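The sketch below shows the threefold objective on toy data. It assumes a single-step perturbation, a tiny classifier, and a symmetric KL term as the consistency loss; these are simplifications (FreeLB, for example, takes several gradient-ascent steps and accumulates gradients), so read it as a schematic rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, n_classes = 32, 4
classifier = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, n_classes))
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
epsilon = 1e-2                                  # bound on the perturbation's size

# Stand-in for the embeddings of one modality (VILLA perturbs either the
# image embeddings or the text embeddings, never both at once).
embeddings = torch.randn(8, hidden)
labels = torch.randint(0, n_classes, (8,))

# 1) Predict the correct answer from unperturbed embeddings.
clean_logits = classifier(embeddings)
clean_loss = F.cross_entropy(clean_logits, labels)

# 2) Find a small perturbation that raises the loss (one gradient-ascent step).
delta = torch.zeros_like(embeddings, requires_grad=True)
loss_for_delta = F.cross_entropy(classifier(embeddings + delta), labels)
grad, = torch.autograd.grad(loss_for_delta, delta)
delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)

# 3) Predict correctly under the perturbation, and keep the perturbed and
#    clean predictions (and confidence in them) close via symmetric KL.
adv_logits = classifier(embeddings + delta)
adv_loss = F.cross_entropy(adv_logits, labels)
consistency = (
    F.kl_div(F.log_softmax(adv_logits, -1), F.softmax(clean_logits, -1),
             reduction="batchmean")
    + F.kl_div(F.log_softmax(clean_logits, -1), F.softmax(adv_logits, -1),
               reduction="batchmean")
)

total_loss = clean_loss + adv_loss + consistency
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```

In VILLA, the same recipe applies during both pretraining and fine-tuning, with the perturbation added in embedding space rather than to raw pixels or words.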
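For completeness, here's an equally minimal sketch of the two pretraining losses. Shapes, masking positions, and head names are illustrative assumptions, not UNITER's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 30522, 768
fused = torch.randn(2, 52, hidden)              # output of a UNITER-style encoder
mlm_head = nn.Linear(hidden, vocab_size)        # scores vocabulary words
itm_head = nn.Linear(hidden, 2)                 # matched vs. mismatched pair

# Masked language modeling: predict the hidden caption words from the fused
# image-and-text representation at the masked positions.
masked_positions = torch.tensor([[3], [7]])     # toy: one masked token per caption
masked_targets = torch.randint(0, vocab_size, (2, 1))
masked_states = torch.gather(
    fused, 1, masked_positions.unsqueeze(-1).expand(-1, -1, hidden))
mlm_loss = F.cross_entropy(mlm_head(masked_states).view(-1, vocab_size),
                           masked_targets.view(-1))

# Image-text matching: classify whether caption and image belong together,
# using the first ([CLS]-like) position of the fused sequence.
itm_labels = torch.tensor([1, 0])               # 1 = caption matches the image
itm_loss = F.cross_entropy(itm_head(fused[:, 0]), itm_labels)

pretraining_loss = mlm_loss + itm_loss
```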
Results: UNITER trained with VILLA outperformed a standard UNITER in six vision-and-language tasks. In visual question answering, UNITER with VILLA answered 73.67 percent correctly, while the plain model answered 72.91 percent correctly. In the two-stage visual commonsense reasoning task of answering a question and justifying the answer, UNITER with VILLA scored 59.75 percent, while its standard counterpart succeeded 57.76 percent of the time.
Why it matters: We understand the world through several modalities, and that makes us smarter. For instance, to describe a tree, neither an image nor a biological description is sufficient, but together they have a revealing synergy. Current models still struggle to grasp the meaning of images and language individually, but they will always be missing something until they can draw connections between them.
We’re thinking: Vision: check. Language: check. Now sound, aroma, touch . . .