Two new models show a surprisingly sharp sense of the relationship between words and images.

What’s new: OpenAI, the for-profit research lab, announced a pair of models that have produced impressive results in multimodal learning: DALL·E, which generates images in response to written prompts, and Contrastive Language-Image Pretraining (CLIP), a zero-shot image classifier. The company published a paper that describes CLIP in detail; a similar DALL·E paper is forthcoming.

How they work: Both models were trained on text-image pairs.

  • DALL·E (whose name honors both the painter Salvador Dalí and Pixar’s WALL·E) is a decoder-only transformer model. OpenAI trained it on captioned images taken from the internet. Given a sequence of tokens that represents text and/or an image, it predicts the next token, appends that prediction to the sequence, and repeats, generating an image token by token (a minimal sketch of this loop follows the list below).
  • This allows DALL·E to generate images from a wide range of text prompts, including fanciful images that aren’t represented in its training data, such as “an armchair in the shape of an avocado.”
  • CLIP uses a text encoder (a modified transformer) and an image encoder (a vision transformer) trained on 400 million image-text pairs drawn from the internet. Using a contrastive loss function adapted from ConVIRT, it learned to predict which of nearly 33,000 text snippets matched a given image (see the contrastive-loss sketch after this list).
  • At inference, CLIP accepts a list of candidate classes in the form of “a photo of a {object}.” Then, given an image, it returns the most likely class from the list (see the zero-shot sketch after this list). Since CLIP can predict which text, among any number and variety of texts, best matches an image, it can perform zero-shot classification on any image-classification task.
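
To make DALL·E’s token-by-token generation concrete, here is a minimal PyTorch sketch of an autoregressive sampling loop. The `model`, the tokenized text prompt, and the number of image tokens are placeholders rather than OpenAI’s released code; the real system also decodes the sampled image tokens back into pixels with a separate decoder.

```python
import torch

# `model` is a stand-in for a decoder-only transformer that returns
# next-token logits; it is a placeholder, not OpenAI's released model.
@torch.no_grad()
def generate_image_tokens(model, text_tokens, num_image_tokens=1024):
    sequence = text_tokens.clone()                 # (1, t) text prompt tokens
    for _ in range(num_image_tokens):
        logits = model(sequence)[:, -1, :]         # logits for the next token
        probs = logits.softmax(dim=-1)
        next_token = torch.multinomial(probs, 1)   # sample one token
        sequence = torch.cat([sequence, next_token], dim=1)
    # Return only the newly generated image tokens.
    return sequence[:, text_tokens.size(1):]
```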
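
CLIP’s contrastive objective treats each image’s own caption as the positive example and every other caption in the batch as a negative. Below is a minimal PyTorch sketch of that symmetric loss, assuming precomputed image and text embeddings and a fixed temperature (the actual model learns its temperature during training).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)   # (n, d)
    text_features = F.normalize(text_features, dim=-1)     # (n, d)

    # Similarity between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Classify the right caption for each image and the right image for each caption.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```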
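
Zero-shot classification then reduces to comparing one image embedding against an embedding of each prompted class name. In this sketch, `encode_image` and `encode_text` are hypothetical stand-ins for CLIP’s two encoders, not OpenAI’s API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    # Wrap each class name in the prompt template described above.
    prompts = [f"a photo of a {name}" for name in class_names]

    with torch.no_grad():
        image_emb = F.normalize(encode_image(image), dim=-1)   # (1, d)
        text_emb = F.normalize(encode_text(prompts), dim=-1)   # (k, d)

    # Cosine similarity between the image and every candidate prompt.
    similarities = (image_emb @ text_emb.t()).squeeze(0)       # (k,)
    probs = similarities.softmax(dim=-1)
    best = probs.argmax().item()
    return class_names[best], probs[best].item()
```

Called with a photo and candidate labels such as “golden retriever” and “avocado armchair,” this would return whichever prompt the model scores as the closest match, along with its probability.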

Yes, but: Neither model is immune to goofs. Asked to produce a pentagonal clock, for instance, DALL·E rendered some timepieces with six or seven sides. CLIP, meanwhile, has trouble counting objects in an image and differentiating subclasses like car brands or flower species.

Behind the news: The new models build on earlier research at the intersection of words and images. A seminal 2016 paper from the University of Michigan and the Max Planck Institute for Informatics showed that GANs could generate images from text embeddings. Other work has resulted in models that render images from text, among them Generative Engine and Text to Image. Judging by the examples OpenAI has published so far, however, DALL·E seems to produce more accurate depictions and to navigate a startling variety of prompts with flair.

Why it matters: As OpenAI chief scientist (and former postdoc in Andrew’s lab) Ilya Sutskever recently wrote in The Batch, humans understand concepts not only through words but through visual images. Plus, combining language and vision techniques could overcome computer vision’s need for large, well-labeled datasets.

We’re thinking: If we ever build a neural network that exhibits a sense of wonder, we’ll call it GOLL·E.
