Contrastive loss functions make it possible to produce good embeddings without labeled data. A twist on this idea makes even more useful embeddings.
What’s new: Vlad Sobal and colleagues at Meta, New York University, Brown University, Genentech, and the Canadian Institute for Advanced Research introduced X-Sample contrastive loss (X-CLR), a self-supervised loss function that enables vision models to learn embeddings that capture similarities and differences among examples with greater subtlety.
Key insight: Contrastive loss functions like SimCLR encourage a model to produce embeddings of images of, say, a cat, a dog, and a dump truck that are all equally dissimilar. But, of course, cats and dogs are more similar to each other than either is to a dump truck. Instead of marking pairs of examples as simply similar or dissimilar, X-CLR assigns graded similarity scores, so a model can learn to produce embeddings whose similarities match those scores.
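The difference can be illustrated with toy training targets for a batch containing a cat, a dog, and a dump truck (the numbers below are illustrative, not taken from the paper): SimCLR's target for a given image is effectively one-hot over the batch, while X-CLR's target is a graded distribution.

```python
import torch

# Targets for the "cat" image against the batch [cat, dog, dump truck].
# SimCLR: only the image's own positive pair counts; dog and dump truck
# are treated as equally negative.
simclr_target = torch.tensor([1.0, 0.0, 0.0])

# X-CLR: graded similarity scores (hypothetical values) let the model learn
# that the cat should land closer to the dog than to the dump truck.
xclr_target = torch.softmax(torch.tensor([1.0, 0.6, 0.1]), dim=0)
```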
How it works: The authors used X-CLR to train an embedding model on Conceptual Captions datasets of image-text pairs scraped from the web: CC-3M (3 million pairs) and CC-12M (12 million pairs). The model was similar to CLIP, except the text encoder was a sentence transformer pretrained on sentence pairs, and the vision encoder was a ResNet-50 pretrained on ImageNet.
- The sentence transformer embedded text captions for all examples. The system computed similarity scores according to cosine similarity between the text embeddings.
- Similarly, the ResNet-50 computed image embeddings, and the system computed cosine similarity scores between them.
- The authors froze the sentence transformer and used the text similarity scores as labels in the loss function, which minimized the difference between the similarity scores of the text embeddings and the corresponding similarity scores of the image embeddings (a sketch of this loss appears after this list).
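A minimal sketch of this kind of soft-target contrastive loss in PyTorch. The function name, temperature value, and exact normalization are assumptions for illustration; the paper's implementation may differ in details such as temperature scaling and how each example's similarity to itself is handled.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_emb, caption_emb, temperature=0.1):
    """Sketch of an X-CLR-style loss: image-image similarities are trained
    to match caption-caption similarities from a frozen text encoder.

    image_emb:   (N, D) embeddings from the trainable vision encoder
    caption_emb: (N, D) embeddings from the frozen sentence transformer
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(caption_emb, dim=-1)

    image_sims = img @ img.T / temperature    # predicted similarity graph
    caption_sims = txt @ txt.T / temperature  # target similarity graph

    # Soft targets: a distribution over the batch for each example,
    # detached so no gradient flows into the frozen text encoder.
    targets = F.softmax(caption_sims, dim=-1).detach()

    # Cross-entropy between the target and predicted distributions.
    return -(targets * F.log_softmax(image_sims, dim=-1)).sum(dim=-1).mean()
```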
Results: Systems trained using X-CLR outperformed competitors in ImageNet classification, especially when less training data was available. (The authors followed CLIP’s method of classification: They computed the similarity between an image embedding and text embeddings of all classes. The image’s predicted class was the one whose text embedding was most similar to the image embedding; a sketch of this procedure follows the results below.)
- The authors compared a system trained using X-CLR, one trained using SimCLR, and CLIP. After training on the CC-3M dataset, the X-CLR system achieved 58.2 percent accuracy on ImageNet, while the SimCLR model achieved 57.0 percent and CLIP achieved 41.0 percent.
- Training on CC-12M resulted in smaller differences: X-CLR achieved 59.4 percent accuracy, SimCLR achieved 58.9 percent, and CLIP achieved 58.8 percent.
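For reference, a rough sketch of the CLIP-style classification procedure described above. The function name and the idea of embedding class-name prompts are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def classify_by_similarity(image_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the image.

    image_emb:       (D,) embedding of a single image
    class_text_embs: (C, D) text embeddings, one per class name
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), class_text_embs, dim=-1)
    return sims.argmax().item()  # index of the predicted class
```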
Why it matters: Contrastive loss functions are very useful, but the similar/dissimilar dichotomy leaves important nuances unaccounted for. Like CLIP, X-CLR takes advantage of both images and their captions for self-supervised learning. However, CLIP learns to recognize image-text pairs as similar or dissimilar, while X-CLR matches image-image pairs using captions as a similarity signal that’s continuous rather than discrete.
We’re thinking: Reality is not black and white. Allowing for shades of gray makes for better modeling.