A novel twist on self-supervised learning aims to improve on earlier methods by helping vision models learn how parts of an image relate to the whole.
What’s new: Mahmoud Assran and colleagues at Meta, McGill University, Mila, and New York University developed a vision pretraining technique that’s designed to address weaknesses in typical masked image modeling and contrastive learning approaches. They call it Image-based Joint-Embedding Predictive Architecture (I-JEPA).
Key insight: Masked image modeling trains models to reconstruct hidden or noisy patches of an image. This encourages models to learn fine details of training images at the expense of larger features. Contrastive approaches, on the other hand, train models to produce similar embeddings for distorted or augmented versions of the same image. This encourages models to learn larger features, but reliance on hand-crafted augmentations such as zooming and cropping biases models toward those particular variations rather than the wider variety they’re likely to encounter in the wild. I-JEPA combines the two approaches: the model learns to predict embeddings of masked regions, each made up of many contiguous patches, from the surrounding unmasked patches. This balances learning of low- and high-level features.
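To make the contrast concrete, here’s a minimal toy sketch in PyTorch. It’s our own illustration rather than the authors’ code: linear layers stand in for the encoders, decoder, and predictor, and single random vectors stand in for image patches. The point is only that masked image modeling regresses raw pixels, while I-JEPA regresses embeddings produced by a separate target encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins (hypothetical): linear layers instead of vision transformers.
patch_dim, embed_dim = 768, 128
encoder = nn.Linear(patch_dim, embed_dim)         # online/context encoder
target_encoder = nn.Linear(patch_dim, embed_dim)  # separate target encoder (I-JEPA style)
pixel_decoder = nn.Linear(embed_dim, patch_dim)   # pixel decoder (masked image modeling style)
predictor = nn.Linear(embed_dim, embed_dim)       # predictor (I-JEPA style)

visible_patch = torch.randn(1, patch_dim)  # a patch the model can see
hidden_patch = torch.randn(1, patch_dim)   # a masked patch it must account for

# Masked image modeling: regress the hidden patch's raw pixels, which rewards fine detail.
mim_loss = F.mse_loss(pixel_decoder(encoder(visible_patch)), hidden_patch)

# I-JEPA: regress the hidden patch's embedding instead, which rewards larger-scale structure.
with torch.no_grad():
    target = target_encoder(hidden_patch)
ijepa_loss = F.mse_loss(predictor(encoder(visible_patch)), target)
```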
How it works: I-JEPA used three components: (i) a target encoder embedded an image’s target region, (ii) a context encoder embedded the surrounding area, and (iii) a smaller predictor network, given the context embedding, tried to produce an embedding close to the target encoder’s output. All three components were transformers, though other architectures would serve. They were pretrained jointly on ImageNet-1k (one pretraining step is sketched in code after the list below).
- Given an image, the system split it into non-overlapping patches.
- It randomly selected 4 (potentially overlapping) rectangular target regions, each of which was made up of contiguous patches covering 15 percent to 20 percent of the image. The target encoder produced embeddings for the target regions.
- The system randomly chose a context region (a square crop containing 85 percent to 100 percent of the image). It masked any patches in the context region that overlapped with the target regions. Given the masked context region, the context encoder produced a position-aware embedding for each remaining, unmasked patch.
- Given the context embeddings plus placeholder embeddings that marked the positions of a target region’s masked patches, the predictor produced an embedding for each patch in that target region.
- For each patch in each target region, the system minimized the difference between the target encoder’s embedding and the predictor’s embedding.
- The authors froze the target encoder, added a linear classifier on top of it, and trained the classifier on labels for just 1 percent of ImageNet-1k (roughly 12 labeled images per class).
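As promised above, here is one pretraining step rendered as a hedged code sketch. It’s a toy simplification, not Meta’s released implementation: shallow stand-in transformers replace the ViT encoders, a single hard-coded index range stands in for the four randomly sampled target regions, and weight-update details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the three components (the paper uses ViTs such as ViT-H/14).
def make_encoder(dim=128, depth=2):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

dim, num_patches = 128, 196                        # e.g., a 14x14 grid of patches
patch_embed = nn.Linear(16 * 16 * 3, dim)          # flattened-patch projection (simplification)
pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
context_encoder, target_encoder = make_encoder(), make_encoder()
predictor = make_encoder(depth=1)                  # the smaller predictor network
mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # placeholder for masked target patches

image_patches = torch.randn(2, num_patches, 16 * 16 * 3)  # a batch of patchified images
tokens = patch_embed(image_patches) + pos_embed

# Region sampling, reduced to fixed index sets: the paper samples 4 rectangular target
# regions (15 to 20 percent of the image each) and a large context crop (85 to 100 percent),
# then removes context patches that overlap the targets.
target_idx = torch.arange(60, 90)                  # one stand-in target region
context_idx = torch.tensor([i for i in range(num_patches) if i not in set(target_idx.tolist())])

# Target embeddings: the target encoder sees the image; its outputs are read off at the
# target region's patches. (How the target encoder's weights are updated is omitted here.)
with torch.no_grad():
    targets = target_encoder(tokens)[:, target_idx]

# Context embeddings: the context encoder sees only the unmasked patches.
context = context_encoder(tokens[:, context_idx])

# The predictor gets the context embeddings plus positional placeholders for the target
# patches and produces an embedding for each target patch.
queries = mask_token.expand(2, len(target_idx), dim) + pos_embed[:, target_idx]
preds = predictor(torch.cat([context, queries], dim=1))[:, -len(target_idx):]

# Minimize the patch-wise difference between predicted and target embeddings.
loss = F.mse_loss(preds, targets)
loss.backward()
```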
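The final step, the low-shot linear probe, continues the same toy sketch (reusing target_encoder, patch_embed, pos_embed, and the imports above); random tensors stand in for the 1 percent of ImageNet-1k the authors actually used.

```python
# Only the linear probe's parameters are optimized; the encoder runs under no_grad, so it stays fixed.
num_classes = 1000
probe = nn.Linear(dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

labeled_patches = torch.randn(8, num_patches, 16 * 16 * 3)  # stand-in labeled images
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():  # frozen encoder; average-pool patch embeddings into one feature per image
    features = target_encoder(patch_embed(labeled_patches) + pos_embed).mean(dim=1)

probe_loss = F.cross_entropy(probe(features), labels)
probe_loss.backward()
optimizer.step()
```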
Results: An I-JEPA classifier that used ViT-H/14 encoders achieved 73.3 percent accuracy after about 2,500 GPU-hours of pretraining. A classifier trained on top of a ViT-B/16 base model that had been pretrained for 5,000 GPU-hours using the iBOT method, which relies on hand-crafted augmentations, achieved 69.7 percent accuracy. MAE, a masked modeling rival based on ViT-H/14, achieved 71.5 percent accuracy but required over 10,000 GPU-hours of pretraining.
Why it matters: In deep learning for computer vision, there’s a tension between learning details (a specialty of masked image modeling approaches) and larger-scale features (a strength of contrastive methods). I-JEPA gives models more context for learning both details and the high-level features in the training set.
We’re thinking: Given a picture of a jungle, I-JEPA would see both the forest and the trees!