Diffusion transformers learn faster when they can look at embeddings generated by a pretrained model like DINOv2.
What’s new: Sihyun Yu and colleagues at Korea Advanced Institute of Science and Technology, Korea University, New York University, and Scaled Foundations (a startup that builds AI for robotics) proposed Representation Alignment (REPA), a loss term that accelerates training of transformer-based diffusion models.
Key insight: Diffusion models learn to remove noise from images to which noise was added (and, at inference, they start with pure noise to generate a fresh image). This process can be divided into two parts: learning to (i) embed the noisy image and (ii) estimate the noise from the embedding. One way to accelerate learning is to add a loss term that encourages the diffusion model to produce embeddings that are similar to those produced by a pretrained embedding model. The diffusion model can learn to estimate the noise faster if it doesn’t need to learn how to embed an image from scratch.
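As a rough sketch of the idea (the notation is illustrative, not drawn from the paper), the combined objective is a weighted sum of the usual denoising loss and an alignment term:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{denoise}}
  + \lambda \, \mathcal{L}_{\text{align}},
\qquad
\mathcal{L}_{\text{align}}
  = -\,\mathbb{E}\!\left[\operatorname{sim}\!\big(h_{\phi}\big(f_{\theta}(x_t)\big),\, g(x)\big)\right]
```

Here f_θ(x_t) is an intermediate embedding the diffusion model computes from the noisy input x_t, h_φ is a small trainable projection network, g(x) is the frozen pretrained encoder’s embedding of the clean image, sim is cosine similarity, and λ weights the alignment term.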
How it works: The authors modified DiT-XL/2 and SiT-XL/2, two transformer-based latent diffusion models (a class of diffusion models that subtracts noise from embeddings rather than from images). They trained the models to generate images like those in ImageNet. In the process, the modified models learned to produce embeddings similar to those produced by a pretrained DINOv2.
- The authors used Stable Diffusion VAE’s pretrained encoder to embed each training image.
- Given the embedding with noise added, the diffusion model learned to remove the noise according to the usual loss term.
- It also learned according to the REPA loss. Specifically, it learned to maximize the cosine similarity between a specially processed version of its eighth-layer embedding of the noisy input and the embedding that a pretrained DINOv2 produced from the clean image. To process its eighth-layer embedding for the REPA loss, the diffusion model fed the embedding to a vanilla neural network (see the sketch after this list).
- At inference, given pure noise, the model removed it over several steps to produce an image embedding. Stable Diffusion VAE’s decoder converted the embedding into an image.
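The sketch below shows one way the REPA term could be computed in PyTorch. The names (ProjectionHead, repa_loss), tensor shapes, and projection architecture are illustrative assumptions, not the authors’ implementation; the projection head stands in for the vanilla neural network mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Small MLP (hypothetical architecture) that maps diffusion-model
    embeddings into the DINOv2 feature space."""

    def __init__(self, diffusion_dim: int, dino_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(diffusion_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dino_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def repa_loss(hidden_states: torch.Tensor,  # (batch, tokens, diffusion_dim): eighth-layer embeddings of the noisy input
              dino_features: torch.Tensor,  # (batch, tokens, dino_dim): frozen DINOv2 embeddings of the clean image
              head: ProjectionHead) -> torch.Tensor:
    """Negative cosine similarity between projected diffusion embeddings
    and DINOv2 embeddings, averaged over tokens and the batch."""
    projected = head(hidden_states)
    similarity = F.cosine_similarity(projected, dino_features.detach(), dim=-1)
    return -similarity.mean()


# During training, this term would be added to the usual denoising loss,
# scaled by a weighting hyperparameter:
#   total_loss = denoising_loss + repa_weight * repa_loss(hidden_states, dino_features, head)
```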
Results: The modified DiT-XL/2 learned significantly faster than the unmodified version.
- In 400,000 training steps, the modified model reached 12.3 Fréchet inception distance (FID, which measures the similarity between generated and real images; lower is better), while the unmodified version reached 19.5 FID.
- The gap in learning speed persisted as training went on. The modified DiT-XL/2 took 850,000 training steps to reach 9.6 FID, while the unmodified version took 7 million steps to reach the same number.
- Experiments with modified and unmodified versions of SiT-XL/2 yielded similar results.
- Trained to convergence, the modified models outperformed the unmodified versions. For instance, the modified SiT-XL/2 achieved 5.9 FID (after 4 million training steps), while the unmodified version achieved 8.3 FID (after 7 million training steps).
Why it matters: Diffusion models and self-supervised models like DINOv2 have fundamentally different training objectives: One produces embeddings for the purpose of image generation, while the other’s embeddings are used for tasks like classification and semantic segmentation. Consequently, they learn different aspects of data. This work proposes a novel way to combine these approaches to produce more generally useful embeddings.
We’re thinking: It turns out that the REPA modification enabled diffusion models to produce embeddings better suited not only to diffusion but also to image classification and segmentation. A similar approach could lead to a more holistic framework for learning image representations.