Diffusion models usually take many noise-removal steps to produce an image, which makes inference slow. Existing ways to reduce the number of steps make the resulting systems less effective. Researchers devised a streamlined approach that doesn’t sacrifice output quality.
What’s new: Kevin Frans and colleagues at UC Berkeley introduced shortcut models that learn to take larger noise-removal steps and thus require fewer steps to generate an image.
Key insight: At inference, a scheduler like Euler can enable a model to take larger steps than those it learned during training, but this approach yields worse image quality. Alternatively, distillation, in which a student model learns to remove in fewer steps the amount of noise a teacher model removes over several steps, offers better performance at the cost of more cumbersome development. Training the model directly to take bigger steps, each equivalent to multiple smaller ones, enables it to maintain high performance while taking fewer steps.
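To see why simply enlarging the step size hurts, consider plain Euler integration of a simple velocity field. The toy sketch below (our illustration, not the authors’ code) shows that one step twice as large is a coarser approximation than two consecutive half-size steps, which is exactly the gap shortcut models are trained to close.

```python
import numpy as np

# Toy illustration: integrating dx/dt = -x with Euler steps. A single large step
# is a worse approximation than two half-size steps, which is why simply enlarging
# a scheduler's step size degrades samples unless the model is trained for it.

def euler_step(x, t, dt, velocity):
    return x + dt * velocity(x, t)

velocity = lambda x, t: -x          # stand-in for a learned velocity field
x0, t0, d = 1.0, 0.0, 0.5

one_big_step = euler_step(x0, t0, 2 * d, velocity)                                   # 0.0
two_small_steps = euler_step(euler_step(x0, t0, d, velocity), t0 + d, d, velocity)   # 0.25
exact = x0 * np.exp(-2 * d)                                                          # ~0.37

print(one_big_step, two_small_steps, exact)
```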
How it works: The authors trained DiT-B, a diffusion transformer, to generate images like those in CelebA-HQ (celebrity faces) and ImageNet-256 (various subjects, size 256x256).
- The loss function included terms for flow matching and self-consistency. The flow matching term encouraged the model to learn to remove noise. The self-consistency term encouraged the model to remove, in a single large step, the same amount of noise it removes in two consecutive smaller steps (see the sketch after this list).
- Initially the model learned to combine two small steps into one step twice as large. Applying the same idea to the larger steps yielded steps 4x, 8x, and so on, up to 128x the base size.
- At inference, the user specified the number of steps, and the model took steps of the corresponding size; for instance, four steps each cover one-quarter of the path from noise to image.
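The sketch below reconstructs the two loss terms from this description. It is our approximation, not the released code: the model signature model(x_t, t, d), the straight-line interpolation between noise and data, and averaging two small steps to form the big-step target are assumptions based on the summary above.

```python
import torch

def shortcut_losses(model, x0, x1, d):
    """model(x_t, t, d) predicts the average velocity for a step of size d.
    x0: noise, x1: data, both of shape [batch, dim]; d: step sizes of shape [batch, 1]."""
    t = torch.rand(x0.shape[0], 1)                 # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                    # point on the straight noise-to-data path

    # Flow-matching term: at step size 0, predict the instantaneous velocity x1 - x0.
    v_pred = model(x_t, t, torch.zeros_like(d))
    flow_matching = ((v_pred - (x1 - x0)) ** 2).mean()

    # Self-consistency term: one step of size 2d should match two chained steps of size d.
    with torch.no_grad():
        s1 = model(x_t, t, d)                      # first small step
        x_mid = x_t + d * s1
        s2 = model(x_mid, t + d, d)                # second small step
        target = (s1 + s2) / 2                     # average velocity over the 2d span
    s_big = model(x_t, t, 2 * d)
    self_consistency = ((s_big - target) ** 2).mean()

    return flow_matching + self_consistency

# Stand-in model for demonstration: a tiny MLP over the concatenated (x, t, d).
dim = 8
mlp = torch.nn.Sequential(torch.nn.Linear(dim + 2, 64), torch.nn.GELU(), torch.nn.Linear(64, dim))
model = lambda x, t, d: mlp(torch.cat([x, t, d], dim=-1))
loss = shortcut_losses(model, torch.randn(4, dim), torch.randn(4, dim), torch.full((4, 1), 1 / 128))
```

Under these assumptions, asking for n steps at inference amounts to setting d = 1/n and repeatedly updating x = x + d * model(x, t, d) until t reaches 1.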
Results: The authors compared their model using 1, 4, or 128 steps to alternatives that were trained via various methods including several variants of distillation. They measured the results using Fréchet inception distance (FID), which assesses how closely generated images resemble real-world images (lower is better; a sketch of computing FID follows these results).
- On both CelebA-HQ and ImageNet-256, their model achieved the best performance when it took four steps. For example, on CelebA-HQ with four steps, the shortcut model achieved 13.8 FID, while the next-best method, Reflow (one of the distillation variants), achieved 18.4 FID.
- When it took one step, it achieved the second-best result, behind progressive distillation, which trained a series of student models, each learning to remove in one step the amount of noise its teacher removed over multiple steps.
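For context on the metric, FID compares statistics of features that a pretrained Inception network extracts from real and generated images. The snippet below shows one common way to compute it using the torchmetrics library; it is illustrative (the image batches are random placeholders) and not necessarily how the authors evaluated their models.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # pip install "torchmetrics[image]"

# Placeholder batches; in practice these would be dataset images and model samples.
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # 2048-dimensional Inception features
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(fid.compute())                          # lower scores mean closer to the real distribution
```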
Why it matters: Generating images by diffusion is typically costly, and previous approaches to cutting the cost have compromised performance, added development expense, or both. This method achieves high performance at relatively low cost.
We’re thinking: As diffusion models continue to become cheaper and faster, we expect to see applications blossom!