The ability to generate realistic images without waiting would unlock applications from engineering to entertainment and beyond. New work takes a step in that direction.
What’s new: Dominic Rampas and colleagues at Technische Hochschule Ingolstadt and Wand Technologies released Paella, a system that uses a process similar to diffusion to produce Stable Diffusion-quality images much more quickly.
Key insight: An image generator’s speed depends on the number of steps it must take to produce an image: The fewer the steps, the speedier the generator. A diffusion model learns to remove varying amounts of noise from each training example; at inference, starting from pure noise, it produces an image by subtracting noise iteratively over a few hundred steps. A latent diffusion model reduces the count to around a hundred by removing noise from a vector that represents the image rather than from the image itself. Representing the image as a selection of tokens from a predefined list, rather than as a continuous vector, makes it possible to do the same job in still fewer steps.
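To see why step count dominates latency, here is a schematic sampling loop: each step is one forward pass through the denoising network, so total time scales roughly linearly with the number of steps. This is a generic sketch, not any particular model; the `denoiser` network and the update rule are hypothetical stand-ins.

```python
import torch

def generate(denoiser, shape=(1, 3, 256, 256), num_steps=250):
    """Schematic iterative denoising: latency ~ num_steps x one forward pass."""
    x = torch.randn(shape)                    # start from pure noise
    for step in range(num_steps):
        predicted_noise = denoiser(x, step)   # one network call per step
        x = x - predicted_noise / num_steps   # simplified update; real schedules differ
    return x
```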
How it works: Like a diffusion model, Paella learned to remove varying amounts of noise from tokens that represented an image and then produced a new image from noisy tokens. It was trained on 600 million image-text pairs from LAION-Aesthetics.
- Given an image of 256x256 pixels, a pretrained encoder-decoder based on a convolutional neural network represented the image as 256 tokens selected from a list of 8,192 tokens it had learned during pretraining.
- The authors replaced a random fraction of the image’s tokens with tokens chosen from the list at random, the discrete counterpart of adding noise to an example when training a diffusion model.
- Given the image’s text description, CLIP, which maps matching text and images to nearby points in a shared embedding space, generated an embedding of the description. (The authors used CLIP’s image embeddings only in ablation experiments.)
- Given the text embedding and the tokens with random replacements, a U-Net (a convolutional neural network) learned to generate all the original tokens.
- They repeated the foregoing steps 12 times, each time replacing a smaller fraction of the generated tokens. This iterative procedure trained the U-Net, guided by the tokens it had already generated, to remove less noise at each successive step.
- At inference, given a text prompt, CLIP generated an embedding. Given a random selection of 256 tokens, the U-Net regenerated all the tokens over 12 steps. Given the tokens, the decoder generated an image. (A simplified sketch of the noising and sampling procedure follows this list.)
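Below is a minimal PyTorch sketch of the token-noising and sampling ideas described above. It is not the authors’ code: the `unet` and `decoder` networks are hypothetical stand-ins, the linear re-noising schedule and greedy token selection are simplifying assumptions, and only the shapes (256 tokens per image, a vocabulary of 8,192, 12 steps) follow the description.

```python
# Minimal sketch of Paella-style token noising and sampling (not the authors' code).
import torch

VOCAB_SIZE = 8_192   # size of the encoder-decoder's learned token list
NUM_TOKENS = 256     # tokens per 256x256-pixel image
STEPS = 12           # refinement steps reported by the authors


def add_token_noise(tokens: torch.Tensor, noise_fraction: float) -> torch.Tensor:
    """Replace a random fraction of tokens with tokens drawn uniformly from the
    vocabulary, the discrete analogue of adding Gaussian noise in a diffusion model."""
    mask = torch.rand_like(tokens, dtype=torch.float) < noise_fraction
    random_tokens = torch.randint(0, VOCAB_SIZE, tokens.shape)
    return torch.where(mask, random_tokens, tokens)


def training_step(unet, tokens, text_emb):
    """One training example: noise the tokens, then ask the network to recover the originals."""
    noise_fraction = torch.rand(1).item()              # vary the amount of "noise"
    noisy = add_token_noise(tokens, noise_fraction)
    logits = unet(noisy, text_emb)                     # [batch, NUM_TOKENS, VOCAB_SIZE]
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens.reshape(-1)
    )


@torch.no_grad()
def sample(unet, decoder, text_emb, batch=1):
    """Inference: start from fully random tokens, refine them over STEPS iterations
    while re-noising a shrinking fraction each time, then decode tokens to pixels."""
    tokens = torch.randint(0, VOCAB_SIZE, (batch, NUM_TOKENS))
    for step in range(STEPS):
        logits = unet(tokens, text_emb)
        tokens = logits.argmax(dim=-1)                 # regenerate all tokens (greedy, for simplicity)
        noise_fraction = 1.0 - (step + 1) / STEPS      # shrinking re-noise schedule (assumed linear)
        tokens = add_token_noise(tokens, noise_fraction)
    return decoder(tokens)                             # final image


if __name__ == "__main__":
    # Stand-in networks so the sketch runs end to end; the real ones are a U-Net
    # conditioned on a CLIP text embedding and a convolutional decoder.
    dummy_unet = lambda toks, emb: torch.randn(toks.shape[0], NUM_TOKENS, VOCAB_SIZE)
    dummy_decoder = lambda toks: torch.randn(toks.shape[0], 3, 256, 256)
    image = sample(dummy_unet, dummy_decoder, torch.randn(1, 512))
    print(image.shape)  # torch.Size([1, 3, 256, 256])
```

The design point the sketch illustrates: because each step regenerates all 256 discrete tokens at once, a dozen passes through the network suffice, rather than the hundreds a pixel-space diffusion model needs.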
Results: The authors evaluated Paella (573 million parameters) according to Fréchet inception distance (FID), which measures the difference between the distributions of original and generated images (lower is better). Paella achieved 26.7 FID on MS-COCO. Stable Diffusion v1.4 (860 million parameters) trained on 2.3 billion images achieved 25.40 FID — somewhat better, but significantly slower. Running on an Nvidia A100 GPU, Paella took 0.5 seconds to produce a 256x256-pixel image in eight steps, while Stable Diffusion took 3.2 seconds. (The authors reported FID for 12 steps but speed for eight steps.)
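For reference, FID compares the statistics of Inception-network features computed on real and generated images. With feature means μ_r, μ_g and covariances Σ_r, Σ_g, the standard definition is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```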
Why it matters: Efforts to accelerate diffusion have focused on distilling models such as Stable Diffusion. Instead, the authors rethought the architecture to reduce the number of diffusion steps.
We’re thinking: The authors trained Paella on 64 Nvidia A100s for two weeks using computation supplied by Stability AI, the firm behind Stable Diffusion. It’s great to see partnerships between academia and industry that give academic researchers access to computation.