Efficient Subject Consistency For Stable Diffusion

Published Jul 16, 2024 · 3 min read

Published in mid-2022, DreamBooth enables Stable Diffusion to depict variations on a given subject: say, a particular dog and the same dog with angel wings or wearing a chef’s hat. But fine-tuning the model for each new subject takes a lot of processing. An alternative approach achieves comparable results with far less computation.

What's new: Nataniel Ruiz and colleagues at Google proposed HyperDreamBooth, a compute-efficient way to customize text-to-image diffusion models to produce images of a specific subject (in this work, a specific face). 

Key insight: The original DreamBooth approach fine-tunes Stable Diffusion to generate an image from a prompt that includes a unique identifier (for instance, “a [V] dog” or “a [V] face,” where [V] is a rarely used token that, in the fine-tuning dataset, appears in captions of images that depict a particular subject). Given a prompt that includes the identifier and describes a specific setting (such as, “a [V] dog wearing a chef’s hat”), the fine-tuned model generates the subject in that setting. To reduce the computation required, prior to fine-tuning, HyperDreamBooth trains an auxiliary network (called a hypernetwork) to predict the change in the image generator’s weights necessary to generate a particular sort of subject. This prediction provides a starting point for the image generator to produce images of a specific subject. 
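
To make the key insight concrete, here is a minimal, hypothetical sketch (not the authors' code) of a hypernetwork that looks at an embedding of a subject image and predicts a change in one generator weight matrix, which then serves as the starting point for subject-specific fine-tuning. Module names and sizes are toy stand-ins.

```python
import torch
import torch.nn as nn

GEN_DIM = 64  # stand-in size for one Stable Diffusion weight matrix

class ToyHyperNetwork(nn.Module):
    """Maps an image embedding to a predicted change in generator weights."""
    def __init__(self, embed_dim=128, out_dim=GEN_DIM * GEN_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, image_embedding):
        return self.net(image_embedding).view(GEN_DIM, GEN_DIM)

hypernet = ToyHyperNetwork()
image_embedding = torch.randn(128)            # embedding of the subject photo
base_weight = torch.randn(GEN_DIM, GEN_DIM)   # frozen pretrained generator weight

delta = hypernet(image_embedding)             # predicted change in weights
personalized_weight = base_weight + delta     # starting point for brief fine-tuning
```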

How it works: The authors trained a hypernetwork to predict a change in weights for Stable Diffusion to produce images of faces. Then they fine-tuned this change in weights to produce a specific face. The training dataset comprised 15,000 face images from CelebA-HQ.

  • Following LoRA, a fine-tuning method that reduces the number of parameters that need to be updated, the authors modified Stable Diffusion by adding trainable weight matrices to each attention layer. (After training, these matrices were added to Stable Diffusion’s weights; that is, they specified changes in the weights, not the weights themselves.) Where LoRA multiplies two trainable weight matrices per layer, the authors approximated each LoRA matrix using two smaller matrices, one of which was frozen, the other trainable (see the first sketch after this list). This method further reduced the number of parameters to be learned by an order of magnitude beyond the reduction achieved by vanilla LoRA.
  • For each image in the dataset, the authors fine-tuned the LoRA-style matrices, minimizing the difference between an image generated by Stable Diffusion using those matrices and the actual image. Then they saved the matrices’ weights.
  • The authors trained the hypernetwork (made up of a small two-layer transformer and a ViT-Huge image encoder) to compute values for the trainable LoRA-style matrices. Given an image of a face, the hypernetwork learned to (i) minimize the difference between the original image and the system’s output when using weights computed by the hypernetwork and (ii) match the corresponding weights for the LoRA-style matrices saved in the previous step (see the second sketch after this list).
  • At inference, given an image of the subject, the hypernetwork predicted an initial change in Stable Diffusion’s weights.
  • To further improve Stable Diffusion’s faithfulness to the image of the subject, the authors fine-tuned the hypernetwork’s predicted change in weights over 40 iterations using a single image, as sketched below. This step minimized the difference between the generated and actual images. They added the fine-tuned change to Stable Diffusion’s weights.
  • Given a prompt that asked for the subject in a new setting, the modified diffusion model produced an image that both depicted the subject and illustrated the prompt.
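
Here is a rough sketch of the LoRA-style layers described in the first bullet above. It assumes that each LoRA factor is split into a frozen random matrix and a much smaller trainable matrix; the exact factorization and ranks are illustrative guesses, not the paper’s values.

```python
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank weight delta with mostly frozen factors."""
    def __init__(self, dim, rank=4, inner=2):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen

        # Standard LoRA trains A (rank x dim) and B (dim x rank) directly. Here
        # each is split into a frozen random factor and a smaller trainable
        # factor, further shrinking the number of learned parameters.
        self.A_frozen = nn.Parameter(torch.randn(rank, inner) / inner, requires_grad=False)
        self.A_train = nn.Parameter(torch.zeros(inner, dim))
        self.B_train = nn.Parameter(torch.zeros(dim, inner))
        self.B_frozen = nn.Parameter(torch.randn(inner, rank) / inner, requires_grad=False)

    def forward(self, x):
        A = self.A_frozen @ self.A_train  # (rank, dim)
        B = self.B_train @ self.B_frozen  # (dim, rank)
        delta_w = B @ A                   # (dim, dim) change in the layer's weights
        return self.base(x) + x @ delta_w.T
```

Because only A_train and B_train receive gradients, the state that must be learned and stored per subject is a small fraction of a standard LoRA update.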
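
And here is a toy sketch of the hypernetwork’s two training signals and of the inference-time refinement step described above. The generate_fn callable, the loss weighting, and all shapes are stand-ins for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hypernetwork_loss(predicted_delta, image, saved_delta, generate_fn, alpha=1.0):
    """Reconstruction term plus a term matching the saved LoRA-style weights."""
    reconstruction = F.mse_loss(generate_fn(predicted_delta), image)
    weight_match = F.mse_loss(predicted_delta, saved_delta)
    return reconstruction + alpha * weight_match

def fast_finetune(initial_delta, image, generate_fn, steps=40, lr=1e-3):
    """Refine the hypernetwork's predicted weight change on a single subject image."""
    delta = initial_delta.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(generate_fn(delta), image)
        loss.backward()
        optimizer.step()
    return delta.detach()

# Toy usage with stand-in tensors; a real system would decode an image from
# Stable Diffusion modified by the weight change rather than apply a dummy generator.
dim = 8
generate_fn = lambda d: d @ torch.ones(dim, 1)
image = torch.randn(dim, 1)
refined_delta = fast_finetune(torch.zeros(dim, dim), image, generate_fn)
```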

Results: The authors modified 25 face images according to prompts such as “An Instagram selfie of a [V] face” and “A Pixar character of a [V] face” using HyperDreamBooth, DreamBooth, and Textual Inversion, which learns an embedding of a subject given a few example images and uses the embedding to generate the same subject in new settings. They asked human judges (five per image) which of the generated images they preferred. The judges preferred HyperDreamBooth to DreamBooth 64.8 percent of the time. They preferred HyperDreamBooth to Textual Inversion 70.6 percent of the time. The authors’ method worked in roughly 20 seconds, 25 times faster than DreamBooth and 125 times faster than Textual Inversion.

Why it matters: Using hypernetworks to generate weights for a target network is not new. Neither is using LoRA for fine-tuning. Combining the two is. In this case, the combination results in a generative image-editing system (the hypernetwork plus the modified Stable Diffusion) that delivers useful functionality much faster than its predecessors. 

We're thinking: We wonder how generalizable this approach is in practice. If the authors had trained their hypernetwork on a wider variety of images, would it have worked with other types of subject matter besides faces?
