Faster, Cheaper Video Generation

Pyramidal Flow Matching, a cost-cutting method for training video generators

Figure: Temporal pyramids in rows (left) and position encoding in the space-time pyramid, as used in the pyramidal flow matching process.

Researchers devised a way to cut the cost of training video generators. They used it to build a competitive open source text-to-video model and promised to release the training code.

What’s new: Yang Jin and colleagues at Peking University, Kuaishou Technology, and Beijing University of Posts and Telecommunications proposed Pyramidal Flow Matching, a method that reduced the amount of processing required to train video generators. They offer the code and a pretrained model that’s free for noncommercial uses and for commercial uses by developers who make less than $1 million in annual revenue.

Key insight: Models that generate output by starting with noise and removing it over several steps, such as diffusion and flow matching models, typically learn by removing noise from an embedding to which noise was added. Starting with a downsampled (smaller) version of the embedding and then upsampling (enlarging) it gradually throughout the process, hitting the full size near the end, saves processing during training and inference.
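For intuition about the savings, note that a denoiser's per-step cost grows with the size of the embedding it processes, so steps run at reduced resolution are much cheaper than steps at full resolution. The back-of-the-envelope Python sketch below makes this concrete; the three-stage schedule and the assumption that cost scales linearly with latent area are illustrative, not the paper's configuration.

```python
# Rough, illustrative accounting of why pyramidal denoising is cheaper.
# Assumes per-step cost is roughly proportional to the number of latent
# "pixels" processed (roughly true for convolutional backbones;
# attention-heavy models scale even more steeply with token count).
# The three-stage, equal-step schedule below is a made-up example,
# not the paper's exact setup.

full_h, full_w, steps = 64, 64, 30

# Baseline: every denoising step runs at full resolution.
baseline_cost = steps * full_h * full_w

# Pyramidal: early steps run at 1/4 and 1/2 resolution; only the
# final third of the steps run at full resolution.
stages = [(full_h // 4, full_w // 4, 10),
          (full_h // 2, full_w // 2, 10),
          (full_h,      full_w,      10)]
pyramid_cost = sum(h * w * n for h, w, n in stages)

print(f"relative cost: {pyramid_cost / baseline_cost:.2f}")  # ~0.44
```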

How it works: The authors’ system comprises a pretrained SD3 Medium image generator, an image autoencoder, and two pretrained text encoders: T5 and CLIP. They pretrained the autoencoder to reconstruct images and sequences of video frames, and trained SD3 Medium to remove noise from an embedding of eight video frames given both text embeddings and embeddings of previous sequences of frames. The training sets included WebVid-10M, OpenVid-1M, and Open-Sora Plan. The authors modified the typical process of removing noise from image embeddings in two ways: spatially and temporally.

  • Spatially: Given an embedding of eight video frames, SD3 Medium starts by removing noise from a heavily downsampled (very small) version of the embedding. After a number of noise-removal steps, the system enlarges the embedding and adds further noise. It repeats these steps until SD3 Medium finishes removing noise from the full-size embedding (see the sketch after this list).
  • Temporally: When it’s removing noise from an embedding of eight frames, SD3 Medium receives downsampled versions of the previous embeddings it has generated. These embeddings start at the size of the current embedding and get progressively smaller for earlier frames. (Earlier embeddings can be smaller because the further back they are, the less closely related they are to the current embedding.)
  • At inference: Given a prompt, T5 and CLIP produce text embeddings. Given the text embeddings, an embedding of pure noise, and previously denoised embeddings, SD3 Medium removes noise. Given the denoised embeddings from SD3 Medium, the autoencoder’s decoder turns them into a video. 
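The sketch below, in PyTorch, shows how the spatial and temporal steps above might fit together in one generation loop for a single eight-frame clip. It is schematic: `denoise_steps` is a dummy stand-in for SD3 Medium's flow-matching updates, the renoising blend and the halving-per-clip history schedule are assumptions, and latents are treated as single spatial maps rather than full space-time tensors.

```python
import torch
import torch.nn.functional as F

def upsample(z, scale=2):
    """Enlarge a latent (B, C, H, W) spatially."""
    return F.interpolate(z, scale_factor=scale, mode="bilinear", align_corners=False)

def renoise(z, noise_level=0.5):
    """Re-inject noise after upsampling so the next stage has noise to remove.
    The blending rule here is a placeholder, not the paper's schedule."""
    return (1 - noise_level) * z + noise_level * torch.randn_like(z)

def denoise_steps(z, text_emb, history, num_steps):
    """Stand-in for SD3 Medium's flow-matching updates at one pyramid stage.
    A real model would predict a velocity field conditioned on text_emb and
    the downsampled history; here we just shrink z toward zero as a dummy."""
    for _ in range(num_steps):
        z = z - 0.1 * z  # placeholder update
    return z

def history_pyramid(history, target_hw):
    """Downsample earlier clip embeddings progressively: the further back a
    clip is, the smaller it gets (halving per step back, as an assumption)."""
    out = []
    for age, h in enumerate(reversed(history)):  # most recent clip first
        hw = max(target_hw // (2 ** age), 4)
        out.append(F.interpolate(h, size=(hw, hw), mode="bilinear", align_corners=False))
    return out

def generate_clip(text_emb, history, full_hw=64, channels=16, stages=3, steps_per_stage=10):
    hw = full_hw // (2 ** (stages - 1))
    z = torch.randn(1, channels, hw, hw)          # start from pure noise, small
    for s in range(stages):
        cond = history_pyramid(history, z.shape[-1])
        z = denoise_steps(z, text_emb, cond, steps_per_stage)
        if s < stages - 1:                        # enlarge, then add noise back
            z = renoise(upsample(z))
    return z  # full-size denoised latent, ready for the autoencoder's decoder

# Usage: generate two clips, the second conditioned on the first.
text_emb = torch.randn(1, 77, 768)                # placeholder text embedding
clip1 = generate_clip(text_emb, history=[])
clip2 = generate_clip(text_emb, history=[clip1])
```

In the authors' system, a loop like this would run once per eight-frame clip, with each finished latent appended to the history to condition the next clip before the autoencoder's decoder turns the latents into video.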

Results: The authors compared their model to other open and closed models using VBench, a suite of benchmarks for comparing the quality of generated video, and also conducted a survey of human preferences. On VBench, their model outperformed other open models but slightly underperformed the best proprietary models, such as Kling. Human evaluators rated their model superior to Open-Sora 1.2 for aesthetics, motion, and adherence to prompts, and better than Kling for aesthetics and adherence to prompts (but not motion). Furthermore, running on an Nvidia A100 GPU, their model took 20,700 hours to learn to generate videos up to 241 frames long. Running on a faster Nvidia H100 GPU, Open-Sora 1.2 took 37,800 hours to learn to generate videos up to 97 frames long.

Why it matters: Video generation is a burgeoning field that consumes enormous amounts of processing. A simple way to reduce processing could help it scale to more users.

We’re thinking: Hollywood is interested in video generation. Studios reportedly are considering using the technology in pre- and post-production. Innovations that make it more compute-efficient will bring it closer to production.
