On the heels of systems that generate video directly from text, new work uses text to adjust the imagery in existing videos.
What’s new: Patrick Esser and colleagues at Runway unveiled Gen-1, a system that uses a text prompt or image to modify the setting (say, from suburban yard to fiery hellscape) or style (for instance, from photorealism to claymation) of an existing video without changing its original shapes and motions. You can see examples and request access here.
Key insight: A video can be considered to have what the authors call structure (shapes and how they move) and content (the appearance of each shape including its color, lighting, and style). A video generator can learn to encode structure and content in separate embeddings. At inference, given a clip, it can replace the content embedding to produce a video with the same structure but different content.
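To make that concrete, here's a conceptual sketch in Python. The function and encoder names are hypothetical, not Runway's API; it only illustrates the idea of holding structure fixed while swapping content.

```python
# Conceptual sketch of the structure/content split; all names are hypothetical.
def edit_clip(frames, new_content, encode_structure, encode_content, generate):
    structure = [encode_structure(f) for f in frames]  # per-frame shapes and motion
    content = encode_content(new_content)              # one embedding for the whole clip
    # Keep the structure, swap the content, and re-render each frame.
    return [generate(s, content) for s in structure]
```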
How it works: Gen-1 generates video frames via a diffusion process, and the authors trained it following the typical diffusion-model training procedure: add varying amounts of noise (nearly up to 100 percent) to each training example, then train the model to remove it. To generate a video frame, the model starts with 100 percent noise and, guided by a text prompt or image, removes it over several steps. The system used three embeddings: (i) a frame embedding for each video frame (to which noise was added and removed), (ii) a structure embedding for each video frame, and (iii) a content embedding for the entire clip. The dataset comprised 6.4 million eight-frame videos and 240 million images, which the system treated as single-frame videos.
- During training, given an input video, the encoder component of a pretrained autoencoder produced a frame embedding for each video frame. The authors added noise to each frame embedding, using the same amount for every frame in a given clip.
- Given a video frame, a pretrained MiDaS extracted a depth map, an image that records how far each point in the scene is from the camera, capturing the frame's shapes but not their appearance; in other words, its structure. The autoencoder's encoder embedded the depth map to produce a structure embedding for each frame.
- Given one video frame selected at random, a pretrained CLIP, which maps corresponding text and images to similar embeddings in a shared space, produced a content embedding. The authors used a single content embedding for the entire video, rather than one for each frame, to ensure that it didn't determine the structure of individual frames.
- Given the frame embeddings (with added noise), structure embeddings, and single content embedding, a modified U-Net learned to estimate the added noise (see the sketch after this list).
- At inference, CLIP embedded a text prompt or image, and this embedding replaced the content embedding. For each video frame to be generated, the system started with a frame embedding of pure (100 percent) noise. Given the noisy frame embeddings, the structure embeddings, and CLIP's embedding, the U-Net removed the noise over several steps.
- Given the denoised frame embeddings, the autoencoder's decoder constructed the output video frames.
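Below is a minimal, self-contained sketch of this training-and-generation loop in PyTorch. Tiny linear layers stand in for the autoencoder, MiDaS, CLIP, and the modified U-Net, and the embedding size, step count, and DDIM-style sampler are illustrative assumptions rather than details from the paper.

```python
# Sketch of the training and generation procedure described above.
# Stand-in networks and the noise schedule are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAMES, EMB, T = 8, 64, 50                       # frames per clip, embedding dim, diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention schedule

frame_encoder = nn.Linear(EMB, EMB)              # stand-in: autoencoder's encoder
frame_decoder = nn.Linear(EMB, EMB)              # stand-in: autoencoder's decoder
depth_encoder = nn.Linear(EMB, EMB)              # stand-in: embeds MiDaS depth maps (structure)
clip_encoder = nn.Linear(EMB, EMB)               # stand-in: CLIP (content)

class Denoiser(nn.Module):
    """Stand-in for the modified U-Net: predicts the noise added to frame embeddings."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB * 3, 256), nn.SiLU(), nn.Linear(256, EMB))

    def forward(self, noisy_frames, structure, content):
        content = content.expand(noisy_frames.shape[0], -1)  # one content embedding, broadcast to all frames
        return self.net(torch.cat([noisy_frames, structure, content], dim=-1))

denoiser = Denoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(raw_frames, depth_maps, random_frame):
    """Add noise at a random level (the same level for every frame in the clip), then predict it."""
    frames = frame_encoder(raw_frames)           # per-frame embeddings
    structure = depth_encoder(depth_maps)        # per-frame structure embeddings
    content = clip_encoder(random_frame)         # single content embedding for the clip
    t = torch.randint(0, T, (1,))
    noise = torch.randn_like(frames)
    noisy = alphas_bar[t].sqrt() * frames + (1 - alphas_bar[t]).sqrt() * noise
    loss = F.mse_loss(denoiser(noisy, structure, content), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def generate(depth_maps, new_content_embedding):
    """Start from pure noise; denoise guided by the clip's structure and the new content."""
    structure = depth_encoder(depth_maps)
    x = torch.randn(FRAMES, EMB)                 # 100-percent-noise frame embeddings
    for t in reversed(range(T)):
        pred_noise = denoiser(x, structure, new_content_embedding)
        x0 = (x - (1 - alphas_bar[t]).sqrt() * pred_noise) / alphas_bar[t].sqrt()
        if t > 0:
            x = alphas_bar[t - 1].sqrt() * x0 + (1 - alphas_bar[t - 1]).sqrt() * pred_noise
        else:
            x = x0
    return frame_decoder(x)                      # decode embeddings back into video frames

# Toy usage: random tensors stand in for real frames, depth maps, and a prompt embedding.
training_step(torch.randn(FRAMES, EMB), torch.randn(FRAMES, EMB), torch.randn(1, EMB))
edited_clip = generate(torch.randn(FRAMES, EMB), clip_encoder(torch.randn(1, EMB)))
```

The detail to notice is that the per-frame structure embeddings pass through unchanged; only the single content embedding differs between training and generation.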
Results: Five human evaluators compared Gen-1 to SDEdit, which alters each frame individually. Across 35 prompts, the evaluators judged Gen-1's output to reflect the text better 75 percent of the time.
Why it matters: Using different embeddings to represent different aspects of data gives Gen-1 control over the surface characteristics of shapes in a frame without affecting the shapes themselves. The same idea may be useful in manipulating other media types. For instance, MusicLM extracted separate embeddings for large-scale composition and instrumental details. A Gen-1-type system might impose one musical passage’s composition over another’s instruments.
We’re thinking: Gen-1 doesn’t allow changes in objects in a frame, such as switching the type of flower in a vase, but it does a great job of retaining the shapes of objects while changing the overall scenery. The authors put this capability to especially imaginative use when they transformed books standing upright on a table into urban skyscrapers.