Familiar Faces, Synthetic Soundtracks

Meta debuts Movie Gen for text-to-video generation with consistent characters

A demonstration of video editing through text input, altering a runner’s background and costume.

Meta upped the ante for text-to-video generation with new systems that produce consistent characters and matching soundtracks.

What’s new: Meta presented Movie Gen, a series of four systems that generate videos, depict consistent characters, alter generated imagery, and add matching sound effects and music. Movie Gen will be available on Instagram in 2025. Meanwhile, you can view and listen to examples here. The team explains how the model was built in an extensive 92-page paper.

Generated videos: Movie Gen Video can output 256 frames (up to 16 seconds at 16 frames per second) at 1920x1080-pixel resolution. It includes a convolutional neural network autoencoder, transformer, and multiple embedding models.

  • Movie Gen Video produces imagery by flow matching, a technique related to diffusion. It learned to remove noise from noisy versions of images and videos, given matching text descriptions, drawn from 1 billion image-text pairs and 100 million video-text pairs. At inference, it starts with pure noise and generates detailed imagery according to a text prompt (see the sketch after this list).
  • The system concatenates multiple text embeddings to combine the strengths of different embedding models. UL2 was trained on text-only data, so its embeddings may provide “reasoning abilities,” according to the authors. Long-prompt MetaCLIP was trained to produce similar text and image representations, so its embeddings might be useful for “cross-modal generation.” ByT5 produces embeddings of individual text elements such as letters, numbers, and symbols; the system uses it when a prompt requests text within a clip. 
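The paper’s training and sampling procedures are far more involved, but a minimal sketch of conditional flow matching, assuming a hypothetical velocity-predicting model with the signature model(x_t, t, text_emb), conveys the core idea:

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """One simplified training step: the model learns the velocity that
    carries a noisy sample toward the clean video latents x1."""
    x0 = torch.randn_like(x1)                             # pure-noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                            # point on the straight path from noise to data
    target_velocity = x1 - x0                             # constant velocity along that path
    pred_velocity = model(xt, t, text_emb)                # hypothetical model signature
    return ((pred_velocity - target_velocity) ** 2).mean()

@torch.no_grad()
def sample(model, shape, text_emb, steps=50):
    """Start from pure noise and Euler-integrate the learned velocity field."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], *([1] * (len(shape) - 1))), i * dt)
        x = x + model(x, t, text_emb) * dt
    return x
```

In the actual system, x1 would be latents produced by the convolutional autoencoder and text_emb the concatenated UL2, Long-prompt MetaCLIP, and ByT5 embeddings described in the next item.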

Consistent characters: Given an image of a face, a fine-tuned version of Movie Gen Video generates a video that depicts a person with that face. 

  • To gather a training dataset for this capability, the team filtered Movie Gen Video’s pretraining dataset for clips that show a single face and whose consecutive frames are similar to one another. They built video-face examples by pairing each clip with a frame selected from the clip at random. To train the system, the team fed it text, the clip with added noise, and the single-frame face. It learned to remove the noise (see the sketch after this list).
  • Trained on this data alone, the system generated videos in which the person always faces the camera. To expand the variety of poses, the team further trained it on examples in which the faces from the previous step were replaced with generated versions that showed alternate poses and facial expressions.
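A rough sketch of how one such training example might be assembled (hypothetical helper code, not Meta’s implementation), assuming clips have already passed the single-face and frame-similarity filters:

```python
import random
import torch

def build_face_conditioned_example(clip, caption):
    """Pair a clip with one of its own frames as the identity reference.

    clip: tensor of shape (num_frames, C, H, W) that passed the filters above.
    Returns (caption, reference_face, noisy_clip, clip); the model is trained
    to recover the clean clip from the noisy one, conditioned on the caption
    and the reference face.
    """
    ref_idx = random.randrange(clip.shape[0])      # random frame as the face reference
    reference_face = clip[ref_idx]

    t = random.random()                            # noise level in [0, 1]
    noise = torch.randn_like(clip)
    noisy_clip = (1 - t) * noise + t * clip        # corrupted version of the clip
    return caption, reference_face, noisy_clip, clip
```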

Altered clips: The team modified Movie Gen Video’s autoencoder to accept an embedding of an alteration — say, changing the background or adding an object. They trained the system to alter videos in three stages:

  • First, they trained the system, given a starting image and an instruction to alter it, to produce an altered image.
  • They further trained the system to produce altered clips, using two generated datasets of before-and-after examples. (i) Given a random frame and an instruction to, say, replace a person with a cat, the system altered the frame accordingly. The team then applied an identical series of randomly selected augmentations to both frames, creating two matching clips, one featuring a person, the other featuring a cat. Given the initial clip and the instruction, the system learned to generate the altered clip (see the sketch after this list). (ii) The team used DINO and SAM 2 to segment clips. Given an unsegmented clip and an instruction such as “mark <object> with <color>,” the system learned to generate the segmented clip.
  • Finally, they trained the system to restore altered clips to their original content. They built a dataset by taking a ground-truth clip and using their system to generate an altered version according to an instruction. Then Llama 3 rewrote the instruction to modify the altered clip to match the original. Given the altered clip and the instruction, the system learned to generate the original clip.
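One way to picture the first of those before-and-after datasets (item (i) above): apply the same randomly chosen augmentation to the original frame and the edited frame, so the two resulting clips differ only in the requested edit. This is an illustrative sketch with random crops standing in for the paper’s augmentations:

```python
import random
import torch

def make_paired_clips(original_frame, edited_frame, num_frames=16):
    """Expand a (frame, edited frame) pair into a (clip, edited clip) pair.

    original_frame, edited_frame: tensors of shape (C, H, W).
    Each output frame gets the same random crop applied to both inputs,
    so the clips stay aligned except for the instructed change.
    """
    h, w = original_frame.shape[-2:]
    crop_h, crop_w = 7 * h // 8, 7 * w // 8
    src_frames, dst_frames = [], []
    for _ in range(num_frames):
        top = random.randint(0, h - crop_h)
        left = random.randint(0, w - crop_w)
        window = (slice(None), slice(top, top + crop_h), slice(left, left + crop_w))
        src_frames.append(original_frame[window])
        dst_frames.append(edited_frame[window])
    return torch.stack(src_frames), torch.stack(dst_frames)
```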

Synthetic soundtracks: Given a text description, a system called Movie Gen Audio generates sound effects and instrumental music for video clips up to 30 seconds long. It includes a DACVAE audio encoder (which encodes sounds that come before and/or after the target audio), a Long-prompt MetaCLIP video encoder, a T5 text encoder, a vanilla neural network that encodes the current time step, and a transformer.

  • Movie Gen Audio learned to remove noise from noisy versions of the audio tracks of 1 million captioned videos. At inference, it starts with pure noise and generates up to 30 seconds of audio at once.
  • At inference, it can also extend audio beyond 30 seconds. Given the last n seconds of audio, the associated portion of the video, and a text description, it generates the next 30 - n seconds (see the sketch below).
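A sketch of that extension loop, treating audio and video as per-second sequences and wrapping the model in a hypothetical generate_segment function:

```python
def generate_long_soundtrack(generate_segment, video, prompt,
                             total_seconds, segment_seconds=30, overlap_seconds=5):
    """Stitch together a soundtrack longer than one 30-second window.

    generate_segment(video_window, prompt, audio_context) is a hypothetical wrapper
    around the audio model; it returns `segment_seconds` of audio conditioned on the
    trailing audio_context (None for the first window).
    """
    audio = generate_segment(video[:segment_seconds], prompt, audio_context=None)
    while len(audio) < total_seconds:
        start = len(audio) - overlap_seconds               # last n seconds as context
        context = audio[start:]
        window = video[start:start + segment_seconds]
        segment = generate_segment(window, prompt, audio_context=context)
        audio = audio + segment[overlap_seconds:]          # keep only the new 30 - n seconds
    return audio[:total_seconds]
```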

Results: Overall, Movie Gen performed roughly as well as or better than competing systems in human evaluations of overall quality and a number of specific qualities (such as “realness”). Evaluators rated their preferences for clips from Movie Gen or a competitor. The team reported the results as net win rate (win percentage minus loss percentage), which ranges from -100 percent to 100 percent; a score above zero means that a system won more comparisons than it lost.
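For example, the net win rate works out as follows (assuming ties count toward the total number of comparisons):

```python
def net_win_rate(wins, losses, ties=0):
    """Win percentage minus loss percentage, ranging from -100 to 100."""
    total = wins + losses + ties
    return 100.0 * (wins - losses) / total

# 50 wins, 30 losses, 20 ties out of 100 comparisons -> 50% - 30% = 20% net win rate.
print(net_win_rate(50, 30, 20))  # 20.0
```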

  • For overall video quality, Movie Gen achieved a net win rate of 35.02 percent versus Runway Gen3, 8.23 percent versus Sora (based on the prompts and generated clips available on OpenAI’s website), and 3.87 percent versus Kling 1.5.
  • Generating clips of specific characters, Movie Gen achieved a net win rate of 64.74 percent versus ID-Animator, the state of the art for this capability.
  • Generating soundtracks for videos from the SReal SFX dataset, Movie Gen Audio achieved a net win rate between 32 percent and 85 percent compared to various video-to-audio models.
  • Altering videos in the TGVE+ dataset, Movie Gen beat all competitors more than 70 percent of the time. 

Why it matters: With Movie Gen, the table stakes for video generation rise to include consistent characters, soundtracks, and various video-to-video alterations. The 92-page paper is a valuable resource for builders of video generation systems, explaining in detail how the team filtered data, structured models, and trained them to achieve good results.

We’re thinking: Meta has a great track record of publishing both model weights and papers that describe how the models were built. Kudos to the Movie Gen team for publishing the details of this work!
