OpenAI’s new video generator raises the bar for detail and realism in generated videos — but the company released few details about how it built the system.
What’s new: OpenAI introduced Sora, a text-to-video model that can produce extraordinarily convincing, high-definition videos up to one minute long. You can see examples here.
What we know: Sora is a latent diffusion model that learned to transform noise into videos using an encoder-decoder and a transformer. The system was trained on videos up to 1,920x1,080 pixels and up to one minute long.
- Following the approach used for DALL·E 3, OpenAI trained a video captioning model to enhance the captions of videos in the dataset, adding descriptive details.
- Given a video’s frames divided into patches, the encoder learned to embed the patches and further compress them along the time dimension, producing tokens. Given the tokens, the decoder learned to reconstruct the video.
- Given tokens that had been corrupted by noise, along with an enhanced caption, the transformer learned to reconstruct the original, noise-free tokens.
- At inference, a separate transformer enhanced input prompts to be more descriptive. Given the enhanced prompt and noisy tokens, Sora’s transformer removed the noise. Given the denoised tokens, the decoder produced a video. (Hypothetical code sketches of these steps follow below.)
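The sketch below illustrates one way the encoder-decoder step described above could work: a single 3D convolution that divides a clip into spacetime patches, embeds them, and compresses along the time dimension, with a mirrored decoder mapping tokens back to pixels. This is a minimal illustration under stated assumptions, not OpenAI’s implementation; the class name, patch size, temporal stride, and use of 3D convolutions are guesses, since OpenAI published no architectural details.

```python
# Minimal sketch of a spacetime-patch autoencoder. All names, shapes, and
# hyperparameters are illustrative assumptions, not OpenAI's published design.
import torch
import torch.nn as nn

class SpacetimeAutoencoder(nn.Module):
    def __init__(self, channels=3, dim=512, patch=16, t_stride=4):
        super().__init__()
        # One 3D convolution cuts the video into patches and embeds them,
        # downsampling space by `patch` and time by `t_stride`.
        self.encoder = nn.Conv3d(channels, dim,
                                 kernel_size=(t_stride, patch, patch),
                                 stride=(t_stride, patch, patch))
        # The decoder mirrors the encoder, mapping tokens back to pixels.
        self.decoder = nn.ConvTranspose3d(dim, channels,
                                          kernel_size=(t_stride, patch, patch),
                                          stride=(t_stride, patch, patch))

    def encode(self, video):                      # video: (batch, 3, frames, height, width)
        latent = self.encoder(video)              # (batch, dim, frames/4, height/16, width/16)
        return latent.flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)

    def decode(self, tokens, grid):               # grid: latent (frames, height, width)
        latent = tokens.transpose(1, 2).reshape(tokens.shape[0], -1, *grid)
        return self.decoder(latent)               # reconstructed pixels

video = torch.randn(1, 3, 16, 256, 256)           # toy clip, far smaller than 1080p
model = SpacetimeAutoencoder()
tokens = model.encode(video)                      # (1, 1024, 512)
reconstruction = model.decode(tokens, grid=(4, 16, 16))
```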
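The inference loop in the last bullet might look roughly like the following: a transformer iteratively removes noise from latent tokens, conditioned on the enhanced prompt, and the decoder turns the clean tokens into a video. The update rule here is a deliberately simplified stand-in for a proper DDPM/DDIM step, and `denoiser`, `text_encoder`, and `decoder` are hypothetical components; treat this as the general diffusion-sampling pattern, not Sora’s actual procedure.

```python
# Rough sketch of diffusion sampling with a transformer denoiser. The update
# rule is simplified, and all component names are hypothetical.
import torch

@torch.no_grad()
def generate_video(denoiser, text_encoder, decoder, prompt,
                   num_tokens=1024, dim=512, steps=50):
    cond = text_encoder(prompt)                   # embed the enhanced prompt
    tokens = torch.randn(1, num_tokens, dim)      # start from pure noise
    for t in reversed(range(steps)):              # walk the noise schedule backward
        timestep = torch.full((1,), t)
        predicted_noise = denoiser(tokens, timestep, cond)  # transformer predicts the noise
        tokens = tokens - predicted_noise / steps           # crude stand-in for a real scheduler step
    return decoder(tokens)                        # map denoised tokens back to pixels
```

A full implementation would follow a real noise schedule and, typically, blend conditional and unconditional predictions via classifier-free guidance.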
What we don’t know: OpenAI is sharing the technology with outside researchers charged with evaluating its safety, The New York Times reported. Meanwhile, the company published neither quantitative results nor comparisons to previous work. Also missing are detailed descriptions of model architectures and training methods. (Some of the results suggest that Sora was trained not only to remove noise from tokens, but also to predict future tokens and generate tokens in between other tokens.) No information is available about the source(s) of the dataset or how it may have been curated.
Qualitative results: Sora’s demonstration output is impressive enough to have sparked arguments over the degree to which Sora “understands” physics. A photorealistic scene in which “a stylish woman walks down a Tokyo street filled with warm glowing neon” shows a crowded shopping district full of believable pedestrians. The woman’s sunglasses reflect the neon signs, as does the wet street. Halfway through its one-minute length, the perspective cuts (unprompted and presumably unedited) to a consistent, detailed close-up of her face. In another clip, two toy pirate ships bob and pitch on a frothing sea of coffee, surrounded by a cup’s rim. The two ships maintain their distinctiveness and independence, their flags flutter in the same direction, and the liquid churns fantastically but realistically. However, as OpenAI acknowledges, the outputs on display are not free of flaws. For instance, after camera motion shifts the coffee cup’s rim out of the frame, the rim later re-emerges from the waves. (Incidentally, the Sora demos are even more fun with soundtracks generated by Eleven Labs.)
Why it matters: While we’ve seen transformers for video generation, diffusion models for video generation, and diffusion transformers for images, this is one of the earliest implementations of diffusion transformers for video generation (a recent paper described another). Sora shows that diffusion transformers work well for video.
We’re thinking: Did Sora learn a world model? Learning to predict the future state of an environment, perhaps given certain actions within that environment, is not the same as learning to depict that environment in pixels, just as the ability to predict that a joke will make someone smile is different from the ability to draw a picture of that smile. Given Sora’s ability to extrapolate scenes into the future, it does seem to have some understanding of the world. Its world model is also clearly flawed (for instance, it will synthesize inconsistent three-dimensional structures), but it’s a promising step toward AI systems that comprehend the 3D world through video.