Open Video Gen Closes the Gap

Tencent releases HunyuanVideo, an open source model rivaling commercial video generators

A GIF with scenes of a man at a café, a working robot, a ghost in a mirror, and a speeding truck.

The gap is narrowing between closed and open models for video generation.

What’s new: Tencent released HunyuanVideo, a video generator that delivers performance competitive with commercial models. The model is available as open code and open weights to developers who have fewer than 100 million monthly users and are based outside the EU, UK, and South Korea.

How it works: HunyuanVideo comprises a convolutional video encoder-decoder, two text encoders, a time-step encoder, and a transformer. The team trained the model in stages (first the encoder-decoder, then the system as a whole) using undisclosed datasets before fine-tuning the system.
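As a rough illustration of how those components might fit together, here is a minimal PyTorch-style sketch. Every class, method, and dimension below is a hypothetical placeholder chosen for clarity, not Tencent’s actual architecture or code, and the real model is vastly larger.

```python
import torch
import torch.nn as nn


class HunyuanVideoSketch(nn.Module):
    """Hypothetical skeleton of the pipeline described above (not Tencent's implementation)."""

    def __init__(self, latent_dim=16, llm_dim=4096, clip_dim=768, model_dim=512):
        super().__init__()
        # Convolutional encoder-decoder: video pixels <-> compact latent "embedding".
        self.encode_video = nn.Conv3d(3, latent_dim, kernel_size=3, padding=1)   # stand-in
        self.decode_video = nn.Conv3d(latent_dim, 3, kernel_size=3, padding=1)   # stand-in
        # Projections for the outputs of the two pretrained text encoders
        # (a detailed embedding from an LLM such as Hunyuan-Large, a general one from CLIP).
        self.proj_detailed = nn.Linear(llm_dim, model_dim)
        self.proj_general = nn.Linear(clip_dim, model_dim)
        # Vanilla feed-forward network that embeds the diffusion time step.
        self.embed_timestep = nn.Sequential(
            nn.Linear(1, model_dim), nn.SiLU(), nn.Linear(model_dim, model_dim))
        # Transformer that predicts a noise-free latent from the noisy latent plus conditioning.
        self.to_tokens = nn.Linear(latent_dim, model_dim)
        self.to_latent = nn.Linear(model_dim, latent_dim)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True), num_layers=4)

    def denoise(self, noisy_latent, detailed_emb, general_emb, t):
        # noisy_latent: (batch, latent_dim, frames, height, width)
        b, c, f, h, w = noisy_latent.shape
        video_tokens = self.to_tokens(noisy_latent.flatten(2).transpose(1, 2))  # (b, f*h*w, model_dim)
        cond = torch.stack([                                                    # (b, 3, model_dim)
            self.proj_detailed(detailed_emb),
            self.proj_general(general_emb),
            self.embed_timestep(t.view(-1, 1).float()),
        ], dim=1)
        out = self.transformer(torch.cat([cond, video_tokens], dim=1))
        out = out[:, cond.shape[1]:]                                            # drop conditioning tokens
        return self.to_latent(out).transpose(1, 2).reshape(b, c, f, h, w)
```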

  • The team trained the encoder-decoder to reconstruct images and videos.
  • They trained the system to remove noise from noisy embeddings of videos, starting with low-resolution images, then higher-resolution images, then short, low-resolution videos, and progressively moving to longer, higher-resolution videos.
  • Given a video, the encoder embedded it. Given a text description of the video, a pretrained Hunyuan-Large model produced a detailed embedding of the text, and a pretrained CLIP model produced a general embedding. A vanilla neural network embedded the current time step. Given the video embedding with added noise, the two text embeddings, and the time-step embedding, the transformer learned to generate a noise-free embedding (a sketch of this denoising step appears after this list).
  • The team fine-tuned the system to remove noise from roughly 1 million video examples that humans had curated and annotated to select those with the most aesthetically pleasing and compelling motion.
  • At inference, given pure noise, a text description, and the current time step, the text encoders embed the text and the vanilla neural network embeds the time step. Given the noise, the text embeddings, and the time-step embedding, the transformer generates a noise-free embedding, and the decoder turns it into video (see the sampling sketch below).
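The following is a minimal sketch of that training objective and sampling loop, reusing the hypothetical HunyuanVideoSketch class from the earlier snippet. The linear noise schedule, mean-squared-error loss, step count, and update rule are illustrative assumptions; the article does not disclose Tencent’s exact recipe.

```python
import torch
import torch.nn.functional as F


def training_step(model, video, detailed_emb, general_emb):
    """One denoising training step: corrupt a video latent, predict the clean latent."""
    latent = model.encode_video(video)                    # embed the video
    t = torch.rand(video.shape[0], device=video.device)   # random time step in [0, 1]
    noise = torch.randn_like(latent)
    t_ = t.view(-1, 1, 1, 1, 1)
    noisy_latent = (1 - t_) * latent + t_ * noise         # interpolate toward pure noise (assumed schedule)
    pred_clean = model.denoise(noisy_latent, detailed_emb, general_emb, t)
    return F.mse_loss(pred_clean, latent)                 # penalize distance to the clean latent


@torch.no_grad()
def generate(model, detailed_emb, general_emb, latent_shape, steps=50):
    """Start from pure noise and repeatedly denoise, then decode the latent into video."""
    latent = torch.randn(latent_shape)                    # pure noise
    for i in reversed(range(1, steps + 1)):
        t = torch.full((latent_shape[0],), i / steps)
        pred_clean = model.denoise(latent, detailed_emb, general_emb, t)
        latent = latent + (pred_clean - latent) / i       # step partway toward the prediction
    return model.decode_video(latent)
```

Here detailed_emb and general_emb stand for the outputs of the pretrained text encoders, which would be computed outside these functions.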

Results: 60 people judged videos generated by HunyuanVideo, Gen-3, and Luma 1.6 in response to 1,533 text prompts. The judges preferred HunyuanVideo’s output overall. Examining the output in more detail, they preferred HunyuanVideo’s quality of motion but Gen-3’s visual quality.

Behind the news: OpenAI’s announcement of Sora in February (the model itself was released as this article was in production) marked a new wave of video generators that quickly came to include Google Veo, Meta Movie Gen, Runway Gen-3 Alpha, and Stability AI Stable Video Diffusion. Open source alternatives like Mochi continue to fall short of publicly available commercial video generators.

Why it matters: Research in image generation has advanced at a rapid pace, while progress in video generation has been slower. One reason may be the cost of computation, which is especially high for video. The growing availability of pretrained, open source video generators could accelerate progress by relieving researchers of the need to pretrain models and enabling them to experiment with fine-tuning and other post-training for specific tasks and applications.

We’re thinking: Tencent’s open source models are great contributions to research and development in video generation. It’s exciting to see labs in China contributing high-performance models to the open source community!
