Generative adversarial networks make amazingly true-to-life pictures, but their output has largely been limited to still images — until now. Get ready for generated videos.
What’s new: A team from DeepMind offers Dual Video Discriminator GAN (DVD-GAN), a network that produces eerily lifelike videos out of thin air.
Key insight: DVD-GAN generates video with both realistic detail and realistic motion. Aidan Clark, Jeff Donahue, and Karen Simonyan accomplish these twin goals by dedicating a separate adversarial discriminator to each.
How it works: DVD-GAN modifies BigGAN, a state-of-the-art architecture for generating single images, to produce a coherent series of frames. It includes a generator that creates frames, a spatial discriminator that makes sure individual frames look good, and a temporal discriminator that makes sure successive frames fit together. As in any GAN, the discriminators try to distinguish real videos from generated ones, while the generator tries to fool the discriminators.
- A recurrent layer transforms input noise into features that feed the generator. It learns to adjust those features incrementally from frame to frame, ensuring that the output follows a naturalistic sequence rather than a succession of unrelated images.
- The spatial discriminator tries to distinguish between real and generated frames by examining their content and structure. It randomly samples only eight frames per video to reduce the computational load.
- The temporal discriminator analyzes common elements, such as object positions and appearances, from frame to frame. It shrinks image resolution to economize further on computation. The downsampling doesn’t sacrifice fine detail, since the spatial discriminator scrutinizes details separately.
- The generator is trained adversarially against both discriminators, learning to produce highly detailed frames in a realistic sequence (see the sketch below).
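The division of labor among the three networks is easier to see in code. The following is a minimal PyTorch sketch, not DeepMind’s implementation: the tiny linear layers stand in for DVD-GAN’s BigGAN-style convolutional blocks, and the class names, frame count, latent size, and learning rates are illustrative assumptions. It shows a generator driven by a recurrent core, a spatial discriminator that scores eight randomly sampled frames, a temporal discriminator that scores a spatially downsampled clip, and one hinge-loss training step against both discriminators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical, simplified stand-ins for DVD-GAN's components. The real model
# builds on BigGAN's residual blocks; here each piece is reduced to a few
# layers so the dual-discriminator training loop is easy to follow.

class Generator(nn.Module):
    """Maps a noise vector to T frames via a recurrent core (a plain GRU here)."""
    def __init__(self, z_dim=120, hidden=256, frames=16, size=64):
        super().__init__()
        self.frames, self.size = frames, size
        self.rnn = nn.GRU(z_dim, hidden, batch_first=True)   # per-frame features
        self.to_frame = nn.Sequential(                        # decode features to an RGB frame
            nn.Linear(hidden, 3 * size * size), nn.Tanh())

    def forward(self, z):                                     # z: (B, z_dim)
        zs = z.unsqueeze(1).expand(-1, self.frames, -1)       # repeat noise at every time step
        feats, _ = self.rnn(zs)                               # (B, T, hidden), evolves frame to frame
        frames = self.to_frame(feats)                         # (B, T, 3*H*W)
        return frames.view(-1, self.frames, 3, self.size, self.size)

class SpatialD(nn.Module):
    """Judges individual frames; sees only k randomly sampled frames per video."""
    def __init__(self, size=64, k=8):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * size * size, 256),
                                 nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, video):                                 # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        idx = torch.randint(T, (self.k,))                     # sample k frames to cut compute
        frames = video[:, idx].reshape(B * self.k, *video.shape[2:])
        return self.net(frames)                               # one score per sampled frame

class TemporalD(nn.Module):
    """Judges whole clips for coherent motion, on spatially downsampled frames."""
    def __init__(self, size=64, frames=16, down=2):
        super().__init__()
        s = size // down
        self.down = down
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(frames * 3 * s * s, 256),
                                 nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, video):                                 # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        small = F.avg_pool2d(video.reshape(B * T, C, H, W), self.down)  # cheaper full-clip pass
        return self.net(small.reshape(B, -1))

def hinge_d(real_score, fake_score):                          # standard hinge GAN loss for a discriminator
    return F.relu(1 - real_score).mean() + F.relu(1 + fake_score).mean()

# One illustrative training step on stand-in data.
G, Ds, Dt = Generator(), SpatialD(), TemporalD()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(list(Ds.parameters()) + list(Dt.parameters()), lr=4e-4)

real = torch.rand(4, 16, 3, 64, 64) * 2 - 1                   # placeholder for real clips
z = torch.randn(4, 120)

# Discriminator step: both discriminators learn to tell real clips from generated ones.
fake = G(z).detach()
d_loss = hinge_d(Ds(real), Ds(fake)) + hinge_d(Dt(real), Dt(fake))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: fool both discriminators at once.
fake = G(z)
g_loss = -(Ds(fake).mean() + Dt(fake).mean())
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```

Keeping the two critics separate lets each stay cheap: the spatial discriminator never sees the whole clip, and the temporal discriminator never sees full-resolution frames.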
Results: DVD-GAN generates its most realistic results when trained on the Kinetics dataset of 650,000 brief video clips focused on human motion. On the smaller UCF-101 set of action clips, it nonetheless scores 33 percent higher than the previous state of the art on inception score, a measure of the quality and variety of generated output.
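For context on the metric: the inception score exponentiates the average KL divergence between a pretrained classifier’s per-sample class predictions and its average prediction, so it rewards samples that are individually recognizable and collectively diverse. A short illustrative sketch follows; the function and toy data are ours, not from the paper.

```python
import torch

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs from a pretrained classifier
    run on N generated samples (a video classifier in the UCF-101 setting)."""
    p_y = probs.mean(dim=0, keepdim=True)                     # marginal class distribution
    kl = (probs * (torch.log(probs + eps) - torch.log(p_y + eps))).sum(dim=1)
    return torch.exp(kl.mean())                               # higher is better

# Confident and varied predictions score high; indistinct ones score near 1.
varied = torch.eye(10).repeat(10, 1)                # 100 samples spread evenly over 10 classes
print(inception_score(varied))                      # ~10.0
print(inception_score(torch.full((100, 10), 0.1)))  # ~1.0
```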
Yes, but: The current version maxes out at 4 seconds, and it generates lower-resolution output than conventional GANs. “Generating longer and larger videos is a more challenging modeling problem,” the researchers say.
Why it matters: GANs have led the way to an exciting body of techniques for image synthesis. Extending this to video will open up still more applications.