Expressive Synthetic Talking Heads: Microsoft's VASA-1 delivers more lifelike talking-head videos

Previous systems that produce a talking-head video from a photo and a spoken-word audio clip animate the lips and other parts of the face separately. An alternative approach achieves more expressive results by animating the head as a whole.

What’s new: Sicheng Xu and colleagues at Microsoft developed VASA-1, a generative system that uses a facial portrait and spoken-word recording to produce a talking-head video with appropriately expressive motion. You can see its output here.

Key insight: When a person speaks, the facial expression and head position change over time, while the overall shapes of the face and head don’t. By learning to represent an image via separate embeddings for facial expression and head position — which change — as well as for facial structure in its 2D and 3D aspects — which don’t — a latent diffusion model can focus on the parts of the image that matter most. (Latent diffusion is a variant of diffusion that saves computation by processing a small, learned vector of an image instead of the image itself.)
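
Here is a minimal PyTorch sketch of that decomposition. The class name, dimensions, and toy MLP decoder are illustrative assumptions rather than VASA-1’s actual architecture; the point is that the static latents (identity and appearance) are computed once per portrait, while only the pose-and-expression latents vary from frame to frame, so a generator only has to predict the part that moves.

```python
import torch
import torch.nn as nn

# Sketch of the disentangled face representation described above.
# FaceDecoder, the latent sizes, and the MLP are placeholders, not the paper's CNNs.

class FaceDecoder(nn.Module):
    """Renders a frame from static (identity + appearance) and
    dynamic (head pose + facial expression) latent vectors."""
    def __init__(self, dim_static=512, dim_dynamic=256, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(dim_static + dim_dynamic, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size),
        )

    def forward(self, static, dynamic):
        x = torch.cat([static, dynamic], dim=-1)
        return self.net(x).view(-1, 3, self.img_size, self.img_size)

# Static latents are fixed for the whole clip; only the dynamic latents
# (one per frame) need to be generated to animate the face.
static = torch.randn(1, 512)              # identity + appearance, computed once
dynamic_per_frame = torch.randn(30, 256)  # pose + expression, one per frame

decoder = FaceDecoder()
frames = decoder(static.expand(30, -1), dynamic_per_frame)
print(frames.shape)  # torch.Size([30, 3, 64, 64])
```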

How it works: VASA-1 comprises four image encoders (three 2D CNNs and one 3D CNN), one image decoder (another 2D CNN), Wav2Vec 2.0, and a latent diffusion image generator. The authors trained the system, given an image of a face and a recorded voice, to generate a series of video frames that conform to the voice. The training set was VoxCeleb2, which includes over 1 million short videos of celebrities talking. The authors added labels for gaze direction, head-to-camera distance, and an emotional intensity score computed by separate systems.

  • Given an image of a face, the encoders learned to generate embeddings that represented the 2D facial structure (which the authors call “identity”), 3D contours (“appearance”), head position, and facial expression. Given the embeddings, the decoder reconstructed the image. The authors trained the encoders and decoder together using eight loss terms. For instance, one loss term encouraged the system to reconstruct the input image. Another encouraged the system, when it processed a different image of the same person (with a different head position and facial expression), to produce a similar identity embedding.
  • Given a video, the trained encoders produced a sequence of paired head-position and facial-expression embeddings, which the authors call a “motion sequence.”
  • Given the accompanying voice recording, a pretrained Wav2Vec 2.0 model produced a sequence of audio embeddings.
  • Given the audio embeddings that correspond to a series of consecutive frames, the latent diffusion model learned to generate the corresponding embeddings in the motion sequence. It also received other inputs, including previous audio and motion-sequence embeddings, gaze direction, head-to-camera distances, and emotional-intensity scores.
  • At inference, given an arbitrary image of a face and an audio clip, VASA-1 produced the appearance and identity embeddings. Then it produced audio embeddings and motion-sequence embeddings. It generated the final video by feeding the appearance, identity, and motion-sequence embeddings to its decoder (see the sketch after this list).
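
The inference flow above can be condensed into a short, self-contained sketch. Everything here is a placeholder standing in for VASA-1’s real components (the CNN encoders, Wav2Vec 2.0, and the latent diffusion motion generator); it only shows how fixed identity and appearance latents combine with a generated motion sequence to produce frames.

```python
import torch
import torch.nn as nn

# Placeholder inference pipeline. MotionDiffusion and all dimensions are
# assumptions made for illustration, not the paper's components.

class MotionDiffusion(nn.Module):
    """Stand-in for the latent diffusion model: maps a window of audio
    embeddings (plus optional conditioning) to per-frame pose+expression latents."""
    def __init__(self, dim_audio=768, dim_motion=256):
        super().__init__()
        self.proj = nn.Linear(dim_audio, dim_motion)

    def forward(self, audio_embeds, prev_motion=None):
        # A real diffusion model would iteratively denoise here; this single
        # projection just keeps the tensor shapes consistent.
        return self.proj(audio_embeds)

# 1. Encode the portrait once into static latents (identity + appearance).
identity = torch.randn(1, 256)
appearance = torch.randn(1, 256)

# 2. Encode the speech clip into one audio embedding per video frame
#    (standing in for Wav2Vec 2.0 features).
audio_embeds = torch.randn(30, 768)

# 3. Generate the "motion sequence": head pose + expression per frame.
motion = MotionDiffusion()(audio_embeds)            # (30, 256)

# 4. Decode each frame from the fixed static latents plus that frame's motion.
decoder = nn.Linear(256 + 256 + 256, 3 * 64 * 64)   # placeholder renderer
static = torch.cat([identity, appearance], dim=-1).expand(30, -1)
frames = decoder(torch.cat([static, motion], dim=-1)).view(30, 3, 64, 64)
print(frames.shape)  # torch.Size([30, 3, 64, 64])
```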

Results: The authors measured their results by training a model similar to CLIP that scores how well spoken audio matches a video of a person speaking (higher is better). On the VoxCeleb2 test set, their approach achieved a similarity score of 0.468, compared to 0.588 for real video. The nearest contender, SadTalker, which generates lip, eye, and head motions separately, scored 0.441.
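
For intuition, a CLIP-style audio-visual sync score of this kind typically boils down to cosine similarity between embeddings from paired audio and video encoders. The sketch below is a hypothetical illustration of that scoring step with random stand-in embeddings, not the authors’ evaluation model.

```python
import torch
import torch.nn.functional as F

# Hypothetical audio-visual similarity score: cosine similarity between
# L2-normalized embeddings from (unspecified) paired audio and video encoders.

def av_similarity(audio_embed: torch.Tensor, video_embed: torch.Tensor) -> torch.Tensor:
    a = F.normalize(audio_embed, dim=-1)
    v = F.normalize(video_embed, dim=-1)
    return (a * v).sum(dim=-1)

# Stand-in embeddings: one per clip.
audio_embed = torch.randn(4, 512)
video_embed = torch.randn(4, 512)
print(av_similarity(audio_embed, video_embed))  # values in [-1, 1]; higher = better sync
```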

Why it matters: By learning to embed different aspects of a face separately, the system maintained the face’s distinctive, unchanging features while generating appropriate motions. This also made the system more flexible at inference: The authors demonstrated its ability to extract a video’s facial expressions and head movements and apply them to different faces.
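
Because motion and identity live in separate latents, transferring motion amounts to decoding one person’s pose-and-expression sequence with another person’s static latents. A toy sketch, with a placeholder decoder and dimensions:

```python
import torch
import torch.nn as nn

# Placeholder motion transfer: animate person B's portrait with the
# pose+expression sequence extracted from person A's video.
decoder = nn.Linear(512 + 256, 3 * 64 * 64)   # stand-in for the image decoder

static_b = torch.randn(1, 512)    # identity + appearance latents from B's portrait
motion_a = torch.randn(30, 256)   # pose + expression latents extracted from A's video

frames = decoder(torch.cat([static_b.expand(30, -1), motion_a], dim=-1)).view(30, 3, 64, 64)
print(frames.shape)  # torch.Size([30, 3, 64, 64])
```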

We’re thinking: Never again will we take talking-head videos at face value!
