AI systems designed to generate animated 3D scenes that include active human characters have been limited by a shortage of training data that pairs 3D scenes with human motion-capture examples. Generated video clips can get the job done without motion capture.
What’s new: A team led by Hongjie Li, Hong-Xing Yu, and Jiaman Li at Stanford University developed Zero-Shot 4D Human-Scene Interaction (ZeroHSI), a method that animates a 3D human figure interacting with a particular 3D object in a selected 3D scene. You can see its output here.
Key insight: Earlier work aimed to build a generalized model: given a 3D scene, a text prompt, and motion-capture data, a diffusion model learned to alter the positions and rotations of human joints and objects over time. But if the system needs to learn a 3D animation only for a specific example motion, videos can stand in for motion capture. Current video generation models can take an image of a scene and generate a clip of realistic human motion and interactions with a wide variety of objects within it. From there, the system can optimize the 3D poses by minimizing the difference between the generated video frames and rendered images of the scene.
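In effect, each generated video frame supplies supervision for a render-and-compare objective. In our notation (not the paper's), for frame $t$:

$$
\theta_t^{\ast} = \arg\min_{\theta_t} \; \mathcal{L}\big(\mathcal{R}(\text{scene}, \theta_t),\; I_t\big)
$$

where $\theta_t$ collects the human, object, and camera poses, $\mathcal{R}$ is a differentiable renderer, $I_t$ is the generated video frame, and $\mathcal{L}$ combines several loss terms described below.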
How it works: ZeroHSI takes a pre-built 3D scene that includes a 3D human mesh and a 3D object. It uses a rendered image of the scene to generate a video, then uses the video to help compute the motions of the human figure and object within the scene.
- The authors fed ZeroHSI a 3D scene complete with 3D human mesh and 3D object. ZeroHSI rendered an image of the scene, viewed from a default camera pose, using Gaussian splatting.
- ZeroHSI fed the rendered image, along with a prompt that described a human interacting with an object in the scene (“the person is playing guitar while sitting on the sofa”), to Kling, an image-to-video generator. Kling produced a video clip.
- For each generated video frame, ZeroHSI rendered a new image of the 3D scene and minimized a loss function with four terms, which told it how to adjust the poses of the 3D human, 3D object, and camera in the 3D scene to match their poses in the video frame (a sketch of this render-and-compare optimization follows this list). For example, one loss term minimized pixel-level differences between the rendered image and the video frame. Another minimized the distance between the object’s center in the rendered image and its center in a segmentation mask of the video frame produced by SAM 2.
- The system sometimes produced errors: one of the human figure’s hands might fail to touch the object, or the object might penetrate the human figure’s body. To remedy this, for each video frame, the authors refined the poses in a separate phase that minimized three further loss terms. For instance, one term minimized the distance between the surfaces of a hand and the object, encouraging contact while discouraging penetration.
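The per-frame pose fitting can be sketched roughly as follows. This is a conceptual sketch under our own assumptions, not the authors’ code: `render` stands in for a differentiable Gaussian-splatting renderer, the loss weights are hypothetical, and only two of the loss terms are shown.

```python
import torch


def mask_center(mask):
    """Centroid of a soft mask with shape (H, W), in pixel coordinates."""
    ys, xs = torch.meshgrid(
        torch.arange(mask.shape[0], dtype=mask.dtype, device=mask.device),
        torch.arange(mask.shape[1], dtype=mask.dtype, device=mask.device),
        indexing="ij",
    )
    w = mask.sum().clamp(min=1e-6)
    return torch.stack([(xs * mask).sum() / w, (ys * mask).sum() / w])


def fit_frame(render, scene, frame_rgb, frame_mask, init_poses, steps=200, lr=1e-2):
    """Fit human, object, and camera poses to one generated video frame.

    `render` is assumed to be a differentiable renderer that returns an RGB
    image and an object mask for the current poses; `frame_mask` is an
    object mask produced by SAM 2 for the same frame.
    """
    # Poses to optimize: human joints, object 6-DoF pose, camera pose.
    poses = {k: v.clone().requires_grad_(True) for k, v in init_poses.items()}
    opt = torch.optim.Adam(list(poses.values()), lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        rgb, obj_mask = render(scene, poses)  # assumed differentiable renderer

        # Pixel-level difference between the render and the generated frame.
        loss_rgb = (rgb - frame_rgb).abs().mean()

        # Keep the rendered object's center near the SAM 2 mask's center.
        loss_center = (mask_center(obj_mask) - mask_center(frame_mask)).norm()

        # Remaining terms (e.g., temporal smoothness, pose priors) omitted.
        loss = loss_rgb + 0.1 * loss_center  # hypothetical weights
        loss.backward()
        opt.step()

    return {k: v.detach() for k, v in poses.items()}
```

Running this once per video frame, with each frame initialized from the previous frame’s solution, yields a full motion sequence; the authors’ separate refinement phase then cleans up contact and penetration errors.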
Results: The authors evaluated ZeroHSI on their own dataset of 12 3D scenes, each containing a human figure and an object, paired with one to three text prompts that described interactions between the human and the object and/or scene. In 100 evaluations, ZeroHSI outperformed LINGO, a previous state-of-the-art diffusion model trained on matched 3D scenes, 3D objects, and human motion-capture data.
- ZeroHSI achieved a 24.01 average CLIP score, which measures how well text descriptions match images (higher is better), while LINGO achieved 22.99 (a sketch of computing CLIP scores follows this list). ZeroHSI achieved 0.033 average object penetration depth, a measure of the physical plausibility of interactions (lower is better), while LINGO achieved 0.242.
- 400 participants judged whether they preferred ZeroHSI’s or LINGO’s output with respect to realism and alignment with the prompt. 86.9 percent preferred ZeroHSI for realism, and 89.1 percent preferred it for how well its output matched the prompt.
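A CLIP score of this kind can be computed by embedding a rendered frame and its prompt with a pretrained CLIP model, scaling their cosine similarity by 100, and averaging over frames. Below is a minimal sketch using Hugging Face’s transformers library; the checkpoint and scaling convention are common choices, not necessarily the authors’ exact setup.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Common public CLIP checkpoint; the paper may use a different one.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image, prompt):
    """Cosine similarity between CLIP image and text embeddings, scaled by 100."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return 100.0 * (img_emb * txt_emb).sum().clamp(min=0).item()
```

Averaging this value over the rendered frames of a clip gives a per-clip score comparable to the numbers reported above.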
Why it matters: Learning from motion-capture data is problematic in a couple of ways: (i) it’s expensive to produce, and (ii) relatively little of it is available, which limits how well a learning algorithm can generalize from it. Video data, on the other hand, is available in endless variety, enabling video generation models to generalize across a wide variety of scenes, objects, and motions. ZeroHSI takes advantage of generated video to guide a 3D animation cheaply and effectively.
We’re thinking: There’s a lot of progress to be made in AI simply by finding clever ways to use synthetic data.