A new generation of robots can handle some household chores with unusual skill.
What’s new: Physical Intelligence, a startup based in San Francisco, unveiled π0 (pronounced “pi-zero”), a machine learning system that enables robots to perform housekeeping tasks that require high coordination and dexterity, like folding clothes and cleaning tables. The company also announced $400 million in investments from OpenAI, Jeff Bezos, and several Silicon Valley venture capital firms.
How it works: π0 is a version of the pretrained PaliGemma vision-language model that has been modified for flow matching. (Flow matching is similar to diffusion, in which a model learns to remove noise from inputs to which noise has been added and ultimately generates output by removing noise from an input of pure noise.) A user supplies a text command, and the robot uses its sensor inputs to remove noise from a pure-noise action embedding, yielding an appropriate action.
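To make the flow-matching recipe concrete, here is a minimal, generic sketch in PyTorch, not Physical Intelligence's code: a network learns the direction from noise toward a clean action vector, and at inference that direction is integrated in small steps starting from pure noise. The network, the action dimension, and the number of integration steps are illustrative stand-ins for the conditioned model described below.

```python
# Minimal flow-matching sketch (illustrative only; not the π0 implementation).
import torch
import torch.nn as nn

ACTION_DIM = 14          # hypothetical: e.g., joint targets for two 7-DoF arms
net = nn.Sequential(     # stand-in for the conditioned model that predicts the denoising direction
    nn.Linear(ACTION_DIM + 1, 256), nn.ReLU(), nn.Linear(256, ACTION_DIM)
)

def training_step(clean_action: torch.Tensor) -> torch.Tensor:
    """One training step: regress the direction from noise toward the clean action."""
    noise = torch.randn_like(clean_action)
    t = torch.rand(clean_action.shape[0], 1)              # random timestep in [0, 1]
    noisy = (1 - t) * noise + t * clean_action            # interpolate between noise and data
    target = clean_action - noise                         # the "noise to be removed" direction
    pred = net(torch.cat([noisy, t], dim=-1))
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample_action(steps: int = 10) -> torch.Tensor:
    """Generate an action by integrating the predicted direction, starting from pure noise."""
    x = torch.randn(1, ACTION_DIM)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        x = x + net(torch.cat([x, t], dim=-1)) / steps    # remove a little noise each step
    return x
```

In π0, the prediction is additionally conditioned on the robot's camera images, the text command, and the robot's state, as described in the list below.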
- PaliGemma comprises SigLIP, a vision transformer that turns images into embeddings; a linear layer that adapts the image embeddings to serve as input to the pretrained large language model Gemma; and Gemma itself, which in π0 estimates the noise to be removed from a robot action embedding to which noise has been added.
- The authors modified PaliGemma as follows: (i) They adapted it to accept embeddings that represent the robot's state and previous actions, and to generate embeddings that represent the noise to be removed from noisy robot actions. (ii) They added a vanilla neural network at the input to turn the current timestep into an embedding. (iii) They modified Gemma to be a mixture-of-experts model: One expert, or subset of weights, consists of the pretrained weights, which process image and text embeddings; the other is a new set of weights that processes robot action embeddings (a simplified sketch of this routing appears after this list).
- They pretrained π0 to remove noise from action embeddings. (Since π0 produces embeddings of the noise to be removed, removing that noise is as simple as adding the two embeddings.)
- Training data included the Open X-Embodiment Dataset and a proprietary dataset of 10,000 hours of robotic states (for instance, current positions of a robot’s joints), actions (for instance, motions of the robot’s joints), and associated language commands. The proprietary dataset included data collected from seven different robots (ranging from a single stationary robot arm to two robot arms mounted on a mobile base) and 68 tasks (for example, folding laundry, making coffee, and bussing a table).
- After pretraining, the authors fine-tuned π0 to remove noise from action tokens in 15 further tasks, some of which were not represented in the pretraining set. Fine-tuning on these tasks improved the model’s ability to follow more detailed instructions and perform multi-stage tasks such as packing food into a to-go box.
- At inference, given the robot’s camera view of the surrounding scene, SigLIP embeds the images. A linear layer projects the resulting embeddings to match Gemma’s expected input size and data distribution. Given the images, text command, robot’s state, current timestep, and 50 noisy action tokens (starting with pure noise), Gemma iteratively removes noise. To complete longer tasks, the process repeats: The robot takes more images of the surrounding scene and retrieves its current state, which π0 uses to generate further actions (a simplified sketch of this loop appears below).
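The mixture-of-experts modification in item (iii) above can be pictured as a transformer layer in which all tokens attend to one another but each token’s feed-forward weights depend on its modality. The sketch below is a hedged simplification, not the actual π0 architecture; the layer, dimensions, and token counts are illustrative.

```python
# Simplified "two expert" transformer layer routed by modality (illustrative, not the π0 code).
import torch
import torch.nn as nn

class TwoExpertLayer(nn.Module):
    """Shared self-attention; feed-forward weights chosen per token by modality."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-ins: in π0, one expert is the pretrained Gemma weights, the other is new.
        self.vlm_expert = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.action_expert = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, is_action: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); is_action: (seq,) boolean mask marking robot-action tokens
        attended, _ = self.attn(tokens, tokens, tokens)    # every token attends to every other token
        out = torch.where(
            is_action[None, :, None],      # action tokens go through the new expert weights...
            self.action_expert(attended),
            self.vlm_expert(attended),     # ...image and text tokens through the pretrained weights
        )                                  # (both experts are computed here only for simplicity)
        return tokens + out                # residual connection

layer = TwoExpertLayer()
tokens = torch.rand(1, 60, 64)                         # e.g., 10 image/text tokens + 50 action tokens
mask = torch.tensor([False] * 10 + [True] * 50)
print(layer(tokens, mask).shape)                       # torch.Size([1, 60, 64])
```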
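Putting the pieces together, the inference loop just described might look roughly like the following runnable sketch. Every module here is a tiny linear stand-in rather than the real SigLIP or Gemma, and details such as the number of denoising steps are assumptions, not figures from the article.

```python
# Rough sketch of the π0-style inference loop described above (all modules are stand-ins).
import torch
import torch.nn as nn

EMB, ACTION_DIM, NUM_ACTION_TOKENS, DENOISE_STEPS = 64, 14, 50, 10  # 50 action tokens per step

siglip  = nn.Linear(3 * 224 * 224, EMB)   # stand-in for the SigLIP vision transformer
project = nn.Linear(EMB, EMB)             # linear layer that adapts image embeddings for Gemma
gemma   = nn.Linear(EMB * 4 + ACTION_DIM, ACTION_DIM)  # stand-in for the modified (two-expert) Gemma
embed_txt, embed_state, embed_time = nn.Linear(32, EMB), nn.Linear(7, EMB), nn.Linear(1, EMB)

@torch.no_grad()
def generate_actions(image, command_emb, robot_state):
    """One planning step: denoise a chunk of action tokens conditioned on the observations."""
    img = project(siglip(image.flatten()))                 # images -> embeddings -> Gemma's input space
    txt, state = embed_txt(command_emb), embed_state(robot_state)
    actions = torch.randn(NUM_ACTION_TOKENS, ACTION_DIM)   # start from pure noise
    for i in range(DENOISE_STEPS):
        time = embed_time(torch.tensor([i / DENOISE_STEPS]))
        ctx = torch.cat([img, txt, state, time])            # conditioning context
        step = gemma(torch.cat([ctx.expand(NUM_ACTION_TOKENS, -1), actions], dim=-1))
        actions = actions + step / DENOISE_STEPS            # remove a little noise each step
    return actions

# Longer tasks repeat the loop: observe, generate a chunk of actions, execute, observe again.
for _ in range(3):
    image, state = torch.rand(3, 224, 224), torch.rand(7)   # placeholder camera frame and robot state
    chunk = generate_actions(image, torch.rand(32), state)
    print(chunk.shape)                                       # torch.Size([50, 14])
```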
Results: π0 outperformed the open robotics models OpenVLA, Octo, ACT, and Diffusion Policy, all of which were fine-tuned on the same data, on all tasks tested, as measured by a robot’s success rate in completing each task. For example, using a single robotic arm to stack a set of bowls of four sizes, π0 succeeded about 100 percent of the time, Diffusion Policy about 55 percent, ACT about 45 percent, and OpenVLA and Octo less than 10 percent of the time. Across all tasks, π0’s average success rate was about 80 percent, while Diffusion Policy’s was about 35 percent.
Yes, but: The robot occasionally makes mistakes. In one video, it puts too many eggs into a carton and tries to force it shut. In another, it throws a container off a table instead of filling it with items.
Behind the news: Commercial robotics appears to be undergoing a renaissance. Skild raised $300 million to develop a “general-purpose brain for robots.” Figure AI secured $675 million to build humanoid robots powered by multimodal models. Covariant, which specializes in industrial robotics, licensed its technology to Amazon. (Disclosure: Andrew Ng is a member of Amazon’s board of directors.) OpenAI renewed its robotics effort after dismantling its robotics department in 2020.
Why it matters: Robots have been slow to benefit from machine learning, but the generative AI revolution is driving rapid innovations that make them much more useful. Large language models have made it possible to command robots using plain English. Meanwhile, the team at Physical Intelligence collected a dataset of sufficient size and variety to train the model to generate highly articulated and practical actions. Household robots may not be right around the corner, but π0 shows that they can perform tasks that people need done.
We’re thinking: One of the team members compared π0 to GPT-1 for robotics — an inkling of things to come. Although there are significant differences between text data (which is available in large quantities) and robot data (which is hard to get and varies per robot), it looks like a new era of large robotics foundation models is dawning.