A pretrained large language model has helped a robot resolve high-level commands into sequences of subtasks. It does this more reliably with additional training on both vision-language and robotics tasks.
What’s new: Danny Driess and colleagues at Google and Technische Universität Berlin proposed PaLM-E, a large multimodal model designed to help control robots. Given a text command, PaLM-E uses images and sensor data from a robot to resolve the command into a series of low-level subcommands; a separate system converts those subcommands into robotic control signals. The name adds E, for embodied, to that of Google’s large language model PaLM.
Key insight: Large language models tend to perform well if they’re trained on a lot of data. We don’t have a lot of robotics data (that is, records of commands, actions taken, and corresponding sensor readings). We can supplement that with vision-language data, which is plentiful, to help the model learn relationships between words and what a robot sees, and ultimately transfer what it learns to performing robotics tasks.
How it works: PaLM-E comprises a pretrained PaLM large language model and encoders that embed non-text inputs: (i) a pretrained vision transformer (ViT) to embed images and (ii) a vanilla neural network to embed robot sensor data that describes the pose, size, and color of objects in the robot’s view. In addition, the system relies on a motion controller, in this case a pretrained RT-1, that translates words into robotic control signals. Given a high-level command (such as “I spilled my drink, can you bring me something to clean it up?”) plus images or sensor data from the robot, PaLM-E evaluates the robot’s situation and generates lower-level instructions to be fed to the motion controller.
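The sketch below illustrates the general idea of feeding continuous inputs to a language model: project a ViT’s patch features and an object-state vector into the model’s token-embedding space, then splice them into the text sequence. This is a minimal, assumed illustration; the module names, dimensions, and splice position are not taken from the paper.

```python
# Minimal sketch (not PaLM-E's actual code): continuous image and state
# embeddings are projected to the language model's embedding width and
# interleaved with text token embeddings. All sizes are illustrative.
import torch
import torch.nn as nn

D_LM = 512       # language-model embedding width (assumed)
D_VIT = 768      # vision-transformer feature width (assumed)
STATE_DIM = 7    # per-object pose/size/color features (assumed)

class MultimodalPrefix(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(D_VIT, D_LM)          # maps ViT patch features into LM space
        self.state_mlp = nn.Sequential(                  # "vanilla" encoder for sensor-state vectors
            nn.Linear(STATE_DIM, D_LM), nn.ReLU(), nn.Linear(D_LM, D_LM)
        )

    def forward(self, text_embeds, img_feats, state_vec):
        # text_embeds: (T, D_LM) token embeddings from the language model
        # img_feats:   (P, D_VIT) patch features from a ViT
        # state_vec:   (STATE_DIM,) sensor reading for one object
        img_tokens = self.img_proj(img_feats)                 # (P, D_LM)
        state_token = self.state_mlp(state_vec).unsqueeze(0)  # (1, D_LM)
        # Splice the continuous tokens into the text sequence where
        # placeholders such as <img1> would appear (position is illustrative).
        return torch.cat([text_embeds[:3], img_tokens, state_token, text_embeds[3:]], dim=0)

prefix = MultimodalPrefix()
mixed = prefix(torch.randn(8, D_LM), torch.randn(16, D_VIT), torch.randn(STATE_DIM))
print(mixed.shape)  # the language model decodes text over this mixed sequence
```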
- The authors trained the system for visual reasoning (fine-tuning the language model and ViT and training the vanilla neural network from scratch). They used 12 datasets mostly for visual question answering and image captioning. They also used three datasets designed for training robots to manipulate objects, such as Task and Motion Planning (TAMP), in which each example includes a text instruction and lists of initial and final sensor data.
- They formatted the data by interleaving text with embeddings that represented images, for instance, “What happened between <img1> and <img2>?” where <img1> and <img2> were embeddings. Given the interleaved input, the language model produced an answer (for a question-answering task), a caption (for an image captioning task), or an instruction or sequence of instructions (for a robotics task).
- They further trained the system using nearly 3,000 plans generated by SayCan, a system that translates high-level instructions into sequences of subtasks and robotic commands. Given a command, steps taken so far, and an image of the current scene, the language model generated the next step of a plan. For example, given the command to bring something to clean up a spilled drink, and the steps taken so far (“1. Find a sponge, 2. Pick up the sponge,”) plus an image embedding, the language model generated a response such as “3. Bring the sponge to the user.”
- At inference, given a step in the plan, the RT-1 controller converted the words into robot control signals. The robot executed the task and generated a new image or sensor data. Given this output, the original instruction, and previous steps, the encoders produced embeddings and the language model generated the next step. It repeated this process until it generated the output “terminate.”
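As a rough sketch of that loop, the code below alternates between asking a planner for the next step and passing the step to a low-level controller until the planner emits “terminate.” The functions plan_next_step, low_level_controller, and get_observation are hypothetical stand-ins for PaLM-E and an RT-1-style policy, not real APIs.

```python
# Hypothetical closed-loop execution: plan a step, execute it, observe, repeat.
def run_task(instruction, get_observation, plan_next_step, low_level_controller, max_steps=20):
    steps_so_far = []
    for _ in range(max_steps):
        observation = get_observation()                       # latest image or sensor reading
        step = plan_next_step(instruction, steps_so_far, observation)
        if step.strip().lower() == "terminate":
            break
        low_level_controller(step)                            # words -> robot control signals
        steps_so_far.append(step)
    return steps_so_far

# Stubbed example using the spilled-drink scenario from above:
canned = iter(["1. Find a sponge", "2. Pick up the sponge",
               "3. Bring the sponge to the user", "terminate"])
plan = run_task(
    "I spilled my drink, can you bring me something to clean it up?",
    get_observation=lambda: None,
    plan_next_step=lambda instr, steps, obs: next(canned),
    low_level_controller=lambda step: print("executing:", step),
)
print(plan)
```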
Results: The authors evaluated PaLM-E in a simulation where it executed tasks from TAMP, which accounted for 10 percent of its training/fine-tuning data. PaLM-E achieved 94.9 percent success. A version of PaLM-E trained only on TAMP achieved 48.6 percent. SayCan, which was also trained only on TAMP, achieved 36 percent. The authors also tested PaLM-E using two physical robots, qualitatively evaluating its response to commands such as “Bring me the rice chips from the drawer.” The robots were able to follow instructions even when people tried to thwart them (say, by returning the bag of chips to the drawer immediately after the robot had pulled it out). You can watch a video here.
Why it matters: Trained only on robotics data, PaLM-E performed somewhat better than other systems that translate English into robotic control signals. But with additional training on vision-language and language-only tasks, it vastly outperformed them. Training on these apparently unrelated tasks helped the model learn how to control a robot.
We’re thinking: Training on massive amounts of text and images continues to be a key to improving model performance across a wide variety of tasks — including, surprisingly, robotics.