Training an agent to control a robot arm through a task that involves a sequence of motions (reach, grasp, turn, pull, release), say, opening a door, can take from tens of thousands to millions of examples. A new approach pretrained the system's reward model on many tasks for which lots of data was available, so it needed dramatically fewer human-annotated examples to learn related tasks.
What’s new: Joey Hejna and Dorsa Sadigh at Stanford used a variation on reinforcement learning from human feedback (RLHF) to train an agent to perform a variety of tasks in simulation. The team didn’t handcraft the reward functions. Instead, neural networks learned them.
RLHF basics: A popular approach to tuning large language models, RLHF follows four steps: (1) Pretrain a generative model. (2) Use the model to generate data and have humans assign a score to each output. (3) Given the scored data, train a model — called the reward model — to mimic the way humans assigned scores. Higher scores are tantamount to higher rewards. (4) Use scores produced by the reward model to fine-tune the generative model, via reinforcement learning, to produce high-scoring outputs. In short, a generative model produces an example, a reward model scores it, and the generative model learns based on that score.
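Step 3 is typically implemented as pairwise preference learning: given two outputs and a human judgment of which is better, the reward model learns to assign the preferred one a higher score (a Bradley-Terry-style loss). Below is a minimal PyTorch sketch of that step; the network architecture, trajectory encoding, and hyperparameters are illustrative assumptions, not any particular paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-length trajectory encoding to a scalar reward."""
    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar score per trajectory

def preference_loss(model: nn.Module, better: torch.Tensor,
                    worse: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: the preferred trajectory should score higher."""
    return -F.logsigmoid(model(better) - model(worse)).mean()

# One toy training step on a batch of 32 preference pairs of 64-dim encodings.
model = RewardModel(input_dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
better = torch.randn(32, 64)  # encodings of human-preferred trajectories
worse = torch.randn(32, 64)   # encodings of the rejected trajectories
opt.zero_grad()
preference_loss(model, better, worse).backward()
opt.step()
```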
Key insight: Machine-generated data is cheap, while human-annotated data is expensive. So, if you’re building a neural network to estimate rewards for several tasks that involve similar sequences of motions, it makes sense to pretrain it for a set of tasks using a large quantity of machine-generated data, and then fine-tune a separate copy for each task to be performed using small amounts of human-annotated data.
- The Meta-World benchmark provides machine-generated data for reinforcement learning (RL): it offers simulated environments for several tasks along with trained models that execute them. These models make it possible to record motion sequences together with the model's estimated probability of success for each possible motion. Collecting high- and low-probability sequences yields a large dataset of good and bad motions that translate into high or low rewards (the sketch after this list shows one way to turn such scores into labels).
- Humans can annotate such sequences to create a smaller number of examples of motions and rewards. These examples can be curated to highlight cases that make for more efficient learning.
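One way to turn probability-scored sequences into labeled comparisons is to pick pairs and mark as "better" whichever sequence the task's trained model rated more likely to succeed, skipping pairs that are too close to call. The helper below is a hypothetical illustration of that idea, not Meta-World's actual API.

```python
import random

def make_synthetic_preferences(sequences, num_pairs, margin=0.1, max_tries=100_000):
    """Build (better, worse) pairs from (motion_sequence, success_prob) tuples.

    A pair is labeled by whichever sequence the task model rated more likely
    to succeed; pairs whose probabilities differ by less than `margin` are
    skipped because the label would be too noisy.
    """
    pairs = []
    for _ in range(max_tries):
        if len(pairs) == num_pairs:
            break
        (seq_a, p_a), (seq_b, p_b) = random.sample(sequences, 2)
        if abs(p_a - p_b) < margin:
            continue  # too close to call; an unreliable label would add noise
        better, worse = (seq_a, seq_b) if p_a > p_b else (seq_b, seq_a)
        pairs.append((better, worse))
    return pairs

# Example: fake sequences scored with random success probabilities.
data = [(f"seq_{i}", random.random()) for i in range(100)]
training_pairs = make_synthetic_preferences(data, num_pairs=50)
```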
How it works: The authors trained an RL agent to perform 10 simulated tasks from Meta-World such as pushing a block, opening a door, and closing a drawer. For each task, they fine-tuned a separate pretrained vanilla neural network to calculate rewards used in training the agent.
- The authors pretrained the reward model using a meta-learning method designed to find weights that could be readily fine-tuned for a new task using a small number of examples. Given two motion sequences and their success probabilities (generated by models included in Meta-World), the network was pretrained to decide which of the two was better for executing the task at hand (the first sketch after this list gives a simplified version).
- For six new tasks, the authors generated a small number of motion sequences (between 6 and 20 depending on the task) using their agent. Human annotators labeled which sequences were better or worse for executing the task at hand, and the authors fine-tuned the reward model on these examples.
- Using a small number of motion sequences for the task at hand, the authors trained the agent to complete the task based on rewards calculated by the fine-tuned reward model (the second sketch after this list shows this substitution).
- The authors repeated this loop of fine-tuning the reward model and training the agent, using up to 100 total human-annotated motion sequences per task. They stopped when the agent's performance no longer improved.
- The authors tried the same experiment substituting Meta-World's model-generated probabilities for the human annotations. It took up to 2,500 total sequences for the agent to reach its optimal performance.
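The pretraining objective is a form of meta-learning: find weights that adapt to a new task from just a few labeled pairs. As an illustration (not the authors' exact procedure), here is a sketch using Reptile, a simple first-order meta-learning algorithm. It reuses `RewardModel` and `preference_loss` from the first sketch, and `sample_preference_batch` is a hypothetical interface to machine-labeled pairs for one task.

```python
import copy
import random

import torch

def reptile_pretrain(model, tasks, outer_steps=1000, inner_steps=5,
                     inner_lr=1e-3, outer_lr=0.1):
    """Meta-pretrain a reward model so it fine-tunes quickly on a new task.

    `tasks` is a list of objects with a (hypothetical) sample_preference_batch()
    method returning (better, worse) tensors of machine-labeled pairs.
    """
    for _ in range(outer_steps):
        task = random.choice(tasks)
        adapted = copy.deepcopy(model)  # take inner steps on a throwaway copy
        opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            better, worse = task.sample_preference_batch()
            opt.zero_grad()
            preference_loss(adapted, better, worse).backward()
            opt.step()
        with torch.no_grad():  # nudge meta-weights toward the adapted weights
            for p, q in zip(model.parameters(), adapted.parameters()):
                p.add_(outer_lr * (q - p))
    return model

def finetune_on_human_labels(model, better, worse, steps=100, lr=1e-4):
    """Adapt the pretrained model on a handful of human-annotated pairs."""
    model = copy.deepcopy(model)  # keep a separate fine-tuned copy per task
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        preference_loss(model, better, worse).backward()
        opt.step()
    return model
```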
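After fine-tuning, the reward model stands in for the environment's reward, so any off-the-shelf RL algorithm can train the agent. Here is a minimal sketch of that substitution; `encode`, which turns a state-action pair into the reward model's input tensor, is a hypothetical helper.

```python
import torch

def relabel_with_learned_reward(transitions, reward_model, encode):
    """Swap environment rewards for the learned reward model's scores.

    `transitions` holds (state, action, env_reward, next_state) tuples; the
    env_reward is discarded. `encode` (hypothetical) maps a state-action pair
    to a tensor the reward model accepts.
    """
    relabeled = []
    with torch.no_grad():
        for state, action, _, next_state in transitions:
            reward = reward_model(encode(state, action).unsqueeze(0)).item()
            relabeled.append((state, action, reward, next_state))
    return relabeled
```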
Results: Trained to open a window, the agent achieved 100 percent success after fine-tuning on 64 human-annotated motion sequences. Trained to close a door, it achieved 95 percent success with 100 human-annotated motion sequences. In contrast, using the same numbers of examples, PEBBLE, another RL method that involves human feedback, achieved 10 percent and 75 percent success, respectively. Fed machine-generated examples rather than human feedback, the agent achieved 100 percent success on all Meta-World tasks except pressing a button after fine-tuning on 2,500 examples, 20 times fewer than PEBBLE required to achieve the same performance.
Why it matters: OpenAI famously fine-tuned ChatGPT using RLHF, which yielded higher-quality, safer output. Now this powerful technique can be applied to robotics.
We’re thinking: Pretraining followed by fine-tuning opens the door to building AI systems that can learn new tasks from very little data. It's exciting to see this idea applied to building more capable robots.