People who take driving lessons during the daytime don’t need separate instruction in driving at night. They recognize that the difference doesn’t disturb their knowledge of how to drive. Similarly, a new reinforcement learning method handles superficial variations in the environment without retraining.
What’s new: Nicklas Hansen led a UC Berkeley group in developing Policy Adaptation during Deployment (Pad), which allows agents trained by any RL method to adjust for visual changes that don’t impact the optimal action.
Key insight: Deep reinforcement learning agents often learn to extract important features of the environment and then choose the optimal course of action based on those features. The researchers designed a self-supervised training task that updates a feature extractor to account for environmental changes without disturbing the strategy for selecting actions.
How it works: In most agents, a feature extractor captures visual information about the environment while a controller decides on actions. A change in the surroundings (say, from day to night) causes the feature extractor to derive different features, which can confuse the controller. Once deployed, Pad no longer receives rewards, but it continues to update the feature extractor while leaving the controller untouched. Thus the agent keeps applying the same strategy despite superficial changes in its surroundings.
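To make the division of labor concrete, here’s a minimal PyTorch sketch of the two-part agent. The convolutional layer sizes, 84x84 image observations, 50-dimensional features, and six-dimensional continuous actions are illustrative assumptions, not details drawn from the paper.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Maps raw pixels to a compact feature vector; keeps adapting at deployment."""
    def __init__(self, feature_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 20 * 20, feature_dim),  # assumes 84x84 inputs
        )

    def forward(self, obs):
        return self.net(obs)

class Controller(nn.Module):
    """Maps features to an action; frozen once the agent is deployed."""
    def __init__(self, feature_dim=50, action_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, features):
        return self.net(features)
```

At deployment, only the feature extractor’s parameters continue to change; the controller stays frozen.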
- Pad uses an inverse dynamics network to make the right adjustments without receiving a reward. This network predicts which action caused the transition from one state to the next. In a self-driving car, for example, it would infer that the steering wheel turned left when the car moved from the middle lane to the left lane.
- During training, the feature extractor learns features from the controller’s loss, while the inverse dynamics network learns the environment’s mechanics from the extracted features. The inverse dynamics task is self-supervised: the agent needs only to keep track of where it was, what it did, and where it ended up.
- At deployment, facing a new environment and receiving no rewards, the inverse dynamics network continues to learn. Its prediction error flows back into the feature extractor, encouraging the extractor to adapt to small visual changes (as the sketch below illustrates). The updated extractor should produce features for the new environment similar to those the original version produced for the training environment.
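Putting the pieces together, the self-supervised objective and the reward-free update might look like the following sketch. It mirrors the architecture above with standalone stand-in networks; the mean-squared-error loss, Adam optimizer, and learning rate are assumptions for illustration rather than the authors’ exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM, ACTION_DIM = 50, 6

# Stand-ins for the trained networks; in practice these would be the
# extractor and controller learned during training.
extractor = nn.Sequential(                   # updated at deployment
    nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 20 * 20, FEATURE_DIM),  # assumes 84x84 inputs
)
controller = nn.Sequential(                  # frozen at deployment
    nn.Linear(FEATURE_DIM, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM), nn.Tanh(),
)
inverse_dynamics = nn.Sequential(            # predicts the action taken
    nn.Linear(2 * FEATURE_DIM, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)

# Only the extractor and the inverse dynamics head receive gradients;
# the controller's parameters are left out of the optimizer entirely.
optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(inverse_dynamics.parameters()), lr=1e-3
)

def self_supervised_loss(obs, action, next_obs):
    """Which action explains the transition from obs to next_obs? No reward needed."""
    features = torch.cat([extractor(obs), extractor(next_obs)], dim=-1)
    return F.mse_loss(inverse_dynamics(features), action)

def adapt(obs, action, next_obs):
    """One reward-free update after each step in the deployed environment."""
    loss = self_supervised_loss(obs, action, next_obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def act(obs):
    """The unchanged controller reads features from the continually adapting extractor."""
    with torch.no_grad():
        return controller(extractor(obs))
```

Because the controller’s parameters are excluded from the optimizer, every deployment-time gradient step reshapes the features without touching the strategy for selecting actions.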
Results: The researchers evaluated Pad by training an agent via the soft actor-critic method, then replacing the plain-colored background with a video at test time. On the DeepMind Control Suite, which includes motor-control tasks such as walking, Pad outperformed the soft actor-critic baseline in the new environment on seven of eight tasks.
Yes, but: If the environment doesn’t change, Pad hurts performance (albeit minimally).
Why it matters: To be useful in the real world, reinforcement learning agents must handle the transition from simulated to physical environments and cope gracefully with changes of scenery after they’ve been deployed. While all roads have similar layouts, their backgrounds may differ substantially, and your self-driving car should keep its eyes on the road. Similarly, a personal-assistant robot shouldn’t break down if you paint your walls.
We’re thinking: Robustness is a major challenge in deploying machine learning. The data a system must operate on often differs from the data available for training. We need more techniques like this to accelerate AI deployments.