In reinforcement learning, if researchers want an agent to have an internal representation of its environment, they’ll build and train a world model that it can refer to. New research shows that world models can emerge from standard training, rather than needing to be built separately.
What’s new: Google Brain researchers C. Daniel Freeman, Luke Metz, and David Ha enabled an agent to build a world model by blindfolding it as it learned to accomplish tasks. They call their approach observational dropout.
Key insight: Blocking an agent’s observations of the world at random moments forces it to generate its own internal representation to fill in the gaps. The agent learns this representation without being instructed to predict how the environment will change in response to its actions.
How it works: At every timestep, the agent acts on either its actual observation of the environment or its own prediction of what it couldn’t observe. The agent contains a controller that chooses the most rewarding action; to estimate an action’s potential reward, it also includes a deep net trained using the RL algorithm REINFORCE.
- Observational dropout blocks the agent from observing the environment according to a user-defined probability. When this happens, the agent predicts an observation.
- If random blindfolding blocks several observations in a row, the agent uses its most recent prediction to generate the next one.
- Over many iterations, this procedure produces a sequence of observations and predictions. The agent learns from this sequence, and its ability to predict blocked observations amounts to a world model, as sketched in the code below.
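To make the loop concrete, here is a minimal Python sketch of one episode under observational dropout. It is an illustration, not the authors’ code: the names `env`, `controller`, `world_model`, and `peek_prob` are assumptions, and the environment is assumed to follow the classic Gym interface in which `step` returns four values.

```python
import numpy as np

def rollout(env, controller, world_model, peek_prob=0.1, max_steps=1000):
    """Run one episode with observational dropout.

    With probability peek_prob the agent sees the real observation;
    otherwise it acts on the world model's prediction of that observation.
    """
    obs = env.reset()        # classic Gym API: reset() returns the first observation
    belief = obs             # what the agent acts on (real or predicted)
    total_reward = 0.0
    for _ in range(max_steps):
        action = controller(belief)                  # controller picks the most rewarding action
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        if np.random.rand() < peek_prob:
            belief = next_obs                        # observation gets through the "blindfold"
        else:
            belief = world_model(belief, action)     # fill the gap with a prediction
        if done:
            break
    return total_reward
```

Note that in this sketch `world_model` is shaped only by the rewards the agent earns, never by an explicit next-observation loss, which is the key insight above.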
Results: Observational dropout solved the task known as Cartpole, in which the model must balance a pole upright on a rolling cart, even when its view of the world was blocked 90 percent of the time. In a more complex Car Racing task, in which a model must navigate a car around a track as fast as possible, the model performed almost equally well whether it was allowed to see its surroundings or blindfolded up to 60 percent of the time.
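For a sense of how those blocking probabilities might be applied in practice, here is a hypothetical Gym-style wrapper, again an assumption rather than the authors’ code, that hides an observation with a fixed probability. It assumes the classic four-value `step` API of older `gym` releases.

```python
import gym
import numpy as np

class ObservationalDropoutWrapper(gym.Wrapper):
    """Hypothetical wrapper: hides the real observation with probability drop_prob,
    returning None so the agent must fall back on its own prediction."""

    def __init__(self, env, drop_prob=0.9):
        super().__init__(env)
        self.drop_prob = drop_prob

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if np.random.rand() < self.drop_prob:
            obs = None  # observation blocked; the agent must rely on its world model
        return obs, reward, done, info

# CartPole with 90 percent of observations blocked, mirroring the reported experiment
env = ObservationalDropoutWrapper(gym.make("CartPole-v1"), drop_prob=0.9)
```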
Why it matters: Modeling reality is often part art and part science. World models generated by observational dropout aren’t perfect representations, but they’re sufficient for some tasks. This work could lead to simple-but-effective world models of complex environments that are impractical to model completely.
We’re thinking: Technology being imperfect, observational dropout is a fact of life, not just a research technique. A self-driving car or auto-piloted airplane that relies on sensors prone to dropping data points could cause a catastrophe. This technique could make high-stakes RL models more robust.