AI agents typically are designed to operate in a particular software environment. Recent work enabled a single agent to take actions in a variety of three-dimensional virtual worlds.
What's new: A team of 90 people at Google and the University of British Columbia announced Scalable Instructable Multiworld Agent (SIMA), a system that learned to follow text instructions (such as “make a pile of rocks to mark this spot” or “see if you can jump over this chasm”) in seven commercial video games and four research environments.
How it works: SIMA’s architecture comprises several transformers plus a vanilla neural network. The authors trained it to mimic human players using a dataset of gameplay broken into 10-second tasks that included onscreen images, text instructions, keyboard presses, and mouse movements. The video games included Goat Simulator 3 (a third-person game in which the player takes the form of a goat), No Man’s Sky (a first- or third-person game of exploration and survival in outer space), Hydroneer (a first-person game of mining and building), and others.
- Given a text instruction and a frame of onscreen imagery, SPARC (a pair of transformers pretrained on text-image pairs to produce similar embeddings of similar text and images) produced text and image embeddings. Given recent frames, Phenaki (a transformer pretrained to predict future frames in a video) generated a video embedding.
- Given the image, text, and video embeddings, a collection of transformers learned to produce a representation of the current game state. (The authors don’t fully describe this part of the architecture.)
- Given the game representation, a vanilla neural network learned to produce the corresponding keyboard and mouse actions.
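The paper doesn’t fully specify how the embeddings are fused (as noted above), but the description maps onto a familiar pattern: encode each modality, fuse the embeddings with a transformer, and train a feed-forward policy head by behavioral cloning on recorded human actions. Below is a minimal PyTorch sketch of that pattern. All module names, dimensions, and the discretized action space are illustrative assumptions, not SIMA’s actual implementation; the real system reuses the pretrained SPARC and Phenaki encoders rather than the stand-in layers here.

```python
import torch
import torch.nn as nn

EMBED_DIM = 512    # shared embedding width (assumption)
NUM_ACTIONS = 64   # discretized keyboard/mouse actions (assumption)

class SimaLikeAgent(nn.Module):
    """Rough sketch of the pipeline described above (not SIMA's actual code)."""

    def __init__(self):
        super().__init__()
        # Stand-ins for the pretrained SPARC text/image encoders and the
        # Phenaki video encoder; SIMA reuses pretrained transformers here.
        self.text_encoder = nn.Linear(768, EMBED_DIM)    # text-instruction features
        self.image_encoder = nn.Linear(1024, EMBED_DIM)  # current-frame features
        self.video_encoder = nn.Linear(1024, EMBED_DIM)  # recent-frames features
        # Transformer layers that fuse the three embeddings into a
        # representation of the game state (underspecified in the paper).
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # "Vanilla" feed-forward policy head: game representation -> action logits.
        self.policy = nn.Sequential(
            nn.Linear(EMBED_DIM, EMBED_DIM),
            nn.ReLU(),
            nn.Linear(EMBED_DIM, NUM_ACTIONS),
        )

    def forward(self, text_feats, image_feats, video_feats):
        # Treat the three modality embeddings as a 3-token sequence,
        # fuse with self-attention, then mean-pool into one state vector.
        tokens = torch.stack(
            [
                self.text_encoder(text_feats),
                self.image_encoder(image_feats),
                self.video_encoder(video_feats),
            ],
            dim=1,
        )                                         # (batch, 3, EMBED_DIM)
        state = self.fusion(tokens).mean(dim=1)   # (batch, EMBED_DIM)
        return self.policy(state)                 # (batch, NUM_ACTIONS)

# Behavioral cloning: match the human player's recorded action at each step.
agent = SimaLikeAgent()
loss_fn = nn.CrossEntropyLoss()
text = torch.randn(8, 768)      # dummy batch of 8 timesteps
image = torch.randn(8, 1024)
video = torch.randn(8, 1024)
human_actions = torch.randint(0, NUM_ACTIONS, (8,))
loss = loss_fn(agent(text, image, video), human_actions)
loss.backward()
```

Treating the three embeddings as a short token sequence lets standard transformer layers attend across modalities; in a real system, the keyboard-and-mouse action space would likely be factored into multiple output heads rather than a single softmax.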
Results: Judges evaluated SIMA’s success or failure at completing nearly 1,500 instructions that spanned nine categories of tasks such as action (“jump”), navigation (“go to your ship”), and gathering resources (“get raspberries”). In Goat Simulator 3, SIMA completed 40 percent of the tasks. In No Man’s Sky, the judges compared SIMA’s performance to that of the human players whose gameplay produced the training data. SIMA succeeded 34 percent of the time, while the players succeeded 60 percent of the time. The judges also compared SIMA to versions trained to be experts in a single game; SIMA succeeded more than 1.5 times as often as the specialized agents.
Behind the news: SIMA extends Google’s earlier successes building agents that rival or beat human players at individual games including Go, classic Atari games, and StarCraft II.
Why it matters: Training agents to follow directions in various environments, seeing the same things humans would, is a step toward building instructable agents that can work in any situation. The authors point to potential applications in robotics, simulations, and gaming, or anywhere an agent might need to be guided through diverse challenges.
We're thinking: This work shows that an agent trained on multiple games can perform better than an agent trained on just one, and that the richer the language inputs in a game world, the better the agent can perform. With only a handful of training environments under its belt, SIMA doesn’t demonstrate superhuman performance, but it gets the job done a surprising amount of the time!