A new model improves on recent progress in generating interactive virtual worlds from still images.
What’s new: Jack Parker-Holder and colleagues from Google introduced Genie 2, which generates three-dimensional video game worlds that respond to keyboard inputs in real time. The model’s output remains consistent (that is, elements don’t morph or disappear) for up to a minute, and it spans first-person shooters, walking simulators, and driving games rendered from first-person, third-person, and isometric viewpoints. Genie 2 follows up on Genie, which generates two-dimensional games.
How it works: Genie 2 is a latent diffusion model, made up of an encoder, a transformer, and a decoder, that generates video frames. The developers didn’t reveal how they built it or how they improved on earlier efforts.
- Given video frames, the encoder embeds them. Using those embeddings and keyboard input, the transformer generates the embedding of the next video frame. The decoder takes the new embedding and generates an image.
- At inference, given an image as the starting frame, the encoder embeds it. From that embedding and keyboard input, the transformer generates the embedding of the next frame, which the decoder renders as an image. After the initial frame, the transformer uses the embeddings it generated previously, plus keyboard input, to generate the next embedding (see the sketch after this list).
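Since the developers didn’t release code or architectural details, the PyTorch snippet below is only an illustrative sketch of the inference loop they describe. The Encoder, Dynamics, and Decoder classes, their dimensions, and the keyboard-action encoding are all hypothetical stand-ins, and Genie 2’s real decoder is diffusion-based rather than a single linear layer.

```python
import torch
import torch.nn as nn

EMBED_DIM = 256    # assumed latent size; the real dimensions weren't disclosed
NUM_ACTIONS = 8    # assumed size of the keyboard-action vocabulary
IMG_SIZE = 64      # assumed frame resolution
IMG_CHANNELS = 3


class Encoder(nn.Module):
    """Stand-in: embeds an image frame into a single latent vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(IMG_CHANNELS, 32, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, EMBED_DIM),
        )

    def forward(self, frame):            # frame: (B, 3, H, W)
        return self.net(frame)           # -> (B, EMBED_DIM)


class Dynamics(nn.Module):
    """Stand-in transformer: predicts the next frame embedding from
    the embeddings so far plus the current keyboard action."""
    def __init__(self):
        super().__init__()
        self.action_embed = nn.Embedding(NUM_ACTIONS, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, history, action):  # history: (B, T, E), action: (B,)
        action_token = self.action_embed(action).unsqueeze(1)  # (B, 1, E)
        tokens = torch.cat([history, action_token], dim=1)
        return self.transformer(tokens)[:, -1]  # next-frame embedding


class Decoder(nn.Module):
    """Stand-in: maps a latent embedding back to an image. In Genie 2
    this stage is a diffusion decoder, not a single linear layer."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(EMBED_DIM, IMG_CHANNELS * IMG_SIZE * IMG_SIZE)

    def forward(self, emb):              # emb: (B, EMBED_DIM)
        return self.net(emb).view(-1, IMG_CHANNELS, IMG_SIZE, IMG_SIZE)


@torch.no_grad()
def play(start_frame, actions, encoder, dynamics, decoder):
    """Generate one frame per keyboard action, feeding each predicted
    embedding back into the transformer's context."""
    history = [encoder(start_frame)]     # embed the user-supplied image
    frames = []
    for action in actions:
        context = torch.stack(history, dim=1)   # (B, T, EMBED_DIM)
        next_emb = dynamics(context, action)    # predict the next latent
        frames.append(decoder(next_emb))        # decode latent to pixels
        history.append(next_emb)                # autoregressive feedback
    return frames


encoder, dynamics, decoder = Encoder(), Dynamics(), Decoder()
start = torch.rand(1, IMG_CHANNELS, IMG_SIZE, IMG_SIZE)  # starting image
actions = [torch.tensor([2]), torch.tensor([2]), torch.tensor([5])]
video = play(start, actions, encoder, dynamics, decoder)
print(len(video), video[0].shape)        # 3 frames, each (1, 3, 64, 64)
```

Note that after the first step the transformer conditions only on embeddings it generated itself, so errors can compound; that feedback loop is why keeping worlds consistent for a minute of play is notable.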
Behind the news: Genie 2 arrives on the heels of Oasis, which generates a Minecraft-like game in real time. Compared to Oasis, Genie 2’s worlds are more consistent and aren’t limited to one type of game. Genie 2 also comes at the same time as another video game world generator from World Labs. However, where Genie 2 generates the next frame given previous frames and keyboard input (acting, in terms of game development, as both graphics and physics engines), World Labs’ system generates a 3D mesh of a game world from a single 2D image. That approach leaves the implementation of physics, graphics rendering, the player’s character, and other game mechanics to external software.
Why it matters: Genie 2 extends models that visualize 3D scenes based on 2D images to encompass interactive worlds, a capability that could prove valuable in design, gaming, virtual reality, and other 3D applications. It generates imagery that, the authors suggest, could serve as training data for agents to learn how to navigate and respond to commands in 3D environments.
We’re thinking: Generating gameplay directly, in the manner of Genie 2, is a quick approach to developing a game, but the current technology comes with caveats. Developers can’t yet control a game’s physics or mechanics, and they must work around any flaws in the model (such as a tendency to generate inconsistent worlds). In contrast, generating a 3D mesh, as World Labs does, is a more cumbersome approach, but it gives developers more control.