A real-time video generator lets you explore an open-ended, interactive virtual world — a video game without a game engine.
What’s new: Decart, a startup that’s building a platform for AI applications, and Etched, which designs specialized AI chips, introduced Oasis, which generates a Minecraft-like game in real time. The weights are open and available here. You can play with a demo here.
How it works: The system generates one frame at a time based on a user’s keystrokes, mouse movements, and previously generated frames. The training dataset is undisclosed, but it’s almost certainly based on videos of Minecraft gameplay, given the output’s striking semblance to that game.
- Some recent video generators produce an initial frame, then the nth frame, and then the frames in between. This approach isn’t practical for real-time gameplay. Instead, Oasis learned to generate the next frame. A ViT encoder embeds previously generated frames. Given those embeddings, an embedding of a frame to which noise had been added, and a user’s input, a diffusion transformer learned to remove the noise using a variation on diffusion called diffusion forcing.
- Generated frames may contain glitches, and such errors can snowball if the model incorporates glitches from previous frames into subsequent frames. To avoid this, during training, the system added noise to embeddings of previous frames before feeding them to the transformer to generate the next frame. This way, the transformer learned to ignore glitches while producing new frames.
- At inference, the ViT encoder embeds previously generated frames, and the system adds noise to the frame embeddings. Given the user’s input, the noisy frame embeddings, and a pure-noise embedding that represents the frame to be generated, the transformer iteratively removes the noise from the previous and current frame embeddings. The ViT’s decoder takes the denoised current frame embedding and produces an image.
- The system currently runs on Nvidia H100 GPUs using Decart’s inference technology, which is tuned to run transformers on that hardware. The developers aim to change the hardware to Etched’s Sohu chips, which are specialized for transformers and process Llama 70B at a jaw-dropping 500,000 tokens per second.
Results: The Oasis web demo enables users to interact with 360-by-360-pixel frames at 20 frames per second. Users can place blocks, place fences, and move through a Minecraft-like world. The demo starts with an image of a location, but users can upload an image (turning, say, a photo of your cat into a blocky Minecraft-style level, as reported by Wired).
Yes, but: The game has its fair share of issues. For instance, objects disappear and menus items change unaccountably. The world’s physics are similarly inconsistent. For instance, players don’t fall into holes dug directly beneath them and, after jumping into water, players are likely to find themselves standing on a blue floor.
Behind the news: In February, Google announced Genie, a model that generates two-dimensional platformer games from input images. We weren’t able to find a publicly available demo or model.
Why it matters: Oasis is more a proof of concept than a product. Nonetheless, as an open-world video game entirely generated by AI — albeit based on data produced by a traditional implementation — it sets a bar for future game generators.
We’re thinking: Real-time video generation suggests a wealth of potential applications — say, a virtual workspace for interior decorating that can see and generate your home, or an interactive car repair manual that can create custom clips based on your own vehicle. Oasis is an early step in this direction.