Google’s annual I/O developers’ conference brought a plethora of updates and new models.
What’s new: Google announced improvements to its Gemini 1.5 Pro large multimodal model — notably increasing its already huge input context window — as well as new open models, a video generator, and a further step in digital assistants. In addition, Gemini models will power new features in Google Search, Gmail, and Android.
How it works: Google launched a variety of new capabilities.
- Gemini 1.5 Pro’s maximum input context window doubled to 2 million tokens of text, audio, and/or video: roughly 1.4 million words, 60,000 lines of code, 2 hours of video, or 22 hours of audio. The 2 million-token context window is available in a “private preview” via Google’s AI Studio and Vertex AI. The 1 million-token context window ($7 per 1 million tokens) is generally available on those services in addition to the previous 128,000-token window ($3.50 per 1 million tokens).
- Gemini 1.5 Flash is a faster distillation of Gemini 1.5 Pro that features a 1 million-token context window. It’s available in preview via Vertex AI. Due to be generally available in June, it will cost $0.35 per million tokens of input for prompts up to 128,000 tokens or $0.70 per million tokens of input for longer prompts.
- The Veo video generator can create videos roughly a minute long at 1080p resolution. It can also alter videos, for instance keeping part of the imagery constant and regenerating the rest. A web interface called VideoFX is available via a waitlist. Google plans to roll out Veo to YouTube users.
- Google expanded the Gemma family of open models. PaliGemma, which is available now, accepts text and images and generates text. Gemma 2, which will be available in June, is a 27 billion-parameter large language model that aims to match the performance of Llama 3 70B at less than half the size.
- Gemini Live is a smartphone app for real-time voice chat. The app can converse about photos or video captured by the phone’s camera. In Google’s video demo, it remembers where the user left her glasses! It’s part of Project Astra, a DeepMind initiative that aims to create real-time, multimodal digital assistants.
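To make the pricing above concrete, here is a rough sketch of how the cost of a single prompt might be estimated from the quoted rates. It rests on several assumptions that the announcement does not spell out: that the whole prompt is billed at the rate of the tier it falls into, that Gemini 1.5 Pro's two prices apply to input tokens and split at 128,000 tokens the same way Gemini 1.5 Flash's do, and that output tokens are ignored. The model keys and the function name are hypothetical labels for illustration, not identifiers from Google's API.

```python
# Prices in USD per 1 million input tokens, as quoted in the list above.
# Each entry is (rate for prompts up to 128,000 tokens, rate for longer prompts).
# Model keys are hypothetical labels, not official API model names.
SHORT_PROMPT_LIMIT = 128_000

INPUT_PRICES = {
    "gemini-1.5-pro": (3.50, 7.00),
    "gemini-1.5-flash": (0.35, 0.70),
}


def estimate_input_cost(model: str, prompt_tokens: int) -> float:
    """Estimate the input cost in USD of one prompt.

    Assumption: the entire prompt is billed at the rate of the tier
    determined by its total length, and output tokens are not counted.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    short_rate, long_rate = INPUT_PRICES[model]
    rate = short_rate if prompt_tokens <= SHORT_PROMPT_LIMIT else long_rate
    return prompt_tokens * rate / 1_000_000


if __name__ == "__main__":
    # Compare both models on a short prompt and a full 1 million-token prompt.
    for model in INPUT_PRICES:
        for tokens in (100_000, 1_000_000):
            cost = estimate_input_cost(model, tokens)
            print(f"{model}: {tokens:,} tokens -> ${cost:.3f}")
```

Under these assumptions, filling the 1 million-token window once would cost about $7.00 of input with Gemini 1.5 Pro versus about $0.70 with Gemini 1.5 Flash, a tenfold gap.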
Precautionary measures: Amid the flurry of new developments, Google published protocols for evaluating safety risks. The “Frontier Safety Framework” establishes risk thresholds such as a model’s ability to extend its own capabilities, enable a non-expert to develop a potent biothreat, or automate a cyberattack. While models are in development, researchers will evaluate them continually to determine whether they are approaching any of these thresholds. If so, developers will make a plan to mitigate the risk. Google aims to implement the framework by early 2025.
Why it matters: Gemini 1.5 Pro’s expanded context window enables developers to apply generative AI to multimedia files and archives that are beyond the capacity of other models currently available — corporate archives, legal testimony, feature films, shelves of books — and supports prompting strategies such as many-shot learning. Beyond that, the new releases address a variety of developer needs and preferences: Gemini 1.5 Flash offers a lightweight alternative where speed or cost is at a premium, Veo appears to be a worthy competitor for OpenAI’s Sora, and the new open models give developers powerful options.
We’re thinking: Google’s quick iteration on its Gemini models is impressive. Gemini 1.0 was announced less than six months ago. White-hot competition among AI companies is giving developers more choices, faster speeds, and lower prices.