Google’s Gemini 2.0 Flash, the first member of its updated Gemini family of large multimodal models, combines speed with performance that exceeds that of its earlier flagship model, Gemini 1.5 Pro, on several measures.
What’s new: Gemini 2.0 Flash processes an immense 1 million tokens of input context including text, images, video, and speech, and generates text, images, and speech. Text input/output is available in English, Spanish, Japanese, Chinese, and Hindi, while speech input/output is available in English only for now. It can use tools, generate function calls, and respond via a real-time API, capabilities that underpin a set of pre-built agents that perform tasks like research and coding. Gemini 2.0 Flash is available for free in an experimental preview version via Google AI Studio, Google Developer API, and Gemini Chat.
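For developers, a basic call looks like the snippet below. It's a minimal sketch that assumes the google-genai Python SDK and the experimental preview model name gemini-2.0-flash-exp; check Google AI Studio for current identifiers.

```python
# Minimal text generation with Gemini 2.0 Flash (sketch; assumes the
# google-genai Python SDK and the preview model name "gemini-2.0-flash-exp").
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key obtained from Google AI Studio

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Summarize the key differences between Gemini 1.5 Pro and Gemini 2.0 Flash.",
)
print(response.text)
```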
How it works: Gemini 2.0 Flash (parameter count undisclosed) matches or outperforms several competing models on key benchmarks, according to Google’s report.
- Gemini 2.0 Flash is faster than Gemini 1.5 Flash. It offers relatively low average latency (0.53 seconds to receive the first token, just ahead of Mistral Large 2 and GPT-4o mini) and relatively high output speed (169.5 tokens per second, just ahead of AWS Nova Lite and OpenAI o1 Preview but behind Llama), according to Artificial Analysis.
- It beats Gemini 1.5 Pro on multiple key benchmarks, including measures of language understanding (MMLU-Pro) and visual and multimedia understanding (MMMU). It also excels at competition-level math problems, achieving state-of-the-art results on MATH and HiddenMath. It outperforms Gemini 1.5 Pro when generating Python, Java, and SQL code (Natural2Code and LiveCodeBench).
- Compared to competing models, Gemini 2.0 Flash does well on language and multimedia understanding. On MMLU-Pro, Gemini 2.0 Flash outperforms GPT-4o and is just behind Claude 3.5 Sonnet, according to TIGER-Lab. Google reports a score of 70.7 percent on MMMU, which would put it ahead of GPT-4o and Claude 3.5 Sonnet but behind o1 on the MMMU leaderboard as of this publication date. It does less well on tests of coding ability, in which it underperforms Claude 3.5 Sonnet, GPT-4o, o1-preview, and o1-mini.
- The Multimodal Live API feeds live-streamed inputs from cameras or screens to Gemini 2.0 Flash, enabling real-time applications like live translation and video recognition (see the sketch after this list).
- The model’s multimodal input/output capabilities enable it to identify and locate objects in images and reason about them. For instance, it can locate a spilled drink and suggest ways to clean it up. It can alter images according to natural-language commands, such as turning a picture of a car into a convertible, and explain the changes step by step.
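The object-localization behavior described in the last bullet is driven purely by prompting. Below is a minimal sketch that assumes the google-genai Python SDK, the preview model name gemini-2.0-flash-exp, and a hypothetical local image file; the bounding-box format is simply whatever the prompt requests.

```python
# Sketch: ask Gemini 2.0 Flash to locate an object in an image (assumes the
# google-genai SDK, the preview model name, and a hypothetical "kitchen.jpg").
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
image = Image.open("kitchen.jpg")  # hypothetical example image

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        image,
        "Find the spilled drink. Return a bounding box as [ymin, xmin, ymax, xmax] "
        "and suggest how to clean it up.",
    ],
)
print(response.text)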
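As for the Multimodal Live API mentioned above, the text-only sketch below shows the general shape of a streaming session. It assumes the google-genai Python SDK's asynchronous live.connect interface as documented at launch; audio and video streaming follow the same session pattern with different input types, and the method names may have changed since.

```python
# Sketch of a Multimodal Live API session (text in, text out). Assumes the
# google-genai SDK's async interface and the preview model name; send/receive
# reflect the launch-era docs.
import asyncio
from google import genai

# The Live API sat on the v1alpha surface at launch.
client = genai.Client(api_key="YOUR_API_KEY", http_options={"api_version": "v1alpha"})

async def main():
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Describe what you can see.", end_of_turn=True)
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```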
Agents at your service: Google also introduced four agents that take advantage of Gemini 2.0 Flash’s ability to use tools, call functions, and respond in real time. Most are available via a waitlist.
- Astra, which was previewed in May, is an AI assistant for smartphones (and for prototype augmented-reality glasses that are in beta testing with users in the US and UK). Astra recognizes video, text, images, and audio in real time and integrates with Google services to help manage calendars, send emails, and answer search queries.
- Mariner automatically compares product prices, buys tickets, and organizes schedules on a user’s behalf using a Chrome browser extension.
- Deep Research is a multimodal research assistant that analyzes datasets, summarizes text, and compiles reports. It’s designed for academic and professional research and is available to Gemini Advanced subscribers.
- Jules is a coding agent for Python and JavaScript. Given text instructions, Jules creates plans, identifies bugs, writes and completes code, issues GitHub pull requests, and otherwise streamlines development. Jules is slated for general availability in early 2025.
Behind the news: OpenAI showed off GPT-4o’s capability for real-time video understanding in May, but Gemini 2.0 Flash beat it to the punch: Google launched the new model and its multimodal API one day ahead of ChatGPT’s Advanced Voice with Vision.
Why it matters: Speed and multimodal input/output are valuable characteristics for any AI model, and they’re especially useful in agentic applications. Google CEO Sundar Pichai said he wants Gemini to be a “universal assistant.” The new Gemini-based applications for coding, research, and video analysis are steps in that direction.
We’re thinking: While other large language models can take advantage of search, Gemini 2.0 Flash generates calls to Google Search and uses that capability in agentic tools — a demonstration of how Google’s dominance in search strengthens its efforts in AI.
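For illustration, here's roughly what that looks like in code: a minimal sketch that assumes the google-genai Python SDK, the preview model name gemini-2.0-flash-exp, and the SDK's GoogleSearch tool type. The model decides when to issue a search and grounds its answer in the results.

```python
# Sketch: let Gemini 2.0 Flash call Google Search as a tool (assumes the
# google-genai SDK's GoogleSearch tool type and the preview model name).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="What were the top AI announcements this week?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)  # answer grounded in search results the model retrieved
```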