Gemini Thinks Faster

Google’s Gemini 2.0 Flash Thinking advances in reasoning, outperforms DeepSeek-R1

[Figure: Line charts showing performance improvements in math and science with 2.0 Flash Thinking models.]

Google updated the December-vintage reasoning model Gemini 2.0 Flash Thinking and other Flash models, gaining ground on OpenAI o1 and DeepSeek-R1.

What’s new: Gemini 2.0 Flash Thinking Experimental 1-21 is a vision-language model (images and text in, text out) that’s trained to generate a structured reasoning process, or chain of thought. The new version improves on its predecessor’s reasoning capability and extends its context window. While it remains designated “experimental,” it’s free to access via API; it’s also available to paid users of the Gemini app, along with Gemini 2.0 Flash (fresh out of experimental mode) and the newly released Gemini 2.0 Pro Experimental. The company also launched a preview of Gemini 2.0 Flash Lite, a vision-language model that outperforms Gemini 1.5 Flash at the same price.

How it works: Gemini 2.0 Flash Thinking Experimental 1-21 is based on Gemini 2.0 Flash Experimental (parameter count undisclosed). It processes up to 1 million tokens of input context, compared to its predecessor’s 32,000 and o1’s 128,000.

  • Like DeepSeek-R1 and Qwen QwQ, and unlike o1 (which hides its chain of thought), Gemini 2.0 Flash Thinking Experimental 1-21 includes its reasoning in its output.
  • On the graduate-level science exam GPQA-Diamond, it achieved 74.2 percent compared to the earlier version’s 58.6 percent, surpassing DeepSeek-R1 (71.5 percent) but behind o1 (77.3 percent).
  • On the advanced math benchmark AIME 2024, it achieved 73.3 percent compared to the previous version’s 35.5 percent, but it trails behind DeepSeek-R1 (79.8 percent) and o1 (74.4 percent).
  • On the visual and multimedia understanding test MMMU, it achieved 75.4 percent to outperform the previous version (70.7 percent) but fell short of o1 (78.2 percent).
  • Developers can integrate Python code execution via the API, with support for data analysis and visualization through pre-installed libraries (see the sketch after this list).
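
Here’s a minimal sketch of calling the model with code execution enabled, assuming the google-genai Python SDK. The model name, config fields, and response-parsing details are assumptions based on that SDK’s documented patterns, not details confirmed by this article:

```python
# Minimal sketch: asking the model to run Python on our behalf via the
# API's code-execution tool. Assumes the google-genai SDK
# (pip install google-genai); field names are best-guess assumptions.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp-01-21",
    contents="Compute the mean of [3, 1, 4, 1, 5, 9] and plot a histogram.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves reasoning text, the generated Python, and the
# result of executing it as separate parts.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```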

Speed bumps: Large language models that are trained to generate a chain of thought (CoT) gain accuracy, but the additional processing increases inference costs and latency. Reliable measures of Gemini 2.0 Flash Thinking Experimental 1-21’s speed are not yet available, but its base model generates tokens faster (168.8 tokens per second, with 0.46 seconds of latency to the first token, according to Artificial Analysis) than all models in its class except o1-mini (which outputs 200 tokens per second but takes 10.59 seconds to produce its first token).
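
Those two numbers trade off differently depending on output length. A quick back-of-envelope calculation, using only the Artificial Analysis figures quoted above, shows why time to first token dominates for short outputs:

```python
# Estimate end-to-end generation time as: time-to-first-token + n / throughput.
# Figures are the Artificial Analysis measurements quoted above.
def total_time(n_tokens: int, tokens_per_sec: float, ttft_sec: float) -> float:
    return ttft_sec + n_tokens / tokens_per_sec

for name, tps, ttft in [("Gemini 2.0 Flash", 168.8, 0.46),
                        ("o1-mini", 200.0, 10.59)]:
    print(f"{name}: {total_time(1000, tps, ttft):.1f} s for 1,000 tokens")
# Gemini 2.0 Flash: 6.4 s for 1,000 tokens
# o1-mini: 15.6 s for 1,000 tokens
# o1-mini's higher throughput doesn't pay off until outputs approach
# roughly 11,000 tokens, where it amortizes its slow first token.
```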

Why it matters: The combination of CoT reasoning and long context — assuming the new model can take advantage of its 1 million-token context window, as measured by a benchmark such as RULER — could open up valuable applications. Imagine a reasoning model that can take an entire codebase as input and analyze it without breaking it into smaller chunks.
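
As a concrete (hypothetical) illustration of that codebase scenario, the sketch below concatenates a repository’s source files into one prompt and checks the result against a 1 million-token budget. The four-characters-per-token rule of thumb is an assumption for rough estimation, not a property of Gemini’s tokenizer:

```python
# Rough sketch: pack an entire codebase into a single prompt and sanity-check
# it against a 1 million-token context window. Uses a crude ~4 characters per
# token heuristic (an assumption, for estimation only).
from pathlib import Path

CONTEXT_BUDGET_TOKENS = 1_000_000
CHARS_PER_TOKEN = 4  # rough heuristic, not Gemini's actual tokenizer

def pack_codebase(root: str, extensions=(".py", ".md")) -> str:
    """Concatenate source files under root, each prefixed with its path."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in extensions and path.is_file():
            chunks.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(chunks)

prompt = pack_codebase("my_project")  # hypothetical repo directory
est_tokens = len(prompt) // CHARS_PER_TOKEN
print(f"~{est_tokens:,} tokens; fits: {est_tokens <= CONTEXT_BUDGET_TOKENS}")
```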

We’re thinking: Regardless of benchmark performance, this model topped the Chatbot Arena leaderboard at the time of writing. This suggests that users preferred it over o1 and DeepSeek-R1 — at least for common, everyday prompts.
