Vision-Language, Compact and Open

Google releases Gemma 3 vision-language models with open weights

Figure: Comparison of Gemini and Gemma models across benchmarks such as MMLU, MATH, and GPQA, shown as radar charts.

Google updated its open-weights family of large language models to include versions that handle image and video inputs.

What’s new: Google released its Gemma 3 multilingual large language models with parameter counts of 1 billion, 4 billion, 12 billion, and 27 billion. While the smallest processes text only, the other three are vision-language models that are small enough to run on consumer hardware.

  • Input/output: Gemma 3 1B: text in (up to 32,000 tokens), text out (up to 8,192 tokens). Gemma 3 4B, 12B, and 27B: text, images, and video in (up to 128,000 tokens), text out (up to 8,192 tokens). Gemma 3 27B generates 24.61 tokens per second and takes 0.68 seconds to produce its first token.
  • Knowledge cutoff: March 2024
  • Architecture: Gemma 3 1B: Transformer. Gemma 3 4B, 12B, and 27B: Transformer plus a SigLIP vision encoder.
  • Features: 140 languages, function calling, structured output.
  • Training data: Gemma 3 1B: 2 trillion tokens of web text, code, and mathematics. Gemma 3 4B, 12B, 27B: between 4 trillion and 14 trillion tokens of text and images.
  • Availability/price: Weights free to download from Hugging Face and Kaggle under a license that allows noncommercial and commercial uses with some restrictions. Available free via Google’s AI Studio.
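
To try the weights yourself, here’s a minimal loading sketch using Hugging Face’s transformers library. It assumes a recent transformers release with Gemma 3 support and that you’ve accepted the Gemma license on the model page; the model ID shown is the 4-billion-parameter instruction-tuned checkpoint, and the exact call signature may vary across library versions.

```python
# Minimal sketch: run a Gemma 3 vision-language checkpoint from Hugging Face.
# Assumes a recent transformers release with Gemma 3 support and an accepted
# Gemma license; adjust the model ID and image URL for your own use.
from transformers import pipeline

generator = pipeline(
    "image-text-to-text",          # vision-language generation task
    model="google/gemma-3-4b-it",  # 4B instruction-tuned checkpoint
    device_map="auto",             # place weights on available GPU(s) or CPU
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Describe this chart in one sentence."},
        ],
    }
]

result = generator(text=messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the model's reply
```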

How it works: Gemma 3 rearchitects and refines earlier Gemma models for higher performance at lower parameter counts.

  • To save memory, Gemma 3 interleaves five local attention layers for every global attention layer. Global attention layers attend to the entire input, while local attention layers attend to a sliding window of the most recent 1,024 tokens (see the attention-mask sketch after this list).
  • The models were fine-tuned to encourage their outputs to match those of an unspecified larger teacher model, a form of knowledge distillation (a generic loss sketch appears after this list).
  • Gemma 3 learned via reinforcement learning in three ways. (i) The models were aligned with human preferences via reinforcement learning from human feedback (RLHF). (ii) They were fine-tuned to solve math problems via reinforcement learning, much like DeepSeek-R1. (iii) They were trained to generate better code via reinforcement learning from execution feedback (RLEF). Specifically, over several rounds of output, RLEF tested generated code on a subset of tests, then prompted the model to fix any bugs. RLEF rewarded the models if their final output passed all tests (the loop is sketched after this list).
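
The interleaved attention pattern is easiest to see as masks. The sketch below (illustrative code, not Google’s implementation) builds causal attention masks for a toy sequence: every sixth layer attends globally, while the others restrict each query to a 1,024-token sliding window, which is what shrinks the key-value cache.

```python
# Illustrative sketch of Gemma 3's 5:1 local/global attention pattern.
import numpy as np

SEQ_LEN = 4096  # toy sequence length
WINDOW = 1024   # span of a local (sliding-window) layer
PATTERN = 6     # five local layers, then one global layer

def attention_mask(layer_idx: int) -> np.ndarray:
    """Return a [SEQ_LEN, SEQ_LEN] boolean causal mask for one layer."""
    q = np.arange(SEQ_LEN)[:, None]   # query positions
    k = np.arange(SEQ_LEN)[None, :]   # key positions
    causal = k <= q                   # never attend to future tokens
    if (layer_idx + 1) % PATTERN == 0:
        return causal                 # global layer: full causal attention
    return causal & (q - k < WINDOW)  # local layer: last 1,024 tokens only

# Local layers only need to cache the most recent WINDOW keys/values,
# while global layers cache all SEQ_LEN of them.
for layer in range(PATTERN):
    kind = "global" if (layer + 1) % PATTERN == 0 else "local"
    print(layer, kind, int(attention_mask(layer).sum()), "allowed query-key pairs")
```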
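
Google doesn’t spell out the distillation objective, so the following is a generic token-level sketch: the student is nudged to match a frozen teacher’s next-token distribution via KL divergence, blended with the usual cross-entropy loss. The function names and weighting are illustrative assumptions, not Gemma 3’s exact recipe.

```python
# Generic knowledge-distillation loss, not Gemma 3's exact recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """student_logits, teacher_logits: [batch, seq, vocab]; labels: [batch, seq]."""
    # Soft targets from the (frozen) larger teacher model.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1)
    )
    return alpha * kl + (1 - alpha) * ce  # blend of imitation and supervision
```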
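
The RLEF loop can be summarized in a few lines. This is a simplified sketch of the process described above, not Google’s code; model.generate and run_tests are hypothetical stand-ins that return generated code and a list of failing tests, respectively.

```python
# Simplified RLEF episode: generate code, repair it using execution feedback,
# then reward the model only if the final program passes all tests.
def rlef_episode(model, problem, public_tests, all_tests, max_rounds=3):
    prompt = problem
    code = model.generate(prompt)                     # hypothetical generation call
    for _ in range(max_rounds):
        failures = run_tests(code, public_tests)      # execute on a subset of tests
        if not failures:
            break
        # Feed the execution feedback back to the model and ask for a fix.
        prompt = (
            f"{problem}\n\nYour code:\n{code}\n\n"
            f"Failed tests:\n{failures}\n\nFix the bugs."
        )
        code = model.generate(prompt)
    # Reward only if the final output passes every test in the full suite.
    reward = 1.0 if not run_tests(code, all_tests) else 0.0
    return code, reward  # the reward then drives the policy update
```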

Performance: Gemma 3 models outperform Gemma 2 models of equal or larger size by several measures, and all sizes show a strong ability to solve competition-style mathematics problems as measured by MATH.

  • In Google’s tests, Gemma 3 1B performs roughly comparably to Gemma 2 2B, outperforming the larger model on LiveCodeBench (1.9 percent to 1.2 percent) and MATH (48.0 percent to 27.2 percent). 
  • Gemma 3 4B achieves roughly comparable performance to Gemma 2 9B, Llama 3.1 8B, and Qwen2.5-7B. It’s slightly behind Microsoft Phi-4 Mini (also 4 billion parameters), except on MATH, according to that company’s tests.
  • Gemma 3 12B improves on Gemma 2 27B and performs comparably to Gemini 1.5 Flash (in TIGER-Lab’s tests) and Anthropic Claude 3.5 Haiku (in that developer’s tests). It outperforms the larger, proprietary models on MATH.
  • Gemma 3 27B consistently outperforms its Gemma 2 counterpart: 67.5 percent versus 56.9 percent on MMLU-Pro (high-level language comprehension), 29.7 percent versus 20.4 percent on LiveCodeBench (coding), 42.4 percent versus 34.3 percent on GPQA Diamond (graduate-level domain knowledge), and 89.0 percent versus 55.6 percent on MATH. It also performs comparably to Gemini 1.5 Pro.
  • Moreover, Gemma 3 27B achieves an Elo score of 1,338 in Chatbot Arena, a top-ten score that puts it ahead of OpenAI o1 and, among models with open weights, behind only DeepSeek-R1.

Hot on Gemma 3’s heels: Shortly after Gemma 3 became available, Mistral released Small 3.1 (24 billion parameters), a vision-language model with open weights, under a more permissive Apache 2.0 license. 

  • Mistral Small 3.1 is similarly multilingual and offers a 128,000-token context window.
  • It slightly outperforms Gemma 3 27B on MMLU, MMLU-Pro, MMMU, and other selected benchmarks.
  • It also outperforms Gemma 3 27B and other models in its size range on long-context tests. (However, Gemma 3 27B performs better in the Chatbot Arena test of human preference.) 

Why it matters: Gemma 3 takes advantage of a variety of techniques to raise the bar for vision-language performance in relatively small models. Knowledge distillation, multiple rounds of reinforcement learning, and fine-tuning on many languages are a powerful combination.

We’re thinking: A vision-language model small enough to run on a smartphone feels increasingly close!
