Vision-Language, Compact and Open

Google releases Gemma 3 vision-language models with open weights

Figure: Comparison of Gemini and Gemma models across benchmarks such as MMLU, MATH, and GPQA, shown as radar charts.

Google updated its open-weights family of large language models to include versions that handle image and video inputs.

What’s new: Google released its Gemma 3 multilingual large language models with parameter counts of 1 billion, 4 billion, 12 billion, and 27 billion. While the smallest processes text only, the other three are vision-language models that are small enough to run on consumer hardware.

  • Input/output: Gemma 3 1B: text in (up to 32,000 tokens), text out (up to 8,192 tokens). Gemma 3 4B, 12B, and 27B: text, images, and video in (up to 128,000 tokens), text out (up to 8,192 tokens). Gemma 3 27B generates 24.61 tokens per second and takes 0.68 seconds to produce its first token.
  • Knowledge cutoff: March 2024
  • Architecture: Gemma 3 1B: Transformer. Gemma 3 4B, 12B, and 27B: Transformer plus a SigLIP vision encoder.
  • Features: 140 languages, function calling, structured output.
  • Training data: Gemma 3 1B: 2 trillion tokens of web text, code, and mathematics. Gemma 3 4B, 12B, 27B: between 4 trillion and 14 trillion tokens of text and images.
  • Availability/price: Weights free to download from Hugging Face and Kaggle under a license that allows noncommercial and commercial uses with some restrictions. Available free via Google’s AI Studio.
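
To try the weights yourself, here’s a minimal loading sketch using Hugging Face’s transformers library. It assumes a recent transformers release with Gemma 3 support and that you’ve accepted the Gemma license on the model page; the model ID shown is the 4-billion-parameter instruction-tuned checkpoint, and the exact call signature may vary across library versions.

```python
# Minimal sketch: run a Gemma 3 vision-language checkpoint from Hugging Face.
# Assumes a recent transformers release with Gemma 3 support and an accepted
# Gemma license; adjust the model ID and image URL for your own use.
from transformers import pipeline

generator = pipeline(
    "image-text-to-text",          # vision-language generation task
    model="google/gemma-3-4b-it",  # 4B instruction-tuned checkpoint
    device_map="auto",             # place weights on available GPU(s) or CPU
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Describe this chart in one sentence."},
        ],
    }
]

result = generator(text=messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the model's reply
```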

How it works: Gemma 3 rearchitects and refines earlier Gemma models for higher performance at lower parameter counts.

  • To save memory, Gemma 3 interleaves five local attention layers for every global attention layer. Global attention layers attend to the entire input, while local attention layers attend to a sliding window of the most recent 1,024 tokens (see the attention-mask sketch after this list).
  • The models were fine-tuned to encourage their outputs to match those of an unspecified larger teacher model, a form of knowledge distillation (a generic loss sketch appears after this list).
  • Gemma 3 learned via reinforcement learning in three ways. (i) The models were aligned with human preferences via reinforcement learning from human feedback (RLHF). (ii) They were fine-tuned to solve math problems via reinforcement learning, much like DeepSeek-R1. (iii) They were trained to generate better code via reinforcement learning from execution feedback (RLEF). Specifically, over several rounds of output, RLEF tested generated code on a subset of tests, then prompted the model to fix any bugs. RLEF rewarded the models if their final output passed all tests (the loop is sketched after this list).
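
The interleaved attention pattern is easiest to see as masks. The sketch below (illustrative code, not Google’s implementation) builds causal attention masks for a toy sequence: every sixth layer attends globally, while the others restrict each query to a 1,024-token sliding window, which is what shrinks the key-value cache.

```python
# Illustrative sketch of Gemma 3's 5:1 local/global attention pattern.
import numpy as np

SEQ_LEN = 4096  # toy sequence length
WINDOW = 1024   # span of a local (sliding-window) layer
PATTERN = 6     # five local layers, then one global layer

def attention_mask(layer_idx: int) -> np.ndarray:
    """Return a [SEQ_LEN, SEQ_LEN] boolean causal mask for one layer."""
    q = np.arange(SEQ_LEN)[:, None]   # query positions
    k = np.arange(SEQ_LEN)[None, :]   # key positions
    causal = k <= q                   # never attend to future tokens
    if (layer_idx + 1) % PATTERN == 0:
        return causal                 # global layer: full causal attention
    return causal & (q - k < WINDOW)  # local layer: last 1,024 tokens only

# Local layers only need to cache the most recent WINDOW keys/values,
# while global layers cache all SEQ_LEN of them.
for layer in range(PATTERN):
    kind = "global" if (layer + 1) % PATTERN == 0 else "local"
    print(layer, kind, int(attention_mask(layer).sum()), "allowed query-key pairs")
```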
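
Google doesn’t spell out the distillation objective, so the following is a generic token-level sketch: the student is nudged to match a frozen teacher’s next-token distribution via KL divergence, blended with the usual cross-entropy loss. The function names and weighting are illustrative assumptions, not Gemma 3’s exact recipe.

```python
# Generic knowledge-distillation loss, not Gemma 3's exact recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """student_logits, teacher_logits: [batch, seq, vocab]; labels: [batch, seq]."""
    # Soft targets from the (frozen) larger teacher model.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1)
    )
    return alpha * kl + (1 - alpha) * ce  # blend of imitation and supervision
```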
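
The RLEF loop can be summarized in a few lines. This is a simplified sketch of the process described above, not Google’s code; model.generate and run_tests are hypothetical stand-ins that return generated code and a list of failing tests, respectively.

```python
# Simplified RLEF episode: generate code, repair it using execution feedback,
# then reward the model only if the final program passes all tests.
def rlef_episode(model, problem, public_tests, all_tests, max_rounds=3):
    prompt = problem
    code = model.generate(prompt)                     # hypothetical generation call
    for _ in range(max_rounds):
        failures = run_tests(code, public_tests)      # execute on a subset of tests
        if not failures:
            break
        # Feed the execution feedback back to the model and ask for a fix.
        prompt = (
            f"{problem}\n\nYour code:\n{code}\n\n"
            f"Failed tests:\n{failures}\n\nFix the bugs."
        )
        code = model.generate(prompt)
    # Reward only if the final output passes every test in the full suite.
    reward = 1.0 if not run_tests(code, all_tests) else 0.0
    return code, reward  # the reward then drives the policy update
```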

Performance: Gemma 3 models outperform Gemma 2 models of equal or larger size by several measures, and all sizes show a strong ability to solve competition-style mathematics problems as measured by MATH.

  • In Google’s tests, Gemma 3 1B performs roughly comparably to Gemma 2 2B, outperforming the larger model on LiveCodeBench (1.9 percent to 1.2 percent) and MATH (48.0 percent to 27.2 percent). 
  • Gemma 3 4B achieves roughly comparable performance to Gemma 2 9B, Llama 3.1 8B, and Qwen2.5-7B. It’s slightly behind Microsoft Phi-4 Mini (also 4 billion parameters), except on MATH, according to that company’s tests.
  • Gemma 3 12B improves on Gemma 2 27B and performs comparably to Gemini 1.5 Flash (in TIGER-Lab’s tests) and Anthropic Claude 3.5 Haiku (in that developer’s tests). It outperforms the larger, proprietary models on MATH.
  • Gemma 3 27B consistently outperforms its Gemma 2 counterpart: 67.5 percent versus 56.9 percent on MMLU-Pro (high-level language comprehension), 29.7 percent versus 20.4 percent on LiveCodeBench (coding), 42.4 percent versus 34.3 percent on GPQA Diamond (graduate-level domain knowledge), and 89.0 percent versus 55.6 percent on MATH. It also performs comparably to Gemini 1.5 Pro.
  • Moreover, Gemma 3 27B achieves an Elo score of 1,338 in Chatbot Arena, a top-ten score that puts it ahead of OpenAI o1 and, among models with open weights, behind only DeepSeek-R1.

Hot on Gemma 3’s heels: Shortly after Gemma 3 became available, Mistral released Small 3.1 (24 billion parameters), a vision-language model with open weights, under a more permissive Apache 2.0 license. 

  • Mistral Small 3.1 is similarly multilingual and offers a 128,000-token context window.
  • It slightly outperforms Gemma 3 27B on MMLU, MMLU-Pro, MMMU, and other selected benchmarks.
  • It also outperforms Gemma 3 27B and other models in its size range on long-context tests. (However, Gemma 3 27B performs better in the Chatbot Arena test of human preference.) 

Why it matters: Gemma 3 takes advantage of a variety of techniques to raise the bar for vision-language performance in relatively small models. Knowledge distillation, multiple rounds of reinforcement learning, and fine-tuning on many languages are a powerful combination.

We’re thinking: A vision-language model small enough to run on a smartphone feels increasingly close!
