While Hangzhou’s DeepSeek flexed its muscles, Chinese tech giant Alibaba vied for the spotlight with new open vision-language models.
What’s new: Alibaba announced Qwen2.5-VL, a family of vision-language models (images and text in, text out) in sizes of 3 billion, 7 billion, and 72 billion parameters. The weights for all three models are available for download on Hugging Face, each under a different license: Qwen2.5-VL-3B is free for noncommercial uses, Qwen2.5-VL-7B is free for commercial and noncommercial uses under the Apache 2.0 license, and Qwen2.5-VL-72B is free to developers that serve fewer than 100 million monthly active users. You can try them for free for a limited time in Alibaba Model Studio, and Qwen2.5-VL-72B is available via the model selector in Qwen Chat.
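If you’d rather run the weights locally, they load with Hugging Face’s transformers library. The snippet below is a minimal sketch, assuming a recent transformers release that includes Qwen2.5-VL support and the companion qwen-vl-utils package; the class name, repository ID (Qwen/Qwen2.5-VL-7B-Instruct), and helper calls follow the Hugging Face model card and may change.

```python
# Minimal sketch: query Qwen2.5-VL-7B about an image with Hugging Face transformers.
# Assumes a transformers release that includes Qwen2.5-VL support and the
# qwen-vl-utils helper package (pip install qwen-vl-utils); names may change.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # the Apache 2.0-licensed size
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A chat-style prompt that mixes an image with a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "chart.png"},  # local path or URL
        {"type": "text", "text": "Summarize the main trend in this chart."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[prompt], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]  # drop the echoed prompt
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

Swapping in the 3B or 72B repository ID should work the same way, subject to the licenses described above.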
How it works: Qwen2.5-VL models accept up to 129,024 tokens of input (according to Alibaba’s developer reference; other sources give conflicting numbers) and generate up to 8,192 tokens of output. Alibaba has not released details of how it trained them.
- Qwen2.5-VL comprises a vision encoder and a large language model. It can parse videos, images, and text, and it can control computers (desktop and mobile).
- The vision encoder accepts images of different sizes and represents them with different numbers of tokens depending on their size. For instance, one image might be represented by 8 tokens and another by 1,125 tokens. This enables the model to learn the scale of images and to estimate the coordinates of objects in an image without rescaling it.
- To reduce the computation incurred by the vision encoder, the team replaced full attention (which considers the entire input context) with windowed attention (which limits the input context to a window around a given token), retaining full attention in only four layers. The resulting efficiency speeds up both training and inference. (A rough sketch below illustrates both the variable token counts and the attention savings.)
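To make the numbers above concrete, here’s a back-of-the-envelope sketch, not Alibaba’s code. It assumes the patching scheme reported for the earlier Qwen2-VL family (roughly 28x28 pixels of image per visual token after 2x2 patch merging) and models attention cost as simply quadratic in the number of tokens attended to; the 64-token window is purely illustrative.

```python
# Back-of-the-envelope sketch (not Alibaba's code) of how a native-resolution
# vision encoder maps image size to token count, and why windowed attention
# cuts compute. Assumed numbers: ~28x28 pixels of image per visual token
# (14x14 patches merged 2x2, as reported for Qwen2-VL) and attention cost
# that grows quadratically with the number of tokens attended to.

PIXELS_PER_TOKEN = 28  # ~14-pixel patches merged 2x2 (assumed)

def visual_tokens(width: int, height: int) -> int:
    """Approximate number of visual tokens for an image of the given size."""
    cols = max(1, round(width / PIXELS_PER_TOKEN))
    rows = max(1, round(height / PIXELS_PER_TOKEN))
    return cols * rows

def attention_cost(n_tokens: int, window: int | None = None) -> int:
    """Relative cost of one attention layer: quadratic within each window."""
    if window is None or window >= n_tokens:  # full attention
        return n_tokens ** 2
    n_windows = -(-n_tokens // window)        # ceiling division
    return n_windows * window ** 2            # each window attends only to itself

for w, h in [(112, 56), (640, 480), (1344, 756)]:
    n = visual_tokens(w, h)
    print(f"{w}x{h} px -> ~{n} tokens | relative attention cost: "
          f"full={attention_cost(n):,}, windowed={attention_cost(n, window=64):,}")
```

Under these assumptions, the smallest image works out to roughly 8 tokens, in line with the example above, and the savings from windowed attention grow quickly with image size.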
Results: Alibaba reports Qwen2.5-VL-72B’s performance on measures that span image-and-text problems, document parsing, video understanding, and interaction with computer programs. Across 21 benchmarks, it beat Google Gemini 2.0 Flash, OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, and open competitors on 13 of them (where comparisons are relevant and available).
- For example, on answering math questions about images in MathVista, Qwen2.5-VL-72B achieved 74.8 percent, while the closest competing model (Gemini 2.0 Flash) achieved 73.1 percent.
- On Video-MME, which evaluates a model’s ability to answer questions about videos, Qwen2.5-VL-72B achieved 73.3 percent. GPT-4o achieved 71.9 percent, and InternVL2.5, the next-best open competitor, achieved 72.1 percent.
- Used in an agentic workflow, Qwen2.5-VL-72B outperformed Claude 3.5 Sonnet at controlling Android devices and navigating desktop user interfaces. However, other open vision-language models beat it in several tests.
More models: Alibaba also introduced a competitor to DeepSeek-V3 and a family of smaller models.
- Qwen2.5-Max is a mixture-of-experts model that outperformed GPT-4o and DeepSeek-V3 on graduate-level science questions in GPQA-Diamond and on regularly updated benchmarks like Arena-Hard, LiveBench, and LiveCodeBench. However, it performed worse than o1 and DeepSeek-R1.
- Qwen2.5-1M is a family of smaller language models (7 billion and 14 billion parameters) that accept up to 1 million tokens of input context.
Why it matters: Vision-language models are getting more powerful and versatile. Not long ago, it was an impressive feat simply to answer questions about a chart or diagram that mixed graphics with text. Now such models can be paired with an agent to control computers and smartphones. Broadly speaking, the Qwen2.5-VL models outperform open and closed competitors, and they’re open to varying degrees (though the training data is not available), giving developers a range of highly capable choices.
We’re thinking: We’re happy Alibaba released a vision-language model that is broadly permissive with respect to commercial use (although we’d prefer that all sizes were available under a standard open weights license). We hope to see technical reports that illuminate Alibaba’s training and fine-tuning recipes.