Mistral’s Vision-Language Contender

Mistral unveils Pixtral Large, a rival to top vision-language models

[Table comparing model performance on MathVista, MMMU, ChartQA, DocVQA, and other tasks.]

Mistral AI unveiled Pixtral Large, which rivals top models at processing combinations of text and images.

What’s new: Pixtral Large outperforms a number of leading vision-language models on some tasks. The weights are free for academic and non-commercial use and can be licensed for business use. Access is available via Mistral AI’s website or API for $2/$6 per million input/output tokens. In addition, Pixtral Large now underpins le Chat, Mistral AI’s chatbot, which also gained several new features.
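The listed rates make per-request costs easy to estimate. A minimal sketch, using the $2/$6 per million token pricing above; the token counts in the example are hypothetical illustrations, not measurements:

```python
# Cost estimate at the listed Pixtral Large API rates:
# $2 per million input tokens, $6 per million output tokens.
INPUT_RATE = 2.00 / 1_000_000   # dollars per input token
OUTPUT_RATE = 6.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single API request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical example: a full 131,072-token context plus a 1,000-token reply
print(round(request_cost(131_072, 1_000), 4))  # → 0.2681
```

So even a request that fills the entire context window costs well under a dollar at these rates.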

How it works: Pixtral Large generates text in response to text and images in dozens of languages. It processes 131,072 tokens of context, which is sufficient to track relationships among 30 high-resolution images at a time. Based on Mistral Large 2 (a 123 billion-parameter large language model) and a 1 billion-parameter vision encoder, it demonstrates strong performance across several benchmarks (as reported by Mistral).
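The quoted figures imply a rough per-image token budget. A back-of-envelope sketch, assuming (purely for illustration) that the context window is split evenly across the maximum number of images:

```python
# How many context tokens each image can occupy if the 131,072-token
# window is divided evenly among 30 high-resolution images.
CONTEXT_TOKENS = 131_072
MAX_IMAGES = 30

tokens_per_image = CONTEXT_TOKENS // MAX_IMAGES
print(tokens_per_image)  # 4369
```

In practice text and images share the window unevenly, but this gives a sense of the budget per image.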

  • Mistral compared Pixtral Large to the open weights Llama 3.2 90B and the closed models Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. In Mistral’s tests (as opposed to the other model providers’ reported results, which differ in some cases), Pixtral Large achieved the best performance on four of eight benchmarks that involved analyzing text and accompanying visual elements.
  • For instance, on MathVista (math problems that involve visual elements, using chain-of-thought prompting), it achieved 69.4 percent accuracy, while Gemini 1.5 Pro, the next-best model in Mistral AI’s report, achieved 67.8 percent accuracy. (However, Claude 3.5 Sonnet and OpenAI o1 outperform Pixtral Large on this benchmark according to their developers’ reported results, which Mistral did not include in its comparison.)
  • Pixtral Large powers new features of le Chat including PDF analysis for complex documents and a real-time interface for creating documents, presentations, and code, similar to Anthropic’s Artifacts and OpenAI’s Canvas. Le Chat also gained beta-test features including image generation (via Black Forest Labs’ Flux.1), web search with source citations (using Mistral’s proprietary search engine), and customizable agents that can perform tasks like scanning receipts, summarizing meetings, and processing invoices. These new features are available for free.
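Multimodal requests like the ones described above typically interleave text and images in a single message. A minimal sketch of building such a payload; the field names (`role`, `content` parts, `image_url` with a base64 data URI) follow the common OpenAI-style chat schema and are assumptions here, not documented Mistral API details:

```python
import base64

# Hypothetical payload builder for a vision-language chat request.
# Schema is the common OpenAI-style convention; treat field names as
# assumptions rather than documented Mistral API details.
def build_vision_message(prompt: str, image_bytes: bytes) -> dict:
    """Combine a text prompt and one image into a single user message."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": f"data:image/png;base64,{encoded}"},
        ],
    }

# Usage with placeholder image bytes (a real call would read a PNG file)
msg = build_vision_message("Describe this chart.", b"\x89PNG...")
```

A client would send a list of such messages to the chat endpoint, letting a prompt reference up to 30 images within the model’s context window.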

Behind the news: Pixtral Large arrives as competition intensifies among vision-language models. Meta recently entered the field with Llama 3.2 vision models in 11B and 90B variants. Both Pixtral Large and Llama 3.2 90B offer open weights, and both are smaller and more widely available than Anthropic’s, Google’s, or OpenAI’s leading vision-language models. However, like those models, Pixtral Large falls short of the reported benchmark scores of the smaller, more permissively licensed Qwen2-VL 72B.

Why it matters: Pixtral Large and updates to le Chat signal that vision-language capabilities — combining text generation, image recognition, and visual reasoning — are essential to compete with the AI leaders. In addition, context windows of roughly 128,000 tokens and above have become more widely available, making it possible to analyze lengthy (or multiple) documents that include text, images, and graphs as well as video clips.

We’re thinking: Mistral is helping to internationalize development of foundation models. We’re glad to see major developers emerging in Europe!
