Equally Fluent in Many Languages
Cohere’s Aya Vision beats multilingual rivals in text & image understanding

[Figure: Aya Vision architecture diagram showing the vision encoder, multimodal merging, and LLM backbone for image processing.]

Multilingual AI models often suffer from uneven performance across languages, especially in multimodal tasks. A pair of lean models counters this trend with consistent understanding of text and images across major languages.

What’s new: A team at Cohere led by Saurabh Dash released Aya Vision, a family of multilingual vision-language models with downloadable weights in 8-billion- and 32-billion-parameter sizes.

  • Input/output: Text and images in (up to 2,197 image tokens, up to 16,000 tokens total), text out (up to 4,000 tokens).
  • Availability: Free via WhatsApp or Cohere Playground. Weights available to download, but licensed only for noncommercial uses.
  • Features: Multilingual input and output in 23 languages. 
  • Undisclosed: Knowledge cutoff, training datasets, adapter architecture.

How it works: Each model comprises a pretrained large language model (Aya Expanse for the 32B model, C4AI Command R7B for the 8B version), a pretrained vision encoder (SigLIP 2), and a vision-language adapter (“connector”) of unspecified architecture.
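Cohere hasn’t disclosed the connector’s design. As a rough illustration only, a stack of this shape could be wired together as in the following PyTorch sketch, which assumes an MLP projector and uses stand-in module names rather than Cohere’s implementation:

```python
# Illustrative composition of a vision-language model from pretrained parts.
# The MLP projector below stands in for Aya Vision's undisclosed connector.
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Projects vision-encoder features into the language model's embedding space."""

    def __init__(self, vision_dim: int, text_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_image_tokens, vision_dim)
        return self.proj(image_features)


class VisionLanguageModel(nn.Module):
    """Wires together a pretrained vision encoder, a connector, and a pretrained LLM."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a SigLIP 2 image tower
        self.connector = connector            # maps image features into text-embedding space
        self.language_model = language_model  # e.g., an Aya Expanse or Command R7B backbone

    def forward(self, pixel_values: torch.Tensor, text_embeddings: torch.Tensor):
        image_tokens = self.connector(self.vision_encoder(pixel_values))
        # Prepend projected image tokens to the text embeddings and decode as usual.
        return self.language_model(torch.cat([image_tokens, text_embeddings], dim=1))
```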

  • To establish basic vision-language understanding, the team froze the vision encoder and language model and trained only the vision-language connector (the staged recipe is sketched in code after this list).
  • They fine-tuned the vision-language connector and language model on multimodal tasks. To build the fine-tuning dataset, they generated synthetic annotations for various English-language datasets and translated a large amount of that data into a variety of languages. They then rephrased the translations to add fluency and variety, particularly for languages with little real-world data, matching the rephrased outputs with the original synthetic samples.
  • They merged the fine-tuned vision-language model with the original language model, using an undisclosed method that preserved text capabilities while adding vision understanding.
  • After proving the method at 8 billion parameters, they scaled the recipe up to 32 billion parameters.
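A rough sketch of that staged recipe, reusing the hypothetical classes above (the learning rates are arbitrary, and the linear weight merge is purely illustrative, since Cohere’s actual merging method is undisclosed):

```python
import torch


def stage1_align(model):
    """Stage 1: freeze the vision encoder and LLM; train only the connector."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.language_model.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(model.connector.parameters(), lr=1e-4)


def stage2_finetune(model):
    """Stage 2: unfreeze the LLM and fine-tune it with the connector on multimodal data."""
    for p in model.language_model.parameters():
        p.requires_grad = True
    params = list(model.connector.parameters()) + list(model.language_model.parameters())
    return torch.optim.AdamW(params, lr=1e-5)


def merge(text_only_state: dict, multimodal_state: dict, alpha: float = 0.5) -> dict:
    """Illustrative linear interpolation of text-only and multimodal LLM weights."""
    return {
        k: alpha * multimodal_state[k] + (1.0 - alpha) * text_only_state[k]
        for k in multimodal_state
        if k in text_only_state
    }
```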

Performance: To test the models, the team built and released two benchmarks: m-WildVision, a multilingual version of WildVision Bench’s arena-style competition for discussing images, and AyaVisionBench, which provides 135 image-question pairs per language covering nine tasks including captioning images, understanding charts, recognizing characters in images, visual reasoning, and converting screenshots to code. On both benchmarks, Aya Vision 8B and 32B outperformed larger competitors, as judged by Claude 3.7 Sonnet.

  • In head-to-head competitions on AyaVisionBench, Aya Vision 8B won up to 79 percent of the time against six competitors of similar size. On m-WildVision, it achieved win rates of up to 81 percent against vision-language models of similar size including Qwen2.5-VL 7B, Pixtral 12B, Gemini Flash 1.5 8B, and Llama-3.2 11B Vision. Aya Vision 8B also won 63 percent of the time against Llama-3.2 90B Vision, a model more than 10 times its size.
  • On both benchmarks, Aya Vision 32B outperformed vision-language models more than twice its size, including Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B. On AyaVisionBench, it won between 50 and 64 percent of the time. On m-WildVision, it achieved win rates between 52 and 72 percent across all languages.
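For context, arena-style win rates like these are tallied from pairwise verdicts issued by a judge model. The snippet below shows the arithmetic only; the verdict labels and tie handling are assumptions, since the article specifies just that Claude 3.7 Sonnet served as judge:

```python
from collections import Counter


def win_rate(verdicts: list[str], count_ties_as_half: bool = True) -> float:
    """Fraction of pairwise comparisons won by the first model ("model_a")."""
    counts = Counter(verdicts)
    wins = counts["model_a"] + (0.5 * counts["tie"] if count_ties_as_half else 0.0)
    return wins / len(verdicts)


# Example over 135 prompts, the size of one AyaVisionBench language split.
verdicts = ["model_a"] * 85 + ["model_b"] * 40 + ["tie"] * 10
print(f"{win_rate(verdicts):.0%}")  # 67%
```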

Behind the news: Aya Vision builds on the Cohere-led Aya initiative, a noncommercial effort to build models that perform consistently well in all languages, especially languages that lack high-quality training data. The project started with a multilingual text model (Aya Expanse), added vision (Aya Vision), and plans to eventually add video and audio.

Why it matters: Multilingual vision-language models often perform less well in low-resource languages, and the gap widens when they process media other than text. Aya Vision’s recipe for augmenting synthetic data with successively refined translations may contribute to more universally capable models. Aya Vision is available on the global messaging platform WhatsApp, where it can be used to translate text and images in all 23 of its current languages.

We’re thinking: Multilingual vision models could soon help non-native speakers decipher Turkish road signs, Finnish legal contracts, and Korean receipts. We look forward to a world in which understanding any scene or document is as effortless in Swahili as it is in English.
