Qwen’s mid-sized reasoning model scores big, Sesame moves through speech models’ “uncanny valley”

Published Mar 7, 2025
Reading time: 3 min read
A man sitting side by side with his computer at a bar as if they are having a friendly conversation.

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • Cohere’s open vision models support many languages
  • Jamba 1.6’s two hybrid MoE models promise more speed
  • Anthropic overhauls its developer console for Claude 3.7 Sonnet
  • Mistral brings its multilingual/multimedia skills to OCR

But first:

Qwen applies reinforcement learning to a smaller language model

Alibaba’s Qwen team released QwQ-32B, a 32-billion-parameter reasoning model that matches the performance of larger models like DeepSeek-R1 and o1-mini. Built on Qwen 2.5, the model excels at tasks like mathematical reasoning and coding, and incorporates agent-like capabilities for self-criticism and tool use. It’s available for download from Hugging Face and ModelScope, and in environments like Ollama, under an Apache 2.0 license. QwQ-32B shows how scaled reinforcement learning can substantially boost AI capabilities even at relatively modest model sizes. (GitHub and Hugging Face)
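
If you want to kick the tires, here’s a minimal sketch of loading QwQ-32B with Hugging Face Transformers. The repo ID Qwen/QwQ-32B, the chat-template call, and the token budget are assumptions based on Qwen’s usual packaging, not an official recipe; a 32B model also needs substantial GPU memory (Ollama offers quantized builds if it doesn’t fit).

```python
# Minimal sketch (assumptions): repo ID "Qwen/QwQ-32B", a standard chat template,
# and enough GPU memory for a 32B model. Not an official example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # assumed Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Reasoning models spend tokens on their chain of thought, so leave a generous budget.
messages = [{"role": "user", "content": "How many prime numbers are less than 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=4096)

# Print only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```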

Sesame unveils expressive, context-aware speech system

Sesame introduced the Conversational Speech Model (CSM), an end-to-end multimodal learning system designed to generate more natural and contextually appropriate AI speech. The model uses transformers to process both text and audio inputs, leveraging conversation history to produce coherent speech with improved expressivity and efficiency. Sesame’s work addresses limitations in current text-to-speech systems and aims to create AI companions with “voice presence” that can engage in genuine dialogue. The company released a demo and made its models available under an Apache 2.0 license. (Sesame)

Cohere releases multilingual vision-language models

Cohere introduced Aya Vision, a family of open-weight multimodal models designed to understand language and images across 23 languages. The 8B- and 32B-parameter models outperform larger competitors on multilingual benchmarks by leveraging techniques like synthetic annotations, data scaling, and multimodal model merging. This release adds strong multilingual capabilities to multimodal AI models, potentially enabling more inclusive and globally accessible AI applications. (Cohere and Hugging Face)
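
For a sense of how the open weights might be used, here’s a minimal sketch with Hugging Face Transformers. The repo ID CohereForAI/aya-vision-8b, the AutoModelForImageTextToText class, and the chat-template message format are assumptions based on how recent vision-language releases are typically packaged; check the model card for the exact usage.

```python
# Minimal sketch (assumptions): repo ID "CohereForAI/aya-vision-8b" and a recent
# transformers release that supports image-text-to-text chat templates.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # assumed Hugging Face repo ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# A multilingual prompt paired with an image URL (placeholder image).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe esta imagen en una frase."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
# Decode only the generated continuation, not the prompt.
print(processor.tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```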

Jamba’s hybrid architecture gets a model update

AI21 Labs released Jamba 1.6, a family of open mixture-of-experts language models with a hybrid Mamba-transformer architecture. The company reports that Jamba Large 1.6 (398 billion parameters, 94 billion active) outperforms Mistral Large 2, Llama 3.3 70B, and Command R+ on ArenaHard, LongBench, and other benchmarks, while Jamba Mini 1.6 (52 billion parameters, 12 billion active) surpasses Ministral 8B, Llama 3.1 8B, and Command R7B. AI21 Labs highlights Jamba 1.6’s 256K token context window, its speed, and its performance on RAG and long-context question-answering tasks. (AI21 Labs and Hugging Face)

Anthropic upgrades developer console with new collaboration features

Anthropic redesigned its console to streamline AI development with Claude, adding features like shareable prompts for team collaboration and support for the Claude 3.7 Sonnet model. The console now offers tools to write, evaluate, and optimize prompts, including automatic prompt generation and refinement capabilities. These upgrades aim to help developers build more reliable AI applications by improving prompt quality and enabling better teamwork across organizations. (Anthropic)

Mistral introduces new OCR API for advanced document processing

Mistral OCR extracts content from complex text-and-image documents, outperforming competitors like Microsoft Azure and Gemini 2.0 Flash in speed and accuracy benchmarks across various document types and languages. The API processes up to 2,000 pages per minute, supports document-as-prompt functionality, and offers structured output options. Improved OCR enables organizations to unlock insights from their document repositories and makes those documents easier for AI systems to process, potentially accelerating research, preserving cultural heritage, and improving customer service. (Mistral)
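
Here’s a minimal sketch of calling the OCR API from the mistralai Python client. The ocr.process method, the model name mistral-ocr-latest, and the page-by-page markdown output are assumptions drawn from Mistral’s typical API conventions; consult the official docs for the exact parameters.

```python
# Minimal sketch (assumptions): the mistralai Python client exposes an OCR endpoint
# named ocr.process and a model called "mistral-ocr-latest". Placeholder document URL.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",  # assumed model name
    document={
        "type": "document_url",
        "document_url": "https://example.com/report.pdf",  # placeholder document
    },
)

# The response is assumed to be organized page by page, with extracted text as markdown.
for page in ocr_response.pages:
    print(page.markdown)
```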


Still want to know more about what matters in AI right now?

Read this week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng discussed the challenges of Voice Activity Detection (VAD) in noisy environments and highlighted Moshi, a model that continuously listens and decides when to speak, eliminating the need for explicit turn-taking detection. He emphasized ongoing innovations in voice AI and the potential for improved voice-to-voice interactions.

“Given the importance of foundation models with voice-in and voice-out capabilities, many large companies right now are investing in developing better voice models. I’m confident we’ll see many more good voice models released this year.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: Inception Labs released Mercury Coder, a fast text generator with a non-transformer architecture that may be the first commercially available language diffusion model; OpenAI unveiled GPT-4.5, its most powerful non-reasoning model to date, promising enhanced performance and efficiency; Claude 3.7 Sonnet introduced a budget for reasoning tokens, a hybrid approach to reasoning models; and Amazon launched Alexa+, integrating generative AI and intelligent agents powered by Claude and other models to create a more advanced voice assistant.


Subscribe to Data Points

Your accelerated guide to AI news and research