Microsoft Tackles Voice-In, Text-Out
Microsoft’s Phi-4 Multimodal model can process text, images, and speech simultaneously

Figure: Phi-4 Mini multimodal architecture integrating vision, audio, and text with token merging and LoRA-adapted weights.

Microsoft debuted its first official large language model that responds to spoken input.

What’s new: Microsoft released Phi-4-multimodal, an open weights model that processes text, images, and speech simultaneously. 

  • Input/output: Text, speech, images in (up to 128,000 tokens); text out (0.34 seconds to first token, 26 tokens per second)
  • Performance: State of the art in speech transcription. Comparable to similar models in other tasks
  • Knowledge cutoff: June 2024
  • Architecture: Transformer, 5.6 billion parameters
  • Features: Text-image-speech processing, multilingual, tool use
  • Undisclosed: Training datasets, output size
  • The company also released Phi-4-mini, an open weights 3.8 billion-parameter version of its biggest large language model (LLM), Phi-4. Phi-4-mini outperforms larger models including Llama 3.1 8B and Ministral-2410 8B on some benchmarks.
  • Availability/price: Weights are free to download for noncommercial and commercial use under an MIT license.

How it works: Phi-4-multimodal has six components: Phi-4-mini, vision and speech encoders as well as corresponding projectors (which modify the vision or speech embeddings so the base model can understand them), and two LoRA adapters. The LoRA adapters modify the base weights depending on the input: One adapter modifies them for speech-text problems, and one for vision-text and vision-speech problems.
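As a rough illustration of that Mixture-of-LoRAs design, here is a minimal PyTorch sketch (not Microsoft’s code) in which a frozen base projection receives a low-rank update from whichever adapter matches the input modalities. The class names, rank, and dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Low-rank update x @ A @ B, scaled; only A and B would be trained."""

    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_out))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A @ self.B) * self.scale


class MixtureOfLoRAsLinear(nn.Module):
    """Frozen base weights plus one LoRA adapter per supported modality mix."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        self.adapters = nn.ModuleDict({
            "speech_text": LoRAAdapter(d_in, d_out),
            "vision_text": LoRAAdapter(d_in, d_out),  # also covers vision-speech
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        return self.base(x) + self.adapters[modality](x)


layer = MixtureOfLoRAsLinear(d_in=3072, d_out=3072)
tokens = torch.randn(1, 8, 3072)           # dummy hidden states
speech_out = layer(tokens, "speech_text")  # speech-text problems
vision_out = layer(tokens, "vision_text")  # vision-text and vision-speech problems
```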

  • The speech encoder is a Conformer (which combines convolutional layers with a transformer) and the speech projector is a vanilla neural network. They trained Phi-4-multimodal to convert 2 million hours of speech to text, modifying only the speech encoder and projector. They further trained the system to convert speech to text, translate speech to other languages, summarize speech, and answer questions about speech, modifying only the speech encoder and the speech-text LoRA adapter.
  • The vision encoder is based on a pretrained SigLIP-400M vision transformer, and the vision projector is a vanilla neural network. They trained the model to process text and images in four stages: (i) They trained Phi-4-multimodal to caption images, modifying only the vision projector. (ii) They trained the system on 500 billion tokens to caption images, transcribe text in images, and perform other tasks, modifying only the vision encoder and projector. (iii) They trained the system to answer questions about images, charts, tables, and diagrams and to transcribe text in images, modifying the vision encoder, vision projector, and vision-text LoRA adapter. (iv) Finally, they trained the system to compare images and summarize videos, modifying only the vision projector and vision-text LoRA adapter.
  • To adapt Phi-4-multimodal to combined image and speech input, they trained the system to generate text responses to a subset of the vision-text data whose text had been converted to speech by a proprietary text-to-speech engine, modifying only the vision-text LoRA adapter, vision encoder, and vision projector.
  • Example inference: Given a question as speech and an image, the audio encoder and projector convert the speech to tokens, and the image encoder and projector convert the image into tokens. Given the tokens, Phi-4-multimodal, which uses the weights of Phi-4-mini modified by the vision-text/vision-speech LoRA adapter, generates a text response.
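The inference path in the last bullet can be sketched end to end with stand-in modules. The encoder classes, feature sizes, and hidden dimension below are assumptions, not Phi-4’s actual components.

```python
import torch
import torch.nn as nn

D_MODEL = 3072  # assumed hidden size of the language model


class EncoderStub(nn.Module):
    """Stand-in for the Conformer speech encoder or SigLIP vision encoder."""

    def __init__(self, d_feat: int):
        super().__init__()
        self.net = nn.Linear(d_feat, d_feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Projectors map encoder features into the language model's embedding space.
speech_encoder, speech_projector = EncoderStub(80), nn.Linear(80, D_MODEL)
vision_encoder, vision_projector = EncoderStub(1152), nn.Linear(1152, D_MODEL)

speech_feats = torch.randn(1, 50, 80)    # e.g., audio frames of the spoken question
image_feats = torch.randn(1, 196, 1152)  # e.g., patch features of the image

speech_tokens = speech_projector(speech_encoder(speech_feats))
image_tokens = vision_projector(vision_encoder(image_feats))

# The projected tokens are concatenated and fed to Phi-4-mini's weights with the
# vision-text/vision-speech LoRA adapter active; the language model then decodes
# a text response autoregressively (omitted here).
lm_input = torch.cat([image_tokens, speech_tokens], dim=1)
print(lm_input.shape)  # torch.Size([1, 246, 3072])
```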

Results: The authors compared Phi-4-multimodal to other multimodal models on text-vision, vision-speech, and text-speech tasks.

  • Across 11 text-vision benchmarks, Phi-4-multimodal came in fourth out of 11 models. It outperformed Qwen2.5-VL-3B, Claude 3.5 Sonnet, and GPT-4o mini but trailed Qwen2.5-VL-7B, GPT-4o, and Gemini 2.0 Flash.
  • Across four vision-speech benchmarks, Phi-4-multimodal outperformed Gemini-2.0-Flash, Gemini-2.0-Flash-Lite-preview, and InternOmni by at least 6 percentage points.
  • Phi-4-multimodal outperformed all competitors in Microsoft’s report (including Qwen2-audio, Gemini 2.0 Flash, and GPT-4o) at transcribing speech to text across three datasets. It also achieved competitive performance in speech translation, outperforming its competitors on two of four datasets.

Behind the news: This work adds to the growing body of models with voice-in/text-out capability, including the open weights DiVA model developed by a team led by Diyi Yang at Stanford University.

Why it matters: The architectural options continue to expand for building neural networks that process text, images, audio, and various combinations. While some teams maintain separate models for separate data modalities, like Qwen2.5 (for text) and Qwen2.5-VL (for vision-language tasks), others are experimenting with mixture-of-experts models like DeepSeek-V3. Phi-4-multimodal shows that Mixture-of-LoRAs is an effective approach to processing multimodal data, and it gives developers a couple of new open models to play with.

We’re thinking: Output guardrails have been built to ensure the appropriateness of text output, but they are difficult to apply to a voice-in/voice-out architecture. (Some teams have worked on guardrails that screen audio output directly, but the technology is still early.) For voice-based applications, a voice-in/text-out model can generate a candidate response without a separate, explicit speech-to-text step, and it lets text-based guardrails screen that response before the application decides whether to read it aloud to the user.
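A hypothetical sketch of that flow: the functions respond, violates_policy, and synthesize_speech below are placeholders for the voice-in/text-out model, a text-based guardrail, and a text-to-speech engine, not real Phi-4 or TTS APIs.

```python
from typing import Callable


def answer_spoken_query(
    audio: bytes,
    respond: Callable[[bytes], str],            # voice-in/text-out model
    violates_policy: Callable[[str], bool],     # text-based guardrail
    synthesize_speech: Callable[[str], bytes],  # text-to-speech engine
) -> bytes:
    candidate = respond(audio)  # text candidate generated from spoken input
    if violates_policy(candidate):
        candidate = "Sorry, I can't help with that."
    return synthesize_speech(candidate)  # only approved text is read aloud
```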
