Researchers updated the highly responsive Moshi voice-to-voice model to discuss visual input.
What’s new: Amélie Royer, Moritz Böhle, and colleagues at Kyutai proposed MoshiVis. The weights are free to download under the CC-BY 4.0 license, which permits commercial and noncommercial uses. You can hear examples of its output and chat with a demo.
Key insight: The original Moshi, which manages overlapping voice-to-voice conversations, comprises two transformers: The first outputs a text transcription of its own speech, and the second outputs the speech itself. Because Moshi predicts text tokens as well as speech tokens, adding a vision encoder let the authors fine-tune it not only on image-speech datasets, which are scarce, but also on plentiful image-text datasets. Fine-tuning on this wider variety of images enabled the system to understand images better than fine-tuning solely on image-speech data would have.
How it works: To Moshi, the authors added an image encoder based on pretrained SigLIP, a cross-attention adapter that fuses image information with Moshi's speech and text tokens, and simple fully connected networks trained to act as gates, which determine how much image information to fuse. Specifically, they inserted the adapter and a gate between Moshi's existing self-attention and fully connected layers.
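The gated cross-attention insertion described above can be sketched as a small PyTorch module. This is our own illustration under stated assumptions (module names, dimensions, and the per-token MLP gate are guesses), not Kyutai's released code:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Illustrative adapter: fuses SigLIP image features into the hidden states
    of a frozen transformer block via cross-attention, scaled by a learned gate.
    Names and sizes are assumptions, not Kyutai's implementation."""

    def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
        super().__init__()
        # Queries come from the speech/text stream; keys and values from image patches.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_image, vdim=d_image, batch_first=True,
        )
        # Small fully connected network acting as a gate: per token, how much
        # image information to mix into the stream (0 = none, 1 = full).
        self.gate = nn.Sequential(
            nn.Linear(d_model, d_model // 4),
            nn.SiLU(),
            nn.Linear(d_model // 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # hidden:      (batch, seq_len, d_model)   output of a frozen self-attention layer
        # image_feats: (batch, n_patches, d_image) from the frozen vision encoder
        fused, _ = self.cross_attn(query=hidden, key=image_feats, value=image_feats)
        g = self.gate(hidden)            # (batch, seq_len, 1)
        return hidden + g * fused        # residual add before the fully connected block


# Example dimensions (illustrative only):
adapter = GatedCrossAttentionAdapter(d_model=1024, d_image=1152)
out = adapter(torch.randn(2, 16, 1024), torch.randn(2, 256, 1152))

# Placement, per the description above (pseudocode for one frozen transformer block):
#   hidden = block.self_attention(hidden)
#   hidden = adapter(hidden, image_feats)   # new, trainable
#   hidden = block.feed_forward(hidden)
```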
- The authors fine-tuned MoshiVis on seven datasets. For instance, they produced a vision-speech-to-speech dataset by prompting two Mistral NeMo models to converse about an image, starting from the image's description in the image-text datasets PixMo and DOCCI, and then converting the text dialogue into speech with a custom text-to-speech model. Another example: They used OCR-VQA, an image-text dataset for answering questions about images (no speech data involved).
- They fine-tuned MoshiVis to predict the next token of speech or text in their datasets, training only the newly added adapter and gates while keeping SigLIP and the two Moshi transformers frozen.
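A minimal sketch of that training setup, assuming PyTorch and hypothetical parameter names for the adapter and gate modules (the released code may name and organize things differently):

```python
import torch
import torch.nn as nn

def freeze_all_but_adapters(model: nn.Module) -> list[nn.Parameter]:
    """Freeze the vision encoder and both Moshi transformers; leave only the
    newly added adapters and gates trainable. Substring matching on parameter
    names is an assumption for illustration."""
    trainable = []
    for name, param in model.named_parameters():
        if "adapter" in name or "gate" in name:
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False
    return trainable

# Toy check with a stand-in model (real MoshiVis would replace `model`):
model = nn.ModuleDict({
    "siglip_backbone": nn.Linear(1152, 1024),
    "moshi_block_ffn": nn.Linear(1024, 1024),
    "cross_attn_adapter": nn.Linear(1024, 1024),
    "gate": nn.Linear(1024, 1),
})
params = freeze_all_but_adapters(model)
print(sum(p.numel() for p in params), "trainable parameters")  # adapters + gates only

# Objective (sketch): feed speech or text tokens plus the image, shift targets
# by one position, and apply cross-entropy; only `params` receive gradients.
# optimizer = torch.optim.AdamW(params, lr=1e-4)
```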
Results: MoshiVis is highly responsive in conversation, with latency of roughly 50 milliseconds on a Mac Mini.
- Qualitatively, it transitions smoothly between talking about images and general conversation. However, it sounds more robotic than other recent voice generators.
- Quantitatively, the authors compared MoshiVis to the vision-language model PaliGemma fine-tuned to answer questions about images. Overall, MoshiVis prompted with audio (and images) performed less accurately than PaliGemma prompted with text (and images). For example, on OCR-VQA, MoshiVis achieved roughly 65 percent accuracy while PaliGemma achieved roughly 71 percent accuracy.
Behind the news: MoshiVis joins a small but growing roster of systems that combine vision with speech-to-speech interaction. ChatGPT accepts and generates speech in response to camera views or a user’s phone screen. AnyGPT (open weights, training code, and inference code) accepts and generates speech, text, images, and music. Similarly, Mini-Omni2 (open weights and inference code) accepts and generates text, speech, and images. The authors didn’t compare MoshiVis to these alternatives.
Why it matters: MoshiVis shows an easy way to adapt a speech-to-speech model to a new input modality. It requires training only the newly added adapters and gates, while the earlier AnyGPT and Mini-Omni2, which can also discuss images via voice input and output, require training both adapters and the main model.
We’re thinking: Text-chat models respond appropriately when a user refers to a previous topic or something new, and MoshiVis does, too, in spoken interactions. Evaluations of this capability will become increasingly important as voice-to-voice becomes more widespread.