Even cutting-edge, end-to-end, speech-to-speech systems like ChatGPT’s Advanced Voice Mode tend to get interrupted by interjections like “I see” and “uh-huh” that keep human conversations going. Researchers built an open alternative that’s designed to go with the flow of overlapping speech.
What’s new: Alexandre Défossez, Laurent Mazaré, and colleagues at Kyutai, a nonprofit research lab in Paris, released Moshi, an end-to-end, speech-to-speech system that’s always listening and always responding. The weights and code are free for noncommercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses. You can try a web demo here.
Key insight: Up to 20 percent of spoken conversation consists of overlapping speech, including interjections like “okay” and “I see.”
- To respond appropriately despite such overlaps, a system must both listen and generate sound continuously — although much of what it will generate is silence.
- To respond without delay, it must keep latency to a minimum. This goal requires an end-to-end design rather than a pipeline of stand-alone models to perform voice detection, speech-to-text, text processing, and text-to-speech in turn.
How it works: The authors combined an encoder-decoder called Mimi with an RQ-Transformer, which is made up of the Helium transformer-based large language model (LLM) plus a smaller transformer that models the tokens within each timestep.
- Mimi’s encoder embedded spoken input using 8 audio tokens per timestep (80 milliseconds). The authors trained Mimi on 7 million hours of mostly English speech from undisclosed sources. The training involved two loss terms (sketched in code after this list). The first encouraged Mimi, given one audio timestep, to produce audio that fooled a pretrained MS-STFT discriminator into thinking it was human speech. The second distilled knowledge from WavLM, a pretrained audio embedding model: when Mimi and WavLM received the same audio timestep, it encouraged Mimi’s encoder to produce one audio token (of its 8 per timestep) whose embedding was similar to the corresponding embedding produced by WavLM.
- Given the audio tokens, the Helium LLM produced text tokens that were used internally to help the additional transformer predict the next audio token (the idea being that the LLM’s skill with words would inform which audio token to generate next). The authors trained Helium to predict the next text token in 2.1 trillion tokens of English text (12.5 percent from Wikipedia and Stack Exchange, and the remaining 87.5 percent from Common Crawl).
- At each timestep, RQ-Transformer received a set of 17 tokens: 8 audio tokens encoded by Mimi from the audio input, 8 audio tokens from Moshi’s previously generated audio output, and 1 text token produced by Helium (see the layout sketch after this list). RQ-Transformer learned to predict the next set of 17 tokens in 7 million hours of audio and transcribed text.
- To train the system specifically on conversational interaction, the authors further trained it to predict the next token in 2,000 hours of recorded phone conversations between randomly paired participants.
- At inference, given a user’s speech, Mimi turned it into audio tokens. Given those tokens plus the audio and text tokens it had generated previously, RQ-Transformer generated the next audio and text tokens. From the generated audio tokens, Mimi’s decoder produced synthetic speech.
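To make Mimi’s two training signals concrete, here is a minimal PyTorch sketch of such a combined objective. The module shapes, the hinge-style adversarial term, and the cosine-similarity distillation term are all illustrative assumptions, not Kyutai’s implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for Mimi's encoder/decoder, the MS-STFT discriminator, and WavLM.
# All shapes and modules here are illustrative assumptions, not Kyutai's code.
frame = torch.randn(4, 1280)                 # a batch of 80 ms audio frames (dummy size)
encoder = torch.nn.Linear(1280, 8 * 64)      # yields 8 token embeddings of width 64 per frame
decoder = torch.nn.Linear(8 * 64, 1280)      # reconstructs the audio frame
discriminator = torch.nn.Linear(1280, 1)     # placeholder for the pretrained MS-STFT discriminator
wavlm_teacher = torch.nn.Linear(1280, 64)    # placeholder for a frozen WavLM embedder

token_embeddings = encoder(frame).view(-1, 8, 64)     # 8 "tokens" per timestep
reconstruction = decoder(token_embeddings.flatten(1))

# Loss 1 (adversarial): the reconstruction should look like real speech to the
# discriminator; a hinge-style generator loss is assumed here.
adversarial_loss = F.relu(1.0 - discriminator(reconstruction)).mean()

# Loss 2 (distillation): the first of the 8 token embeddings should stay close
# (in cosine similarity) to WavLM's embedding of the same frame.
with torch.no_grad():
    teacher_embedding = wavlm_teacher(frame)
semantic_token = token_embeddings[:, 0]
distillation_loss = 1.0 - F.cosine_similarity(semantic_token, teacher_embedding, dim=-1).mean()

total_loss = adversarial_loss + distillation_loss
print(float(adversarial_loss), float(distillation_loss), float(total_loss))
```

The point the sketch highlights: only one of the 8 token embeddings is pulled toward WavLM, leaving the other 7 free to capture acoustic detail.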
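The 17-token layout and the always-listening, always-responding loop are easier to see in code. The self-contained sketch below uses a random stand-in for RQ-Transformer; the vocabulary sizes, function names, and stream ordering are illustrative assumptions.

```python
import random

CODEBOOK_SIZE = 2048   # illustrative vocabulary size per audio codebook (assumed)
STEP_MS = 80           # one timestep of audio, per the description above

def predict_next_step(history):
    """Stand-in for RQ-Transformer: given all previous 17-token timesteps,
    predict the next step. Random tokens here, just to show the data layout."""
    text_token = random.randrange(32000)                               # 1 Helium text token
    moshi_audio = [random.randrange(CODEBOOK_SIZE) for _ in range(8)]  # 8 output audio tokens
    return text_token, moshi_audio

def converse(user_audio_frames):
    """Full-duplex loop: every 80 ms, fold in the user's 8 input audio tokens
    and emit 8 output audio tokens (plus 1 internal text token), even when
    the sensible output is silence."""
    history = []
    for user_tokens in user_audio_frames:          # 8 Mimi tokens per incoming frame
        text_token, moshi_tokens = predict_next_step(history)
        # One timestep = 17 tokens: 8 user audio + 8 Moshi audio + 1 text.
        history.append({"user": user_tokens, "moshi": moshi_tokens, "text": text_token})
        yield moshi_tokens                          # Mimi's decoder would turn these into speech

# Example: two seconds of input = 25 frames of 8 (dummy) user tokens each.
frames = [[0] * 8 for _ in range(2000 // STEP_MS)]
output_frames = list(converse(frames))
print(len(output_frames), "output frames of", len(output_frames[0]), "audio tokens each")
```

In the real system, Mimi’s decoder would convert each emitted frame of 8 audio tokens into 80 milliseconds of speech, including frames that decode to silence while the user is talking.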
Results: In tests, Moshi proved fast and relatively accurate.
- Moshi (7 billion parameters) took around 200 milliseconds to respond to user input (a rough frame-level breakdown appears after this list). In comparison, GPT-4o, which also produces speech output directly from speech input, took 232 milliseconds minimum (320 milliseconds average). Prior to GPT-4o, ChatGPT Voice Mode (a pipeline of speech-to-text, text-to-text, and text-to-speech models) took an average of 5.4 seconds.
- Moshi achieved 26.6 percent accuracy on Web Questions, higher than the speech-to-text-to-speech models tested by the authors: Spectron (1 billion parameters) achieved 6.1 percent accuracy and SpeechGPT (7 billion parameters) achieved 6.5 percent accuracy. The authors didn’t provide comparable results for GPT-4o or ChatGPT Voice.
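As a back-of-the-envelope check (our arithmetic, not the authors’), the 80-millisecond timestep alone imposes a floor on response time if the system must hear one full frame before it can emit one:

```python
frame_ms = 80                        # one Mimi/Moshi timestep, per the description above
assumed_floor = 2 * frame_ms         # hear one full frame, then emit one (our assumption)
reported_ms = 200                    # Moshi's reported response time
print(f"frame-imposed floor: {assumed_floor} ms, leaving ~{reported_ms - assumed_floor} ms for compute")
```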
Why it matters: While a turn-based approach may suffice for text input, voice-to-voice interaction benefits from a system that processes both input and output quickly and continuously. Previous systems processed input and output in separate, sequential stages, making users wait. Moshi’s continuous, overlapping processing delivers seamless interactivity.
We’re thinking: Generating silence is golden!