Systems that translate between spoken languages typically take an intermediate step: they translate the speech into text, then synthesize speech from that text. A new approach shows that neural networks can translate speech directly, without ever representing the words as text.
What’s new: Researchers at Google built a system that performs speech-to-speech language translation based on an end-to-end model. Their approach not only translates, it does so in a rough facsimile of the speaker’s voice. You can listen to examples here.
How it works: Known as Translatotron, the system has three main components:
- An attentive sequence-to-sequence model takes spectrograms as input and generates spectrograms in the target language.
- A neural vocoder converts the output spectrograms into audio waveforms.
- A pre-trained speaker encoder maintains the character of the speaker’s voice.
Translatotron was trained end-to-end on a large corpus of matched spoken phrases in Spanish and English, with phoneme transcripts providing an auxiliary training signal.
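To make the architecture concrete, here is a minimal sketch of how the three pieces could fit together, written in PyTorch. It is an illustration, not Google’s implementation: all layer types, sizes, and names are assumptions, and a single linear layer stands in for the neural vocoder.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps source speech to a fixed-size embedding that conditions the
    decoder on the speaker's voice (pre-trained in the real system)."""
    def __init__(self, n_mels=80, d_embed=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, d_embed, batch_first=True)

    def forward(self, spec):                       # spec: (batch, frames, n_mels)
        _, (h, _) = self.lstm(spec)
        return h[-1]                               # (batch, d_embed)

class SpectrogramTranslator(nn.Module):
    """Attentive sequence-to-sequence model: source-language spectrogram in,
    target-language spectrogram out."""
    def __init__(self, n_mels=80, d_model=256, d_embed=256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, d_model, batch_first=True)
        self.attend = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decoder = nn.LSTM(n_mels + d_embed, d_model, batch_first=True)
        self.project = nn.Linear(2 * d_model, n_mels)

    def forward(self, src_spec, tgt_spec, spk_embed):
        memory, _ = self.encoder(src_spec)         # encode source frames
        # Teacher forcing: feed target frames, conditioned on the speaker
        # embedding at every step so the output keeps the speaker's voice.
        spk = spk_embed.unsqueeze(1).expand(-1, tgt_spec.size(1), -1)
        dec, _ = self.decoder(torch.cat([tgt_spec, spk], dim=-1))
        ctx, _ = self.attend(dec, memory, memory)  # attend over source frames
        return self.project(torch.cat([dec, ctx], dim=-1))

# Placeholder vocoder: maps each 80-bin spectrogram frame to 200 waveform
# samples. The real system uses a neural vocoder here.
vocoder = nn.Linear(80, 200)

src = torch.randn(2, 120, 80)   # Spanish spectrograms (batch, frames, mels)
tgt = torch.randn(2, 100, 80)   # English spectrograms (teacher-forcing target)
spk = SpeakerEncoder()(src)
pred = SpectrogramTranslator()(src, tgt, spk)     # (2, 100, 80)
loss = nn.functional.mse_loss(pred, tgt)          # end-to-end spectrogram loss
audio = vocoder(pred)                             # (2, 100, 200) waveform chunks
```

Note that nothing in the loop above ever produces text: the loss is computed directly between predicted and reference spectrograms, which is what makes the approach end-to-end.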
Why it matters: The architecture devised by Ye Jia, Ron J. Weiss, and their colleagues offers a number of advantages:
- It retains the speaker’s vocal characteristics in the spoken output.
- It doesn't trip over words that require no translation, such as proper names.
- It delivers faster translations, since it eliminates a decoding step.
- Training end-to-end eliminates errors that can compound across separate speech-to-text and text-to-speech stages (see the arithmetic sketch after this list).
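To see why compounding matters, consider a back-of-envelope calculation; the per-stage success rates below are hypothetical, not figures from the paper.

```python
# Hypothetical per-stage success rates for a cascaded pipeline. If errors
# occur independently, the stages' success rates multiply.
asr, mt, tts = 0.90, 0.92, 0.95
cascade = asr * mt * tts
print(f"Cascade success rate: {cascade:.2f}")  # 0.79, worse than any single stage
```

A single end-to-end model sidesteps this multiplication because there are no intermediate outputs for errors to propagate through.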
Results: The end-to-end system scores slightly below a conventional cascaded baseline on Spanish-to-English translation. But it produces more realistic audio than previous systems and plants a stake in the ground for the end-to-end approach.
The hitch: Training it requires an immense corpus of matched phrases. That may not be so easy to come by, depending on the languages you need.
Takeaway: Automatic speech-to-speech translation is a sci-fi dream come true. Google’s work suggests that such systems could become faster and more accurate before long.