A new system marks a step forward in converting text to speech: It’s fast at inference, reduces word errors, and provides some control over the speed and inflection of generated speech.
What’s new: Yi Ren, Yangjun Ruan, and their co-authors at Zhejiang University and Microsoft propose FastSpeech, a text-to-speech system that processes text sequences in parallel rather than piece by piece.
Key insight: Previous models generate the sound representation of each phoneme, or unit of sound, sequentially. This so-called autoregressive approach lets the model base each phoneme's sound on those that came before, so the output can flow like natural speech. But it also limits how fast the model can generate output. Instead, FastSpeech uses a duration predictor that determines the length of each phoneme. Knowing durations ahead of time allows the model to generate phoneme representations independently and in parallel, yielding much faster operation while maintaining the flow.
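To make the length-regulation idea concrete, here is a minimal sketch in PyTorch: each phoneme's hidden vector is repeated for as many mel-spectrogram frames as its predicted duration covers. The function name, tensor shapes, and hidden size are illustrative assumptions, not the paper's code.

```python
import torch

def length_regulate(phoneme_hiddens, durations):
    """Expand per-phoneme hidden vectors to align with mel-spectrogram frames.

    phoneme_hiddens: (num_phonemes, hidden_dim) tensor, one row per phoneme.
    durations: (num_phonemes,) integer tensor of predicted frame counts.
    Returns a (sum(durations), hidden_dim) tensor, one row per mel frame.
    """
    return torch.repeat_interleave(phoneme_hiddens, durations, dim=0)

# Example: 3 phonemes expanded to 2 + 4 + 1 = 7 mel-spectrogram frames.
hiddens = torch.randn(3, 256)
durations = torch.tensor([2, 4, 1])
frames = length_regulate(hiddens, durations)
print(frames.shape)  # torch.Size([7, 256])
```

Because every row of the expanded sequence is known up front, the frames can be computed in parallel rather than one at a time.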
How it works: Neural text-to-speech models typically generate a mel-spectrogram that represents the frequency spectrum of spoken words. FastSpeech generates mel-spectrograms using a variant of the transformer known as a feed-forward transformer (abbreviated FFT, but not to be confused with a fast Fourier transform); a rough sketch of the pipeline follows the list below.
- The model starts by converting words into the phonemes that make them up. A trainable embedding layer transforms the phonemes into vectors.
- The first of two FFT stacks applies self-attention to find relationships among the phonemes and produces a hidden representation of each one.
- The duration predictor (trained on durations extracted from a separate pretrained autoregressive text-to-speech model) estimates how long each phoneme lasts in spoken form. A length regulator expands the first FFT stack's output to match the predicted durations.
- A second FFT stack refines the expanded sequence, and a linear layer projects it to the final mel-spectrogram.
- The WaveGlow speech synthesizer produces speech from the final mel-spectrogram.
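The pipeline can be summarized in a rough PyTorch sketch. This is not the authors' implementation: standard TransformerEncoder layers stand in for the paper's FFT blocks (which pair self-attention with 1D convolutions rather than a fully connected sublayer), the duration-rounding rule and all layer sizes are placeholder assumptions, and batching is limited to a single utterance for simplicity.

```python
import torch
import torch.nn as nn

class FastSpeechSketch(nn.Module):
    """Outline of the FastSpeech pipeline: phoneme embedding -> first FFT stack
    -> duration prediction and length regulation -> second FFT stack -> linear
    projection to mel-spectrogram frames."""

    def __init__(self, num_phonemes=80, hidden_dim=256, mel_bins=80, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=2, dim_feedforward=1024, batch_first=True)
        # Stand-ins for the two FFT stacks (the paper uses self-attention plus
        # 1D convolutions instead of the standard feed-forward sublayer).
        self.fft_phoneme = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fft_mel = nn.TransformerEncoder(encoder_layer, num_layers)
        # Duration predictor: one scalar (number of mel frames) per phoneme.
        self.duration_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.to_mel = nn.Linear(hidden_dim, mel_bins)

    def forward(self, phoneme_ids):
        # phoneme_ids: (1, num_phonemes) integer tensor for a single utterance.
        hidden = self.fft_phoneme(self.embedding(phoneme_ids))
        # Predict how many mel frames each phoneme should span (at least one).
        durations = self.duration_predictor(hidden).squeeze(-1)
        durations = torch.clamp(torch.round(durations), min=1).long()
        # Length regulation: repeat each phoneme's hidden state by its duration.
        expanded = torch.repeat_interleave(hidden[0], durations[0], dim=0).unsqueeze(0)
        mel_hidden = self.fft_mel(expanded)
        return self.to_mel(mel_hidden)  # (1, total_frames, mel_bins)

model = FastSpeechSketch()
mel = model(torch.randint(0, 80, (1, 12)))  # 12 phonemes in, mel frames out
print(mel.shape)
```

A vocoder such as WaveGlow would then turn the predicted mel-spectrogram into a waveform, as described in the last step above.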
Results: Using the LJSpeech dataset for training and evaluation, FastSpeech was 270 times faster at generating mel-spectrograms than a transformer-based autoregressive system, and 38 times faster at generating speech output, with audio quality nearly as good. The generated speech was free of repetitions and omissions.
Why it matters: LSTMs and other autoregressive models have boosted accuracy in generating text and speech. This work highlights an important trend toward research into faster alternatives that don’t sacrifice output quality.
We’re thinking: In the long run, end-to-end systems that synthesize the output audio directly are likely to prevail. Until then, approaches like FastSpeech still have an important role.