Models that achieve state-of-the-art performance in automatic speech recognition (ASR) often perform poorly on nonstandard speech. New research offers methods to make ASR more useful to users with heavy accents or speech impairment.
What’s new: Researchers at Google fine-tuned ASR neural networks on a data set of heavily accented speakers and, separately, on a data set of speakers with amyotrophic lateral sclerosis (ALS), which slurs speech to varying degrees. Their analysis shows marked improvement in model performance, and the errors that remain are consistent with those associated with typical speech.
Key insight: Fine-tuning a small number of layers closest to the input of an ASR network yields good performance on atypical speech. This contrasts with typical transfer-learning scenarios, where training and test inputs are similar but the output labels differ, so learning proceeds by fine-tuning the layers closest to the output. Here the situation is reversed: the output vocabulary stays the same while the input audio distribution shifts, so the layers nearest the input are the ones that need to adapt.
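The idea can be illustrated with a minimal PyTorch sketch that freezes a pretrained model and re-enables gradients only for the layers nearest the audio input. The toy model, layer names, and optimizer settings below are hypothetical stand-ins, not the authors' RNN-T or LAS configurations.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained ASR network (hypothetical; not the authors' code).
class ToyASRModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab=64):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, hidden)   # layer closest to the audio input
        self.encoder = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.output_proj = nn.Linear(hidden, vocab)   # layer closest to the text output

    def forward(self, feats):
        x = self.input_proj(feats)
        x, _ = self.encoder(x)
        return self.output_proj(x)

model = ToyASRModel()
# In practice, load pretrained weights here, e.g. model.load_state_dict(torch.load(...)).

# Freeze everything, then unfreeze only the parameters nearest the input.
for p in model.parameters():
    p.requires_grad = False
for p in model.input_proj.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```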
How it works: Joel Shor and colleagues used data from the L2-ARCTIC data set for accented speech and ALS speaker data from the ALS Therapy Development Institute. They experimented with two pre-trained neural models, RNN-Transducer (RNN-T) and Listen-Attend-Spell (LAS).
- The authors fine-tuned both models on the two data sets with relatively modest resources (four GPUs over four hours) and measured test-set performance after fine-tuning on varying amounts of new data, as sketched below.
- They compared the sources of error in the fine-tuned models with those of models trained on typical speech only.
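A hedged sketch of such an adaptation loop, using a generic CTC-trained stand-in on dummy tensors rather than the authors' RNN-T or LAS models; every shape, loss, and hyperparameter below is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical adaptation loop (not the authors' training code): fine-tune a small
# stand-in ASR model on a handful of atypical-speech utterances.
torch.manual_seed(0)
n_mels, hidden, vocab = 80, 256, 64

model = nn.Sequential(                        # stand-in for a pretrained encoder + output head
    nn.Linear(n_mels, hidden), nn.ReLU(), nn.Linear(hidden, vocab)
)
# In the spirit of the paper, one would first unfreeze only the input-proximal layers.
criterion = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch: 8 utterances of 200 frames each, with short integer transcripts.
feats = torch.randn(8, 200, n_mels)
targets = torch.randint(1, vocab, (8, 20))
input_lens = torch.full((8,), 200, dtype=torch.long)
target_lens = torch.full((8,), 20, dtype=torch.long)

for step in range(100):                       # a short loop stands in for brief fine-tuning
    log_probs = model(feats).log_softmax(-1).transpose(0, 1)   # CTC expects (time, batch, vocab)
    loss = criterion(log_probs, targets, input_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```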
Results: RNN-T achieved lower word error rates than LAS, and both substantially outperformed the Google Cloud ASR model for severe slurring and heavily accented speech. (The three models were closer with respect to mild slurring, though RNN-T held its edge.) Fine-tuning on 15 minutes of speech for accents and 10 minutes for ALS brought 70 to 80 percent of the improvement.
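Word error rate, the metric behind these comparisons, is the word-level edit distance between a model's transcript and the reference, divided by the number of reference words. A plain-Python illustration (not tied to the paper's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words (substitutions + insertions + deletions),
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion against a five-word reference -> WER of 0.4
print(word_error_rate("please turn the lights on", "pleas turn the lights"))
```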
Why it matters: The ability to understand and act upon data from atypical users is essential to making the benefits of AI available to all.
Takeaway: With reasonable resources and additional data, existing state-of-the-art ASR models can be adapted fairly easily for atypical users. Whether transfer learning can be used to adapt other types of models for broader accessibility is an open question.