What I’ve Learned Building Voice Applications
Best practices for building apps based on AI’s evolving voice-in, voice-out stack

Diagram comparing direct audio generation with a foundation model vs. a voice pipeline using STT, LLM, and TTS.

Dear friends,

The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future letters.

Foundation models that are trained to directly take in, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s Realtime API makes it easy for developers to write prompts and build systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where an occasional mistake is okay. I encourage you to try it!
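To make this concrete, here is a minimal sketch of prompting a voice-in, voice-out session over the Realtime API’s WebSocket interface. The endpoint, headers, and event names (“session.update”, “response.create”) are assumptions based on the public documentation at the time of writing and may change, so treat this as a rough starting point rather than a reference.

```python
# Rough sketch: prompting a voice-in, voice-out session via OpenAI's Realtime API.
# The URL, headers, and event names below are assumptions and may change;
# consult the latest documentation before relying on them.
import asyncio, json, os
import websockets  # pip install websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older versions of the websockets package call this kwarg extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # One prompt ("instructions") shapes the entire spoken conversation.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": "You are a friendly tutor. Keep spoken answers brief."},
        }))
        # Ask the model to speak; audio and transcripts stream back as events.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            print(json.loads(message).get("type"))

asyncio.run(main())
```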

However, compared to text-based generation, it is still hard to control the output of voice-in, voice-out models. When we use an LLM to generate text, in contrast to generating audio directly, we have many tools for building guardrails, and we can double-check the output before showing it to users. We can also use sophisticated agentic reasoning workflows to compute high-quality outputs. Before a customer-service agent shows a user the message, “Sure, I’m happy to issue a refund,” we can make sure that (i) issuing the refund is consistent with our business policy and (ii) we actually call the API to issue the refund (rather than just promising a refund without issuing it).
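To illustrate this kind of text-stage guardrail, here is a minimal sketch. Every helper in it (draft_reply, is_refund_eligible, issue_refund, send_to_user) is a hypothetical placeholder for your own LLM call, policy check, payments backend, and delivery step; none of them come from a real library.

```python
# Sketch of a text-stage guardrail: inspect, and if needed act on, a drafted
# reply before the user ever sees it. The helpers below are hypothetical
# placeholders for your LLM call, policy check, payments backend, and channel.

def draft_reply(user_message: str, constraint: str | None = None) -> str:
    ...  # call your LLM, optionally with an extra constraint in the prompt

def is_refund_eligible(order_id: str) -> bool:
    ...  # check the order against your refund policy

def issue_refund(order_id: str) -> None:
    ...  # call your payments backend

def send_to_user(text: str) -> None:
    ...  # deliver the reply (in a voice app, pass it to TTS)

def respond_to_customer(user_message: str, order_id: str) -> None:
    draft = draft_reply(user_message)  # still just text at this point

    if "refund" in draft.lower():
        if not is_refund_eligible(order_id):
            # (i) The draft conflicts with business policy: revise it before showing it.
            draft = draft_reply(
                user_message,
                constraint="Do not promise a refund; explain the policy.",
            )
        else:
            # (ii) Actually perform the action we are about to promise.
            issue_refund(order_id)

    send_to_user(draft)
```

The key point is that the reply stays in text, where it can be inspected and revised, until any promised action has actually been taken.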

In contrast, the tools to prevent a voice-in, voice-out model from making such mistakes are much less mature.

In my experience, the reasoning capability of voice models also seems inferior to that of text-based models, and they give less sophisticated answers. (Perhaps this is because voice responses have to be briefer, leaving less room for chain-of-thought reasoning to arrive at a more thoughtful answer.)

When building applications where I need a high degree of control over the output, I use agentic workflows to reason at length about the user’s input. In voice applications, this means I end up using a pipeline that includes speech-to-text (STT, also known as ASR, or automatic speech recognition) to transcribe the user’s words, then processes the text using one or more LLM calls, and finally returns an audio response to the user via TTS (text-to-speech). This STT → LLM/Agentic workflow → TTS pipeline, where the reasoning is done in text, allows for more accurate responses.
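Here is a minimal sketch of such a pipeline, using OpenAI’s Python SDK for all three stages. The model names (“whisper-1”, “gpt-4o”, “tts-1”) are illustrative assumptions, and the single chat call stands in for a fuller agentic workflow; any STT, LLM, and TTS components can be swapped in.

```python
# Sketch of the STT -> LLM -> TTS pipeline, using OpenAI's Python SDK for all
# three stages. Model names are illustrative assumptions; any STT, LLM/agentic,
# and TTS components can be swapped in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(audio_path: str, out_path: str = "reply.mp3") -> str:
    # 1. Speech-to-text: transcribe the user's words.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Text reasoning: one LLM call here, but this is where a full agentic
    #    workflow (and any guardrails) would run.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer briefly, as if speaking aloud."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3. Text-to-speech: synthesize the reply for playback to the user.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    with open(out_path, "wb") as out:
        out.write(speech.read())

    return reply_text
```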

However, this process introduces latency, and users of voice applications are very sensitive to latency. When DeepLearning.AI worked with RealAvatar (an AI Fund portfolio company led by Jeff Daniel) to build an avatar of me, we found that getting TTS to generate a voice that sounded like me was not very hard, but getting it to respond to questions using words similar to those I would choose was. Even after a year of tuning our system — starting with iterating on multiple, long, mega-prompts and eventually developing complex agentic workflows — it remains a work in progress. You can play with it here.

Initially, this agentic workflow incurred 5-9 seconds of latency, and having users wait that long for responses led to a bad experience. To address this, we came up with the following latency-reduction technique. The system immediately generates a pre-response (short for preliminary response) that can be uttered quickly, which buys time for the agentic workflow to generate a more thoughtful, full response. (We’re grateful to LiveKit’s CEO Russ d’Sa and team for helping us get this working.) This is similar to how, if you were to ask me a complicated question, I might say “Hmm, let me think about that” or “Sure, I can help with that” — that’s the pre-response — while thinking about what my full response might be.
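The control flow can be as simple as running the two responses concurrently. In the sketch below, speak_async and run_agentic_workflow are hypothetical stand-ins for your TTS playback and your slower reasoning pipeline, implemented here as toy stubs so the example runs on its own.

```python
# Sketch of the pre-response pattern: speak a quick acknowledgement while the
# slower agentic workflow computes the full answer in the background.
import asyncio
import random

PRE_RESPONSES = [
    "Hmm, let me think about that.",
    "Sure, I can help with that.",
]

async def run_agentic_workflow(user_text: str) -> str:
    await asyncio.sleep(6)  # toy stand-in for a 5-9 second agentic workflow
    return f"Here's my considered answer to: {user_text}"

async def speak_async(text: str) -> None:
    print(text)  # toy stand-in for streaming TTS audio to the user

async def answer(user_text: str) -> None:
    # Start the slow, thoughtful response immediately.
    full_task = asyncio.create_task(run_agentic_workflow(user_text))

    # Speak a short acknowledgement right away; this hides most of the latency.
    await speak_async(random.choice(PRE_RESPONSES))

    # Speak the full response as soon as it is ready.
    await speak_async(await full_task)

asyncio.run(answer("Can you explain how transformers work?"))
```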

I think generating a pre-response followed by a full response, which quickly acknowledges the user’s query and also reduces the perceived latency, will be an important technique, and I hope many teams find it useful. Our goal was to approach human face-to-face conversational latency, which is around 0.3-1 seconds. Through the pre-response and other optimizations, RealAvatar and DeepLearning.AI have reduced the system’s latency to around 0.5-1 seconds.

Months ago, sitting in a coffee shop, I was able to buy a phone number on Twilio and hook it up to an STT → LLM → TTS pipeline in just hours. This enabled me to talk to my own LLM using custom prompts. Prototyping voice applications is much easier than most people realize!
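As one way to wire this up, here is a rough sketch of that kind of prototype: a Twilio phone number pointed at a small Flask webhook. Twilio’s built-in speech recognition (Gather with speech input) and Say verb stand in for the STT and TTS stages, and generate_reply is a hypothetical wrapper around your custom-prompted LLM call.

```python
# Rough sketch: a Twilio phone number wired to a speech -> LLM -> speech loop
# via a small Flask webhook. Point the number's voice webhook at /voice.
from flask import Flask, request
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

def generate_reply(user_text: str) -> str:
    ...  # hypothetical: call your LLM with a custom prompt and return its answer

@app.route("/voice", methods=["POST"])
def voice():
    # Greet the caller and listen for a spoken question.
    response = VoiceResponse()
    gather = Gather(input="speech", action="/respond", method="POST")
    gather.say("Hi! Ask me anything.")
    response.append(gather)
    return str(response)

@app.route("/respond", methods=["POST"])
def respond():
    # Twilio posts the transcribed speech back as SpeechResult.
    user_text = request.form.get("SpeechResult", "")
    reply = generate_reply(user_text)
    response = VoiceResponse()
    response.say(reply)
    response.redirect("/voice")  # keep the conversation going
    return str(response)
```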

Building reliable, scaled production applications takes longer, of course, but if you have a voice application in mind, I hope you’ll start building prototypes and see how far you can get! I’ll keep building voice applications and sharing best practices and voice-related technology trends in future letters.

Keep building!

Andrew 
