
Building low-latency voice assistants: Expert insights

Learn about the challenges and solutions behind building low-latency voice assistants from Telnyx engineers in our recent webinar.

By Buket Kusoglu

Developing a voice assistant that responds in real time isn’t just about crafting engaging dialogue. It’s about mastering the technology that makes every millisecond count. Low latency is the backbone of seamless voice interactions, where delays can break the illusion of natural conversation.

In a recent Telnyx webinar, our engineers shared strategies and insights for building voice assistants that are fast, reliable, and scalable. By integrating LLMs with real-time transcription, response generation, and speech synthesis, they built a voice assistant (VA) that responds to users in under one second.

If you missed the live session, don’t worry. You can watch the recording below, and the rest of this article digs into the technical details and live demonstrations.


Why latency matters

Imagine asking a voice assistant for information and waiting several seconds for a response. This delay disrupts the conversational flow and leads to user frustration and disengagement. To deliver a truly seamless experience, our team set an ambitious goal: Reduce the Time to First Audio (TTFA)—the time it takes for the assistant to respond—to under 1000ms.
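To make the target concrete, here’s a minimal sketch of how you might instrument TTFA yourself, assuming you can hook into the moment your pipeline decides the caller has finished speaking and the moment the first synthesized audio frame leaves your server. The class and hook names are illustrative, not part of any Telnyx API.

```python
import time

class TTFAMeter:
    """Illustrative timer for Time to First Audio (TTFA): the gap between
    the end of the caller's speech and the first audio frame of the reply."""

    def __init__(self):
        self._speech_ended_at = None

    def mark_end_of_speech(self):
        # Call this when your endpointing/VAD logic decides the user is done.
        self._speech_ended_at = time.monotonic()

    def mark_first_audio(self):
        # Call this when the first synthesized audio frame is sent back.
        if self._speech_ended_at is None:
            return None
        return (time.monotonic() - self._speech_ended_at) * 1000


# Usage (hypothetical hooks into your media pipeline):
meter = TTFAMeter()
meter.mark_end_of_speech()          # fired by the VAD / endpointing step
# ... transcription -> LLM -> TTS ...
print(f"TTFA: {meter.mark_first_audio():.0f} ms")  # target: < 1000 ms
```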

While the initial iteration recorded a TTFA of 6–10 seconds, the team reduced it to an impressive 900ms by leveraging open-source solutions. Now let’s talk about how they did it.

The path to low latency

Our solution evolved through several stages, combining innovative approaches with advanced tools:

  • Leveraging Telnyx APIs: Tools like Call Transcription, Chat Completion, and Speak Command enabled rapid development and integration, allowing us to create a voice assistant framework developers could replicate.
  • Optimizing transcription and response: Upgrades like Distil-Whisper for transcription and better LLM hardware reduced delays. Streaming LLM outputs allowed us to process data incrementally, shaving seconds off response times (see the sketch after this list).
  • Real-time media processing: Directly integrating with media servers provided granular control and helped address challenges like high latency and real-time adjustments.
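
The biggest single win came from not waiting for the full LLM completion before starting speech synthesis. Below is a minimal sketch of that idea, assuming an OpenAI-compatible chat completion endpoint; the base URL, model name, and speak() callback are placeholders, not the actual Telnyx Chat Completion integration.

```python
import re
from openai import OpenAI  # any OpenAI-compatible chat endpoint works here

# Placeholder client: point base_url/api_key at whatever inference service you use.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

def speak(text: str) -> None:
    """Placeholder: hand a sentence to your TTS / Speak step as soon as it's ready."""
    print(f"[TTS] {text}")

def stream_reply(messages: list[dict]) -> None:
    """Stream tokens and flush complete sentences to TTS instead of waiting
    for the full completion -- this is where most of the TTFA savings come from."""
    buffer = ""
    stream = client.chat.completions.create(
        model="YOUR_MODEL",      # placeholder model name
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        buffer += delta
        # Flush whenever at least one full sentence is sitting in the buffer.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            speak(sentence)
        buffer = parts[-1]
    if buffer.strip():
        speak(buffer)

stream_reply([{"role": "user", "content": "What's the weather like on Mars?"}])
```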

But latency is just one part of the equation when building an effective, efficient VA.

User experience comes first

Creating a truly effective voice assistant also means addressing user interaction challenges. Our engineering team focused on two critical aspects: interruption handling and noise management.

1. Interruption handling

Users often pause mid-sentence or want to interrupt the assistant. Early iterations struggled with this problem, either cutting users off prematurely or failing to stop when interrupted. To solve it, we built a machine-learning model that distinguishes natural mid-sentence pauses from the true end of speech. This fix ensured smoother interactions, even when users needed a moment to think or rephrase.
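
The production solution is a trained model, but a simplified, rule-based stand-in shows the shape of the decision: hold off while silence is short, respond once silence is long, and in the grey zone lean on whether the partial transcript looks finished. The thresholds and filler-word list below are illustrative assumptions, not values from the webinar.

```python
# Illustrative thresholds -- the real system uses a trained model, not rules.
SHORT_PAUSE_S = 0.4   # silence short enough to be a pause, not end of turn
END_OF_TURN_S = 1.2   # silence long enough to treat as end of speech
FILLERS = ("um", "uh", "so", "and", "but")

def looks_unfinished(partial_transcript: str) -> bool:
    """Very rough proxy for 'the user is still thinking'."""
    text = partial_transcript.strip().lower()
    return text.endswith(FILLERS) or text.endswith(",")

def should_respond(silence_s: float, partial_transcript: str) -> bool:
    """Decide whether the assistant should start replying."""
    if silence_s < SHORT_PAUSE_S:
        return False                      # user is still speaking
    if silence_s >= END_OF_TURN_S:
        return True                       # long silence: treat as end of turn
    # In the grey zone, wait longer if the transcript looks unfinished.
    return not looks_unfinished(partial_transcript)

print(should_respond(0.6, "Book me a flight to, um"))     # False: keep listening
print(should_respond(0.6, "Book me a flight to Boston."))  # True: go ahead and reply
```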

2. Noise management

In noisy environments, background sounds can interfere with a voice assistant's functionality. By integrating the Silero Voice Activity Detection (VAD) model, we accurately identified when users were speaking versus when background noise was present. This solution allowed for a more robust and reliable interaction, ensuring that the assistant only responded to intentional input.
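
For reference, here’s a minimal sketch of streaming VAD with the Silero model loaded via torch.hub, based on the usage documented in the snakers4/silero-vad repository; double-check the helper signatures and chunk size against the version you install. The file name is a placeholder, and this is not the exact integration from the webinar.

```python
import torch

# Load the Silero VAD model and helpers (per the snakers4/silero-vad README).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

SAMPLE_RATE = 16000
vad = VADIterator(model, sampling_rate=SAMPLE_RATE)

# Offline check on a whole recording: where did the caller actually speak?
wav = read_audio("caller_audio.wav", sampling_rate=SAMPLE_RATE)
print(get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE))

# Streaming check: feed 512-sample chunks (32 ms at 16 kHz) as they arrive
# and only forward audio to transcription while speech is active.
for i in range(0, len(wav), 512):
    chunk = wav[i : i + 512]
    if len(chunk) < 512:
        break
    event = vad(chunk, return_seconds=True)
    if event:  # {'start': ...} or {'end': ...} when speech begins or ends
        print(event)
```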

These improvements enhanced the assistant's reliability and demonstrated the transformative potential of Voice API and Voice AI technologies in addressing real-world challenges. But we couldn’t have done it without the right tools.

Powered by Elixir and Membrane

Our choice to use Elixir and the Membrane Framework was intentional. These tools provided a scalable and flexible foundation that allowed us to build a high-performance voice assistant while maintaining simplicity in implementation. The result is a system that’s robust, efficient, and developer-friendly.

Elixir and Membrane provide a strong technical backbone, but creating a truly effective voice assistant requires tools that are built for scale and simplicity—like those from Telnyx.

Build smarter, faster voice assistants with Telnyx

Building a low-latency voice assistant requires cutting-edge technology, as well as a thoughtful balance between speed, accuracy, and user-centric design. As we’ve explored, reducing latency ensures smooth, natural interactions. And prioritizing user experience turns technical efficiency into lasting satisfaction. By combining these principles, you can create a voice assistant that exceeds modern expectations.

The journey to low latency is challenging, but the right tools and insights can make all the difference. Whether you’re fine-tuning network paths or optimizing real-time audio processing, every decision contributes to the overall success of your solution. Staying focused on both technical precision and end-user needs will help you deliver an assistant that feels truly intuitive.

At Telnyx, we specialize in helping businesses build smarter, faster voice solutions. Our Voice AI tools and Voice API are designed to reduce latency, enhance clarity, and simplify implementation. With our global private network, you’ll gain the speed, reliability, and scalability needed to stay ahead in the voice technology space. If you’re ready to elevate your voice assistant with industry-leading tools and expertise, Telnyx is here to help you every step of the way.


Contact our team to power your voice solutions with real-time technology from Telnyx. Or check out the full webinar recording for a step-by-step guide to building your own low-latency voice assistant.