A telephony provider. A transcription service. An LLM in some cloud region. A text-to-speech engine somewhere else. Each connection adds latency.
This results in awkward pauses that immediately signal "I'm talking to a bot."

We've run voice infrastructure for over a decade, processing billions of call minutes for hospitals, financial institutions, and global tech platforms. What we've found is that the biggest bottleneck in voice AI isn't the models themselves; it's the network hops between the services that run them.
Humans respond within 200 milliseconds in conversation. When response time exceeds 300-500ms, conversations feel unnatural. Above 1200ms? Users hang up.
Here's where most platforms fail. Each service in the chain adds 20-50ms of network delay before any AI processing happens.
A typical call flow:
- Caller audio enters through the telephony provider
- Audio streams out to a transcription (STT) service
- The transcript is sent to the LLM
- The LLM's reply is sent to a TTS engine
- Synthesized audio returns through the telephony provider

That's 250ms minimum in network hops alone.
Now add STT processing (100-300ms), LLM inference (350-1000ms), and TTS synthesis (90-200ms). You're at 800ms to 1.5 seconds total.
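The arithmetic above can be sketched as a simple latency budget. This is an illustration, not a measurement: the stage ranges come from the figures in this post, and the five-hop network model is an assumption.

```python
# Rough latency budget for a chained voice AI pipeline.
# Stage ranges mirror the figures in the text; the five-hop
# network model is an assumed topology for illustration.

NETWORK_HOPS = 5            # telephony -> STT -> LLM -> TTS -> telephony
HOP_MS = (20, 50)           # per-hop network delay (best, worst)

STAGES_MS = {
    "stt": (100, 300),
    "llm": (350, 1000),
    "tts": (90, 200),
}

def budget(best=True):
    """Total turn latency in ms for the best or worst case."""
    idx = 0 if best else 1
    network = NETWORK_HOPS * HOP_MS[idx]
    return network + sum(r[idx] for r in STAGES_MS.values())

print(f"best case:  {budget(best=True)} ms")   # 640 ms
print(f"worst case: {budget(best=False)} ms")  # 1750 ms
```

Even the best case here already sits above the 300-500ms threshold where conversation starts to feel unnatural.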
Research shows customers hang up 40% more frequently when agents take longer than 1 second to respond.
Contact centers report lower satisfaction scores when delays exceed 500ms.
Recent benchmarks confirm the impact.

Twilio's voice channel shows average latency of 950ms, reflecting the overhead of extensive carrier integrations.
Vonage faces similar challenges, with latency ranging from 800-1200ms.
Most AI platforms treat telephony as an afterthought: something you bolt on via a third-party CPaaS provider, which adds yet another layer of hops.
Base telephony latency within the same region is around 200ms. For global calls (Asia to US), that jumps to 500ms just for audio transport. If your phone number is registered in a different region, you're adding even more hops as the call routes through your "home" country's network.
When you chain together separate services for each step, delays stack up and unpredictability increases.
When one service hits rate limits or experiences a regional outage, the entire call fails.
Because different vendors handle different parts of the stack, troubleshooting becomes a finger-pointing exercise across multiple dashboards.
Instead of stitching together services, we co-located our GPU infrastructure with our telephony network in the same data centers.
Audio from an incoming call hits our transcription models, LLM inference, and text-to-speech engines without leaving our private network.
This means there are zero external API calls, no cross-cloud data transfer, and no unpredictable jitter from the public internet.
This enables a response time of less than one second from the moment a user finishes speaking until they hear our reply.
We've deployed this in our US, Australian, and European (Paris) regions, with expansion to MENA underway.

Each region gets dedicated GPU clusters positioned directly adjacent to our telephony core.
We are also a licensed carrier in 30+ markets and operate a private MPLS fiber backbone connecting 17 global points of presence.
So when you deploy a voice AI agent with us, audio enters through our own PSTN connections, is processed by regional GPUs, and returns through the same network.
A private network solution like this can offer up to 40% reduction in call setup times and improved audio quality in challenging network environments.
With no vendor handoffs, we also get a single observability plane, from RTP packets to model outputs.
Compare this to platforms that rent infrastructure. They're routing audio through a CPaaS provider's gateway, public internet, a cloud provider's compute region, then another CPaaS hop back.
Each hop introduces jitter, potential packet loss, and variable latency you can't control.
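You can observe this variability yourself by sampling round-trip times and computing jitter. A minimal sketch using the smoothed interarrival-jitter estimator from RFC 3550 (the RTP spec); the sample values are made up for illustration:

```python
# RFC 3550-style smoothed jitter from a list of RTT samples (ms).
# The sample values below are invented to illustrate a spiky path.

def interarrival_jitter(rtts_ms):
    """Smoothed jitter estimate: J += (|D| - J) / 16 per RFC 3550."""
    jitter = 0.0
    for prev, cur in zip(rtts_ms, rtts_ms[1:]):
        d = abs(cur - prev)
        jitter += (d - jitter) / 16.0
    return jitter

samples = [48, 52, 47, 95, 51, 49, 120, 50]  # ms, public-internet path
print(f"jitter estimate: {interarrival_jitter(samples):.1f} ms")
```

On a private backbone, the same estimator over real samples should stay in the low single digits; occasional 50-70ms spikes like the ones above are what make public-internet audio paths unpredictable.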
If you're building demos or prototyping, latency might not matter yet. But for production voice AI agents handling real interactions at scale, it determines whether users trust the system.
In customer support, 800ms of lag causes users to talk over the agent, which breaks intent recognition and forces conversation loops.
For multi-turn workflows like appointment booking, the pauses make the agent feel slow, leading to abandoned calls.
This is especially critical in healthcare and finance, where sensitive data is handled and trust is paramount but easily eroded by delays. If an agent pauses even half a second before acknowledging a request to check an account balance or verify a transaction, user trust is already slipping.
If you're evaluating voice AI platforms, here's what we'd benchmark: end-to-end turn latency, measured from the moment the user finishes speaking until they hear the reply.
We've measured latencies ranging from 200ms (co-located stacks like ours) to 1500ms+ (platforms relying on multiple third-party services). The difference shows up immediately in user behavior.
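A minimal harness for that measurement might look like the following. Here `run_turn` is a hypothetical stand-in for whatever sends one utterance to the platform under test and blocks until the first byte of reply audio arrives:

```python
import statistics
import time

def benchmark_turn_latency(run_turn, trials=20):
    """Time end-of-utterance -> first reply audio over several trials.

    run_turn: callable that sends one utterance and returns when the
    first byte of reply audio is received (platform-specific).
    """
    samples_ms = []
    for _ in range(trials):
        start = time.monotonic()
        run_turn()
        samples_ms.append((time.monotonic() - start) * 1000.0)
    samples_ms.sort()
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[max(0, int(0.95 * len(samples_ms)) - 1)],
        "max_ms": samples_ms[-1],
    }

# Example against a fake platform that "responds" in ~50ms:
result = benchmark_turn_latency(lambda: time.sleep(0.05), trials=5)
print(result)
```

Report percentiles rather than averages: a platform with a good mean but a long tail will still produce the dead-air moments that make users hang up.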
Sub-second round-trip latency is the difference between a system users trust and one they hang up on.
We built the only stack that delivers it consistently because we own every layer from the PSTN to the GPU.
How are you benchmarking latency in your own stack?