Most teams treat inference latency as a model problem. They quantize, prune, and distill, then their voice agent still feels slow, their fraud system still misses the window, and users still bounce.
Inference latency is a systems problem, not just a model problem. The model is one of three layers, and for real-time AI it is rarely the layer with the most latency to give back.
This guide breaks down where the time actually goes, why network hops are often the hidden bottleneck, and how to architect around them.
Inference latency is the time between when a model receives an input and when it produces an output. It's measured in milliseconds, and for production AI it's the metric users actually feel.
It's worth distinguishing inference latency from adjacent concepts. Training time is how long it takes to train a model on a dataset (offline). Network latency is the time data spends in transit between systems. In practice, the "inference latency" most engineers care about includes a slice of network time on either side of the model computation.
Three types of latency compound in any deployed AI system: model latency (computation time), hardware latency (substrate execution time), and network latency (transit time between services). End-to-end latency is the sum.
That covers the basics. The sections below outline what to actually measure: time to first token (TTFT), inter-token latency (ITL), and end-to-end request latency.
Slow AI is broken AI. The threshold at which "slow" becomes "broken" depends on the use case, but the curve is steeper than most teams assume.
Human conversation is the clearest evidence. Across 10 languages spanning five continents, the average gap between conversational turns is roughly 200 milliseconds, with most clustering within ~250 ms of that mean. Anything longer registers as a pause; past ~500 ms, listeners start to feel that something is wrong, even if they can't articulate why. This is hard-wired behavior shown in psycholinguistic research on turn-taking timing.
For voice AI, that ~200 ms target sets the bar.
For e-commerce and web applications, the cost shows up as lost revenue. Portent analyzed 100M+ page views and found conversion rates drop from ~40% at one-second load times to ~29% at three seconds. Cloudflare's review shows similar patterns: even a two-second delay in page rendering correlates with measurable revenue loss per visitor.
The takeaway isn't that one number applies everywhere. It's that latency has a price, denominated in users who leave, calls that fail, and trades that miss the window.
Most articles on inference latency focus on the model. That's the wrong lens.
Model layer: the compute needed to produce a result, scaling with parameter count, FLOPs, and architecture choices. A 70B model performs more arithmetic than a 7B model.
Hardware layer: the substrate the model runs on. Inference is almost always faster on a GPU than a CPU for models larger than a few hundred million parameters because decode is typically memory-bandwidth-bound; GPUs offer far higher memory bandwidth and better parallelism. KV-cache size, batching, and speculative decoding also matter.
Network layer: the time data spends in transit, from the user to the inference endpoint, between services in your stack, and back. On a single machine, this layer is negligible; on real-time, multi-vendor stacks, it's often the largest single contributor to end-to-end latency, and the one most teams neglect to measure.
For real-time AI, the network layer is frequently the biggest lever.
Here's the back-of-the-envelope math.
The speed of light in fiber is roughly two-thirds of c, about 200,000 km/s. Standard fiber-latency calculations put propagation delay near ~5 ms per 1,000 km one-way. A round trip from the U.S. East Coast to Sydney is roughly 16,000 km each way, so the physical minimum round-trip time is ~160 ms. No amount of model optimization changes that.
On top of physics, every vendor boundary in your stack adds processing overhead. A typical voice-AI pipeline routes audio from a telephony provider to a speech-to-text service, then to an LLM provider, then to a text-to-speech service, and back to the user. Each handoff typically adds 30 to 80 ms for serialization, authentication, queueing, and routing, depending on provider, region, and load. With five hops, that's ~150 to 400 ms of overhead before any model performs a single matrix multiplication.
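Here's that arithmetic as a quick sketch. The distances and per-hop figures are illustrative, taken from the ranges above, not measurements of any particular provider:

```python
# Back-of-the-envelope latency floor: fiber propagation plus vendor-hop overhead.
# All figures are illustrative estimates, not benchmarks of any specific stack.

FIBER_MS_PER_1000_KM = 5  # ~5 ms one-way per 1,000 km in fiber (light at ~2/3 c)

def propagation_rtt_ms(one_way_km: float) -> float:
    """Physical minimum round-trip time for a given one-way fiber distance."""
    return 2 * one_way_km * FIBER_MS_PER_1000_KM / 1000

def hop_overhead_ms(hops: int, per_hop_ms: tuple[float, float] = (30, 80)) -> tuple[float, float]:
    """Low/high estimate of serialization, auth, queueing, and routing overhead."""
    low, high = per_hop_ms
    return hops * low, hops * high

# U.S. East Coast -> Sydney: roughly 16,000 km each way
print(propagation_rtt_ms(16_000))   # ~160 ms round trip before any compute happens
print(hop_overhead_ms(hops=5))      # (150, 400) ms of vendor-hop overhead on top
```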
This is why multi-vendor orchestration has a ceiling. Some providers report end-to-end response times in the ~600 ms range, which reflects careful optimization on top of a multi-vendor architecture. Those numbers are real and the result of careful engineering, but they also include the unavoidable network hops between the orchestration layer and the underlying STT, LLM, and TTS providers. The architecture itself becomes the constraint.
The implication: the largest pool of latency in a real-time AI system is often the part nobody owns, the connective tissue between providers. This is where the telecom edge matters: inference colocated with carrier infrastructure, not bolted onto it.
We outline an architectural fix, colocating inference with telephony points of presence (PoPs), in our piece on colocated infrastructure for voice AI.
There are three families of techniques. Most teams over-invest in the first two and under-invest in the third.
| Layer | Technique | Typical impact | Engineering effort |
|---|---|---|---|
| Model | Quantization (INT8, INT4) | ~1.5x to 2x speedup, modest quality cost | Low |
| Model | Speculative decoding | ~1.5x to 3x speedup, no quality cost | Medium |
| Model | Distillation | ~2x to 5x speedup, real quality cost | High |
| Hardware | GPU upgrade or batching | ~1.5x to 4x throughput | Low to medium |
| Infrastructure | Regional deployment | ~50 to 300 ms round-trip reduction | Low |
| Infrastructure | Colocated inference | Can remove ~150 to 400 ms of vendor-hop overhead | Low (with the right vendor) |
| Perceived UX | Streaming STT/TTS + partial responses | Big perceived latency win (TTFT drops dramatically) | Low to medium |
Model optimization has mature tooling. Quantization to INT8 or INT4 cuts memory-bandwidth requirements. Speculative decoding uses a small draft model to propose tokens that a larger verifier model checks in parallel. Distillation trains a smaller student model to mimic a larger teacher.
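To make speculative decoding concrete, here's a minimal sketch of the greedy-verification variant. The `draft_step` and `target_step` functions are toy stand-ins, not a real API; production implementations batch the verification into a single forward pass and use probabilistic acceptance rather than exact match:

```python
# Simplified speculative decoding loop (greedy-verification variant).
# draft_step and target_step are toy stand-ins for real model calls.
import random

VOCAB = list(range(100))

def draft_step(tokens: list[int]) -> int:
    """Cheap draft model: proposes the next token quickly."""
    return random.Random(sum(tokens)).choice(VOCAB)

def target_step(tokens: list[int]) -> int:
    """Expensive target model: the token we actually want."""
    return random.Random(sum(tokens) + 1).choice(VOCAB)

def speculative_decode(prompt: list[int], max_new: int = 32, k: int = 4) -> list[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_step(tokens + draft))
        # 2. Target model verifies each position (in practice, one batched pass).
        accepted = 0
        for i in range(k):
            expected = target_step(tokens + draft[:i])
            if draft[i] == expected:
                accepted += 1
            else:
                # Mismatch: keep the accepted prefix plus the target's own token.
                tokens.extend(draft[:accepted] + [expected])
                break
        else:
            tokens.extend(draft)  # every draft token matched the target
    return tokens[len(prompt):len(prompt) + max_new]
```

The output matches what the target model would have produced on its own; the speedup comes from real draft models agreeing with the target often enough that several tokens get accepted per expensive verification pass.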
Hardware optimization is the next lever. Selecting a GPU with adequate memory bandwidth matters more than peak FLOPs for most LLM workloads. Continuous batching, paged attention, and KV-cache management squeeze more useful work out of the silicon you have.
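A rough way to see why bandwidth dominates: during decode, each generated token requires streaming roughly the full weight set from memory once. The model sizes and bandwidth figures below are illustrative, not benchmarks:

```python
# Rough per-token decode latency for a memory-bandwidth-bound LLM.
# Assumes weights are re-read once per token; numbers are illustrative only.

def decode_ms_per_token(params_billion: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1e3  # milliseconds

# 70B model, FP16 weights, GPU with ~2 TB/s of memory bandwidth
print(decode_ms_per_token(70, 2.0, 2.0))   # ~70 ms per token
# Same model quantized to INT4 (0.5 bytes/param): roughly 4x less data to move
print(decode_ms_per_token(70, 0.5, 2.0))   # ~17.5 ms per token
```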
Infrastructure optimization is where the largest gains usually live, and it's the layer most teams treat as someone else's problem. The two highest-impact moves are deploying inference in the same region as your users and reducing vendor handoffs by running pipeline components in the same facility. The first reduces physics. The second reduces overhead. Both tend to deliver more milliseconds back than any quantization scheme.
The economics follow from the architecture. Fewer hops means fewer billed components. Telnyx TTS starts at $0.000003/char, 10x less than ElevenLabs, and SIP trunking runs $0.005/min, half Twilio's rate. These aren't promotional discounts; they're structural, the result of owning infrastructure instead of renting it.
Telnyx's Inference API is built around this principle.
Cloud inference is centralized: a single region serves users across a continent, and latency varies with the user's distance from that region.
Edge inference is distributed: compute lives closer to where data is generated, and latency stays more consistent across geographies.
If users are concentrated in one region, cloud inference can work well. With global users or strict latency budgets, physics works against you. A user in Singapore calling a model hosted in Virginia incurs ~200 ms of round-trip time before anything else happens.
Edge inference mitigates the distance problem by deploying compute regionally, often colocated with the network infrastructure already serving users in that region.
Comparative analysis from InfoWorld finds that edge deployments consistently deliver lower and more predictable latency, with the added benefits of lower bandwidth costs and stronger data sovereignty. Mirantis covers the architecture in detail in its guide to edge AI inference, and we cover the trade-offs from a communications perspective in our edge vs. cloud comparison.
The Telnyx Global Edge Router routes traffic to the nearest regional inference point automatically, so engineers don't have to make the deployment decision call by call.
Voice AI is a canonical case where inference latency makes or breaks the product. The latency budget is short, the failure mode is obvious, and the architectural seams are the most exposed.
A typical voice-AI pipeline, per turn: capture audio, transmit to a telephony provider, run speech-to-text, send the transcript to an LLM, generate a response, run text-to-speech, and play the audio back. If the user is on one continent and any service is on another, you pay tens to hundreds of milliseconds in pure transit cost on each leg. The ~600 ms benchmark often cited as industry-leading reflects teams that optimize aggressively on top of a multi-vendor architecture; the seams dominate.
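A sketch of that per-turn budget, with placeholder stage times; these are not any provider's real numbers, and your own stack will differ, so measure before trusting any of them:

```python
# Illustrative per-turn latency budget for a multi-vendor voice AI pipeline.
# Every value here is a placeholder for discussion, not a measurement.

turn_budget_ms = {
    "audio capture + telephony ingress": 30,
    "network hop -> STT provider":       50,
    "speech-to-text (final transcript)":  150,
    "network hop -> LLM provider":       50,
    "LLM time to first token":           200,
    "network hop -> TTS provider":       50,
    "text-to-speech (first audio)":      80,
    "network hop -> user playback":      50,
}

total = sum(turn_budget_ms.values())
vendor_hops = sum(v for k, v in turn_budget_ms.items() if k.startswith("network hop"))

print(f"total per turn: {total} ms")             # ~660 ms with these placeholders
print(f"of which vendor hops: {vendor_hops} ms")  # ~200 ms no model change removes
```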
Other real-time use cases face the same physics. Algorithmic trading, fraud detection, autonomous systems, real-time recommendations: all have latency budgets measured in tens or hundreds of milliseconds, and all suffer when inference is separated from the data source by a long network path.
We benchmarked leading providers in our voice AI latency comparison, and the pattern is consistent: where inference runs dominates latency outcomes.
You can't optimize what you don't measure. For LLM workloads, focus on four metrics:
Time to first token (TTFT): time from request to the first byte of output, what users perceive as responsiveness.
Inter-token latency (ITL): average time between successive tokens once generation has started.
End-to-end latency: full duration from request to final token.
Throughput: tokens or requests served per second.
Percentiles matter, too: P50 reflects a typical user, while P95/P99 reflect your worst-served users, and in production the tail often drives churn.
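A minimal measurement sketch follows. Here `stream_tokens` is a hypothetical stand-in for whatever streaming client your provider exposes; swap in the real call and keep the timing logic:

```python
# Measuring TTFT, ITL, and end-to-end latency around a streaming response.
# `stream_tokens` is a placeholder simulating a streaming LLM API.
import time
import statistics

def stream_tokens(prompt: str):
    """Placeholder generator simulating a streaming LLM response."""
    time.sleep(0.200)               # pretend TTFT is ~200 ms
    for word in "this is a simulated streaming answer".split():
        time.sleep(0.030)           # pretend ITL is ~30 ms
        yield word

def measure(prompt: str) -> dict:
    start = time.perf_counter()
    arrival_times = []
    for _ in stream_tokens(prompt):
        arrival_times.append(time.perf_counter())
    ttft = arrival_times[0] - start
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return {
        "ttft_ms": ttft * 1e3,
        "itl_ms": statistics.mean(gaps) * 1e3 if gaps else 0.0,
        "e2e_ms": (arrival_times[-1] - start) * 1e3,
    }

# Run many requests, then report the median and the tail, not just the average.
runs = [measure("hello")["e2e_ms"] for _ in range(20)]
cuts = statistics.quantiles(runs, n=100)
print(f"p50 = {cuts[49]:.0f} ms, p95 = {cuts[94]:.0f} ms")
```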
Inference latency is a systems problem. The model and hardware layers both matter, and both have well-developed optimization playbooks. But for real-time AI, the network layer is where the largest pool of latency lives, and vendor handoffs make it worse.
If you're building voice AI or any real-time application where inference latency directly shapes the user experience, the most leverage is in the architecture, not the model.
This is what AI Agent Infrastructure looks like: carrier-grade telephony, colocated inference, and global communications operating as one system. Telnyx reduces the vendor hops that make most stacks slow.