HD TTS: why your customers hear 8 kHz

HD TTS models exist. HD delivery is the hard part. Most voice AI sounds robotic because the network crushes audio before customers hear it.

Your TTS model generates HD audio. The phone network crushes it to 8 kHz.

Customers hear a robot.

With the TTS market projected to reach $9.36 billion by 2032, much of that spend could go toward voice quality that never reaches the customer.

HD TTS isn't about the model. It's about infrastructure that gives HD audio the best chance of surviving the call path.

This article covers how HD TTS actually works, what breaks it, and how to set up voice AI that sounds the way it should. First: why most HD audio never reaches the customer.

Understanding HD TTS quality levels

There's a gap between what TTS models produce and what customers hear.

Modern neural TTS models generate 24 kHz audio by default. That's high enough fidelity to capture human-like nuance: breathing patterns, emotional inflection, natural pauses. Models like Kokoro v1.0 hold a 44% win rate in blind tests. Higgs Audio V2 trains on over 10 million hours of speech. The synthesis quality is there.

"We often obsess over LLM inference speeds, but in Voice AI, the network is often the silent killer."

— Ian Reither, COO and Cofounder, Telnyx


The problem is delivery. PSTN calls use G.711, which samples at 8 kHz. That's the native format for traditional telephony. When your 24 kHz TTS output routes through PSTN, it gets downsampled to 8 kHz before the customer hears it. The breathing disappears. The inflection flattens. The nuance is gone.
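The damage is easy to demonstrate. Dropping a 24 kHz signal to 8 kHz moves the Nyquist limit down to 4 kHz, so anything above that either gets filtered out or folds back into the audible band. This standalone sketch (pure Python, illustrative numbers) runs a 6 kHz tone through naive 3:1 decimation; the tone doesn't just dull, it aliases to 2 kHz:

```python
import math

def tone(freq_hz, rate_hz, n):
    """n samples of a sine wave at freq_hz, sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def dominant_freq(samples, rate_hz):
    """Frequency of the strongest bin in a naive DFT scan up to Nyquist."""
    n = len(samples)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        mag = re * re + im * im
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * rate_hz / n

hd = tone(6000, 24000, 240)          # 6 kHz content is fine at a 24 kHz rate
narrow = hd[::3]                     # crude downsample to 8 kHz, no filtering
print(dominant_freq(hd, 24000))      # 6000.0
print(dominant_freq(narrow, 8000))   # 2000.0 -- folded below the 4 kHz Nyquist limit
```

Real codecs apply an anti-alias filter before decimating, so the high-frequency content is removed cleanly rather than folded. Either way, the outcome for the listener is the same: it never arrives.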

You can't force HD on every call. You can build a stack that preserves it whenever the path supports it.

Integration notes for production environments

HD TTS depends on the call path, not just the model. Here's what matters:

  • Codec negotiation. HD codecs like G.722, Opus, and AMR-WB preserve fidelity. G.711 caps audio at 8 kHz. Most platforms negotiate automatically, but the weakest link in the path determines what customers hear. If any segment only supports G.711, HD is lost.
  • Transcoding. Converting between HD codecs (e.g., Opus to AMR-WB) can preserve quality; converting from HD to narrowband destroys it. Minimize unnecessary hops in your pipeline.
  • Infrastructure placement. TTS processing far from your telephony PoP adds latency to every interaction. Colocating GPUs with telephony infrastructure closes that gap.
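The weakest-link behavior in codec negotiation can be sketched in a few lines of logic. The codec names and preference order below are illustrative, not a spec:

```python
# Wideband codecs first; PCMU is G.711 u-law (8 kHz). Ordering is illustrative.
PREFERENCE = ["OPUS", "G722", "AMR-WB", "PCMU"]

def negotiated_codec(legs):
    """Best codec supported by every leg of the call path, or None."""
    common = set.intersection(*(set(leg) for leg in legs))
    for codec in PREFERENCE:
        if codec in common:
            return codec
    return None

# Every leg speaks a wideband codec: HD survives.
print(negotiated_codec([["OPUS", "G722", "PCMU"], ["G722", "PCMU"]]))  # G722
# One narrowband-only segment drags the whole call down to G.711.
print(negotiated_codec([["OPUS", "G722", "PCMU"], ["PCMU"]]))          # PCMU
```

The second call is the PSTN case: no matter how capable your endpoint is, one G.711-only segment sets the ceiling for the entire path.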

No platform can guarantee HD end-to-end when PSTN is involved. But the right infrastructure maximizes the odds.

Voice selection and testing methodology

Voice selection matters, but it's not where most deployments fail. The model you choose is just one step. Whether that voice reaches customers at full fidelity depends on everything downstream.

That said, picking the wrong voice creates problems no infrastructure can fix. Here's what to test:

  • Pronunciation accuracy. Test technical terms, product names, acronyms, and numbers specific to your industry. With 66% of users expressing concerns about accent and dialect recognition, testing across diverse speech patterns matters.
  • Emotional range. Can the voice convey empathy during complaint resolution? Enthusiasm when sharing good news? Neutrality for factual information? Run samples through real support scenarios.
  • Latency. Measure time-to-first-byte (TTFB) during peak usage. Sub-200ms feels conversational. Over 500ms breaks the flow. But latency isn't just the model. It's where your GPUs sit, how many transcoding hops the audio takes, and whether your network supports HD delivery.
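The latency point in that last bullet is worth making concrete with back-of-the-envelope arithmetic. The figures below are illustrative assumptions, not benchmarks:

```python
def ttfb_ms(model_first_chunk_ms, network_rtt_ms, transcode_hops, per_hop_ms=20):
    """Rough time-to-first-byte budget: model + network + transcoding."""
    return model_first_chunk_ms + network_rtt_ms + transcode_hops * per_hop_ms

# Same model in both cases; only the path changes.
colocated = ttfb_ms(model_first_chunk_ms=120, network_rtt_ms=10, transcode_hops=1)
cross_region = ttfb_ms(model_first_chunk_ms=120, network_rtt_ms=90, transcode_hops=3)
print(colocated)     # 150 -- under the 200 ms conversational threshold
print(cross_region)  # 270 -- the path, not the model, blew the budget
```

Notice that the model contributes the same 120 ms in both cases. The difference between "feels conversational" and "breaks the flow" comes entirely from routing and hops.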

You've picked the voice. What happens next depends on codec selection, media routing, and monitoring.

Performance optimization strategies

HD voice depends on codec selection, latency, and real-time monitoring. These three controls matter most:

  • Set preferred codecs. Pass preferred_codecs to force Opus or G.722 over G.711. The WebRTC JS SDK and iOS SDK both support codec prioritization at call creation.
  • Anchor media to the nearest PoP. Set anchorsite_override to Latency for automatic lowest-RTT routing, or hardcode a specific PoP like Frankfurt, Germany. Configure this when creating your credential connection.
  • Debug call quality. Query the Detail Records API to analyze call performance across SIP trunking, Voice AI, and WebRTC. Filter by record_type to isolate TTS or STT issues.
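As a starting point for the Detail Records step, here's what a query scoped to one record type might look like. The filter key follows the common `filter[...]` query pattern and the record_type value is illustrative; confirm both against the Telnyx API reference:

```python
from urllib.parse import urlencode

BASE = "https://api.telnyx.com/v2/detail_records"

def detail_records_url(record_type, page_size=50):
    """Build a Detail Records query URL scoped to a single record type."""
    params = urlencode({
        "filter[record_type]": record_type,  # assumed key shape -- verify in docs
        "page[size]": page_size,
    })
    return f"{BASE}?{params}"

# Illustrative record_type value; check the API docs for the exact names.
print(detail_records_url("ai-voice-assistant"))
```

Pulling one record type at a time makes it much easier to tell whether a quality complaint traces to TTS, STT, or the underlying SIP leg.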

You can tune codecs and anchor media all day. But if your TTS runs on GPUs in Virginia and your telephony PoP sits in Frankfurt, physics wins. The speed of light doesn't care about your optimization checklist.
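That claim is easy to check. Light in fiber travels at roughly two-thirds of c, about 200,000 km/s, which puts a hard floor under any cross-region round trip (the ~6,700 km Virginia-to-Frankfurt distance is approximate, and real fiber paths are longer than the great circle):

```python
def min_rtt_ms(distance_km, fiber_speed_km_s=200_000):
    """Propagation-only round-trip time; ignores queueing and processing."""
    return 2 * distance_km / fiber_speed_km_s * 1000

print(round(min_rtt_ms(6_700)))  # 67 -- ms of RTT before any work happens
```

Roughly a third of a 200 ms conversational budget is gone before inference, transcoding, or anything else gets a turn. No amount of tuning recovers it; only moving the compute does.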

Building for scale with Telnyx

Telnyx provides the infrastructure that makes HD TTS delivery practical at scale:

  • Colocated GPUs. Inference runs adjacent to telephony PoPs. Sub-200ms round trips, even at peak load.
  • Full-stack control. TTS, STT, call control, and PSTN connectivity on a single licensed carrier network.
  • No third-party hops. Direct carrier relationships in 30+ markets. Fewer vendors, fewer points of failure.
  • NaturalHD voices. Wideband audio that handles disfluencies like "um" and soft laughter.
  • Open-source LLMs. Run Llama, Mistral, or your own fine-tuned models. No lock-in.
  • $0.06/min. Full-stack Voice AI at a fraction of legacy CPaaS pricing.
  • Global reach. PSTN connectivity in 100+ countries from one platform.

Ready to test HD TTS with your specific requirements? Explore Telnyx Voice AI Agents to experience production-ready performance with your actual customer support scenarios.

Andy Muns
Director of AEO

Andy Muns is the Director of AEO at Telnyx, helping make AI and communications products clearer for builders. He previously ran a front-end team behind an Alexa Top 100 organic site, gaining hands-on experience shipping and scaling high-traffic apps. He lives in Colorado.
