Practical synthetic speech detection for real-world voice: what works, what breaks, and how to deploy it
AI-generated voice has crossed a threshold. Text-to-speech (TTS) and voice conversion (VC) systems now produce audio that most human listeners cannot reliably distinguish from authentic speech. Research shows that people correctly identify high-quality deepfake audio only about 24.5% of the time, which means attackers have the statistical advantage in nearly every social engineering scenario.
The financial impact is already staggering. Deloitte's Center for Financial Services projects that generative AI could enable fraud losses to reach $40 billion in the United States by 2027, up from $12.3 billion in 2023. Voice cloning fraud specifically rose 680% year over year according to Pindrop's 2025 analysis, and businesses lost an average of nearly $500,000 per deepfake-related incident in 2024.
Most of the conversation around synthetic speech detection focuses on model accuracy. That's the wrong frame. The real question is where detection runs in your stack. A best-in-class detector that sits behind three vendor hops is slower, less reliable, and more expensive than a good detector running natively on the same infrastructure as your voice pipeline. This article walks through how detection actually works, then explains why the platform you build on determines whether detection is a feature or a liability.
Detection accuracy gets all the academic attention, but in production, latency is what decides whether detection prevents fraud or just documents it after the fact.
When detection runs through a separate vendor stack, you add 30 to 150 milliseconds of latency per call. Audio leaves your voice provider, travels to a detection API, gets scored, and travels back. By the time the result lands, the caller has already authenticated, requested a wire transfer, or approved an account change. Detection becomes a forensic tool, not a defense.
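To make that budget concrete, here is a toy tally of where the time goes on a bolted-on detection path. The per-hop numbers are illustrative, picked from the 30-to-150-millisecond range above; they are not measurements of any specific vendor.

```python
# Hypothetical latency budget (milliseconds) for a bolted-on detection path.
# Each hop is a separate network trip or inference step; values are
# illustrative, not measured.
hops = {
    "voice_provider_to_detection_api": 60,  # audio leaves the voice stack
    "detection_inference": 25,              # model scores the segment
    "detection_api_to_orchestrator": 60,    # result routed back into the call
}

total_ms = sum(hops.values())
print(f"Detection result lands {total_ms} ms after the audio was captured")
```

Even with optimistic per-hop numbers, the result arrives well after a real-time agent has started its first response.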
The other production constraints (codec mismatch, model drift, telephony artifacts) all compound this problem. Detectors are sensitive to the spectral characteristics of the audio they receive. Telephony codecs strip high-frequency information that detection models rely on. Background noise, cross-talk, and variable microphone quality introduce artifacts that overlap with the spectral signatures of synthetic speech. New TTS models appear constantly, so any static system starts drifting from day one.
These aren't separate problems. They're symptoms of the same root cause: detection only works reliably when it runs on the same infrastructure layer as your voice pipeline. Bolt detection onto a fragmented stack and every codec hop, every vendor handoff, every routing decision degrades the signal the detector needs.
Synthetic speech detection approaches generally fall into three categories: handcrafted acoustic features, end-to-end deep learning, and self-supervised learning (SSL) front-ends. Each has distinct tradeoffs for production deployment.
Handcrafted feature analysis relies on extracting engineered representations like Mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), or constant-Q transform (CQT) spectrograms. These features are computationally lightweight and interpretable, but they require extensive tuning and struggle to capture the nuanced artifacts left by modern neural vocoders.
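As a toy illustration of the handcrafted approach, the sketch below pools a naive DFT magnitude spectrum into log band energies. A production LFCC/MFCC pipeline would use an FFT, a filterbank, and a DCT, but the feature-engineering idea, turning a raw frame into a small engineered vector, is the same.

```python
import math

def frame_band_energies(frame, n_bands=4):
    """Toy handcrafted feature: log energy in linearly spaced frequency
    bands, computed with a naive DFT. Illustrative stand-in for
    LFCC/MFCC-style extraction, not a production feature."""
    n = len(frame)
    # Magnitude spectrum up to the Nyquist bin.
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    # Pool bins into equal-width linear bands and take log energy.
    band_size = len(mags) // n_bands
    feats = []
    for b in range(n_bands):
        energy = sum(m * m for m in mags[b * band_size:(b + 1) * band_size])
        feats.append(math.log(energy + 1e-10))
    return feats

# A 200 Hz sine sampled at 8 kHz: energy concentrates in the lowest band.
frame = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(64)]
feats = frame_band_energies(frame)
```

The appeal is clear from the code: a few lines, no training data, fully interpretable. The weakness is equally clear: nothing here knows anything about neural vocoder artifacts.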
End-to-end deep learning models like AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks) process raw waveforms directly through graph neural networks that model the relationship between spectral and temporal domains. AASIST's architecture learns to detect spoofing cues without relying on predefined features, but it can overfit to the specific attacks in its training data.
Self-supervised learning front-ends represent the current state of the art. Models like wav2vec 2.0, WavLM, and HuBERT, originally trained for speech recognition on massive unlabeled datasets, have proven remarkably effective as feature extractors for deepfake detection. Tak et al. (2022) demonstrated that pairing a fine-tuned wav2vec 2.0 front-end with the AASIST back-end achieved an equal error rate (EER) of just 0.82% on the ASVspoof 2021 Logical Access dataset, an improvement of almost 90% relative to the baseline system. Self-supervised models learn acoustic structure from raw audio at scale, which generalizes far better than features tied to specific synthesis artifacts.
| Method | Input | Strengths | Weaknesses | Best for |
|---|---|---|---|---|
| Handcrafted features (MFCC, LFCC, CQT) | Extracted coefficients | Low compute, interpretable | Brittle against novel attacks | Edge devices, fast pre-screening |
| End-to-end DNN (AASIST, RawNet2) | Raw waveform | No manual feature engineering | Can overfit to known attacks | Research baselines, controlled environments |
| SSL front-ends (wav2vec 2.0, WavLM) | Raw waveform | Strong generalization, state-of-the-art EER | Higher compute, requires fine-tuning | Production detection pipelines |
| Ensemble/fusion | Multiple inputs | Combines complementary signals | Increased latency and complexity | High-stakes verification, forensic analysis |
The ASVspoof challenge series remains the primary benchmark for detection research. ASVspoof 5 drew submissions from 53 teams in 2024 and confirmed a critical gap: many solutions performed well on known attacks but degraded significantly under adversarial filtering and neural codec compression. Adversarial techniques like Malafide and Malacopula were specifically designed to fool detection systems, and they did. The standard metric is equal error rate (EER), which balances false acceptance (letting synthetic speech through) and false rejection (flagging real speech as fake).
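EER is straightforward to compute from a detector's scores. The sketch below (plain Python, assuming higher scores mean "more likely bona fide") sweeps candidate thresholds and returns the operating point where false acceptance and false rejection rates cross.

```python
import random

def equal_error_rate(bona_fide_scores, spoof_scores):
    """EER: the rate at the threshold where the false acceptance rate
    (spoofed audio scored as genuine) equals the false rejection rate
    (genuine audio scored as spoofed). Assumes higher = more bona fide."""
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]

# Toy example: well-separated score distributions give a low EER.
random.seed(0)
bona = [random.gauss(2.0, 1.0) for _ in range(400)]   # genuine speech
spoof = [random.gauss(-2.0, 1.0) for _ in range(400)]  # synthetic speech
eer = equal_error_rate(bona, spoof)
```

Note what EER deliberately hides: it is a single balanced operating point. A production deployment usually cares about an asymmetric tradeoff, for example tolerating more false rejections on high-value transactions.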
For teams measuring audio quality in their own TTS pipelines, UTMOS provides an automated proxy for subjective Mean Opinion Score ratings on a 1-to-5 scale and has become a standard reference for evaluating speech naturalness.
The takeaway from ASVspoof 5 isn't that detection is solved. It's that the gap between lab performance and production performance is widening. Codec-induced distortion alone can move EER by several percentage points, and that gap is exactly where fragmented, multi-vendor "Frankenstack" architectures lose.
Here's what a typical voice AI deployment looks like today: orchestration from one vendor (say, Vapi or Retell), TTS from another (ElevenLabs), STT from a third (Deepgram), telephony from a fourth (Twilio), and detection bolted on as a fifth. Every layer is a separate API, with its own auth, its own SLAs, and its own contribution to total latency.
Detection on this kind of stack means audio gets transcoded, packetized, and shipped between vendors before the detector ever sees it. By then, the spectral signal the detector depends on has already been degraded by the audio path itself. And the detection result, when it finally returns, has to be routed back into the call flow through whatever orchestration layer ties the stack together.
Telnyx unifies these layers. Carrier-grade voice, real-time media streaming, programmable call control, and AI inference all run on the same private global IP network, with GPUs colocated next to telephony Points of Presence. That isn't a marketing claim about integration. It's a physical fact about where the compute lives. When detection runs on this kind of architecture, audio doesn't leave the network to get scored. It runs inline, in parallel with STT, with no inter-vendor hop.
There's a deeper advantage that Frankenstack architectures simply can't replicate: detection can run at the SIP and carrier layer, before audio ever reaches the LLM or even the application logic.
Telnyx is a licensed telecom provider in 30+ countries with PSTN reach into 100+. That means the same platform that terminates the call can score the audio for synthetic speech artifacts at the network edge, then route the call accordingly. Suspicious calls can be flagged before they consume application resources. Trusted calls can flow through with no added latency. None of that is possible if your telephony is one vendor and your AI is another.
This is the difference between detection as a feature and detection as infrastructure. Carrier-level detection isn't an API call. It's a property of the network the call is already on.
AI agents have a different tolerance for latency than human-built workflows. A human reviewing a flagged transaction can wait 500 milliseconds for a detection result. An agent making a real-time decision in a live conversation cannot.
If a voice agent runs on a Frankenstack and detection lands 500 milliseconds after the synthetic caller starts speaking, the agent has already begun responding. It has already committed to a conversational path. It is, functionally, running blind. The detection result arrives too late to change the agent's first action, which is often the most consequential one.
On unified infrastructure, detection runs in parallel with STT on the same hardware. There is no inter-vendor round trip. The agent gets the detection signal at the same time it gets the transcript, which means it can shape its very first response based on whether the caller is real. Step-up verification, escalation to a human, or a different conversational policy can all be triggered before the agent has committed to anything.
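A minimal sketch of that parallel pattern in Python's asyncio, with `transcribe` and `score_synthetic` as hypothetical stand-ins for the real STT and detection services (the names and timings are illustrative, not Telnyx APIs):

```python
import asyncio

async def transcribe(audio_chunk: bytes) -> str:
    """Hypothetical STT call; the sleep simulates inference time."""
    await asyncio.sleep(0.05)
    return "transfer five thousand dollars"

async def score_synthetic(audio_chunk: bytes) -> float:
    """Hypothetical detector call; returns a spoof likelihood in [0, 1]."""
    await asyncio.sleep(0.04)
    return 0.91

async def handle_turn(audio_chunk: bytes):
    # Run STT and detection concurrently, so the agent receives the
    # transcript and the detection score together, before its first reply.
    transcript, spoof_score = await asyncio.gather(
        transcribe(audio_chunk), score_synthetic(audio_chunk)
    )
    return transcript, spoof_score

transcript, spoof_score = asyncio.run(handle_turn(b"\x00" * 320))
```

Because the two calls run concurrently, the turn costs max(STT, detection) rather than their sum, and the detection signal is available for the agent's very first action.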
Agents don't optimize for "actionable insights." They optimize for trusted, low-latency signals at decision time. Detection has to be one of those signals, and it has to live where the agent lives.
For teams building voice AI agents that handle sensitive transactions like insurance claims, financial account changes, or identity verification, detection should be a first-class signal in the agent's decision logic, not a separate vendor it consults.
A practical architecture on Telnyx looks like this: a caller reaches the voice AI agent, audio streams over WebSockets, and a detection module scores each segment in parallel with the speech-to-text pipeline running on the same colocated GPU infrastructure. If the detection score crosses a configurable threshold, the agent escalates, requiring out-of-band verification before proceeding. Call routing and agent behavior are already controlled through programmable voice workflows, so detection becomes one more signal feeding the same control plane, not a parallel system to integrate.
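One way to express the configurable-threshold step in code. The thresholds and action names below are hypothetical and would be tuned per deployment; the point is that detection feeds a policy, not a binary block/allow switch.

```python
from dataclasses import dataclass

@dataclass
class DetectionPolicy:
    """Illustrative policy mapping a synthetic-speech score to an agent
    action. Thresholds are hypothetical placeholders, not recommendations."""
    escalate_above: float = 0.8   # hand off to a human reviewer
    stepup_above: float = 0.5     # demand out-of-band verification

    def decide(self, spoof_score: float) -> str:
        if spoof_score >= self.escalate_above:
            return "escalate_to_human"
        if spoof_score >= self.stepup_above:
            return "require_out_of_band_verification"
        return "proceed"

policy = DetectionPolicy()
```

A graduated policy like this lets trusted callers flow through untouched while ambiguous calls pay a small verification cost instead of an outright rejection.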
Production systems should log detection confidence scores, capture metadata about codec and channel conditions, and feed that data back into model retraining loops. Attackers iterate. Detection systems need to iterate faster, and that's much easier when the data, the models, and the call flow all live on one platform.
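A minimal sketch of such a log record. The field names are illustrative, chosen to capture the codec and channel conditions called out above so retraining data carries the context the detector saw.

```python
import json
import time

def detection_log_record(call_id: str, spoof_score: float,
                         codec: str, snr_db: float) -> str:
    """Illustrative structured log entry for one detection event.
    Channel metadata (codec, SNR) is recorded because it shifts detector
    accuracy and is needed for retraining. Field names are hypothetical."""
    return json.dumps({
        "ts": time.time(),
        "call_id": call_id,
        "spoof_score": spoof_score,
        "codec": codec,
        "snr_db": snr_db,
    })

record = detection_log_record("call-123", 0.72, "OPUS", 24.0)
```

Emitting these as structured JSON rather than free-text logs makes it trivial to slice detection performance by codec or channel condition later.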
The detection field is moving quickly. Foundation models are showing stronger generalization capabilities than task-specific architectures, largely due to the scale and diversity of their pretraining data. Few-shot fine-tuning is emerging as a practical path to adapting detection systems for specific deployment contexts, such as a particular contact center's caller population or telephony setup, without requiring massive labeled datasets.
On the attack side, adversarial techniques will continue to improve. ASVspoof 5 made clear that codec-aware attacks and adversarial filtering can substantially degrade even strong detection systems. Teams deploying detection in production should plan for continuous model updates and treat detection accuracy as a metric that requires ongoing investment, not a one-time benchmark to hit.
The platform decision sets the ceiling for everything else. Detection capability starts with the network, and the gap between integrated and bolted-on detection only widens as attacks get more sophisticated.
Telnyx owns the infrastructure layers where detection matters most. Carrier-grade voice on a private global IP network, real-time WebSocket media streaming, programmable call control, and colocated GPU inference aren't add-ons to a fragmented platform. They're the foundation. That's why detection on Telnyx runs 100 to 300 milliseconds faster than detection bolted onto a typical Frankenstack, why it sees cleaner audio, and why AI agents built on Telnyx can act on detection signals in the same turn they receive them.
That's also why agents choose it. Talk to our team to see how Telnyx can power your voice AI stack.
Can I run synthetic speech detection on a Frankenstack? Yes, but with measurable penalties. Routing audio through separate orchestration, telephony, and detection vendors typically adds 100 to 300 milliseconds of round-trip latency to the detection result, plus codec degradation at every transcoding hop. For post-call review or compliance logging, that's acceptable. For real-time fraud prevention or AI agent decision-making, it usually isn't.
What's the latency difference between integrated and bolted-on detection? Bolted-on detection adds roughly 30 to 150 milliseconds per inter-vendor hop, depending on geography, codec, and API design. A typical Frankenstack with three or four vendors in the audio path can push total detection latency past 300 milliseconds. Detection running on the same infrastructure as the voice pipeline, with GPUs colocated at the telephony Point of Presence, runs in parallel with STT and adds no additional round-trip time.
Where in the stack should detection live? Ideally, at the carrier layer, so suspicious audio can be flagged before it consumes application resources, and in parallel with STT, so AI agents get the detection signal at the same time they get the transcript. Both placements require that telephony and AI inference run on the same network. Otherwise, you're choosing between detection that's late and detection that's even later.
Does audio quality affect detection accuracy? Significantly. Detectors are sensitive to spectral characteristics, and lossy codecs strip exactly the high-frequency information detectors rely on. Carrier-grade voice with HD codecs produces measurably better detection performance than narrow-band audio from legacy SIP trunks.
What happens when new TTS models appear? Static detectors degrade. Production systems need continuous retraining loops fed by real call data, which is significantly easier when the detection model, the audio path, and the agent logic all live on one platform that you can update together.