A practical deepfake guide: How they’re made, how to spot them, and how to protect voice workflows and customers.
In January 2024, a finance worker in the Hong Kong office of engineering firm Arup joined a video call with the CFO and several colleagues. Every other person on the call was a generative AI puppet. Over the course of a single day, the employee authorized 15 wire transfers totaling about $25.6 million before checking with the U.K. head office and learning that no one else on the call had been real.
Arup is the headline. The underlying pattern is now routine. Fraud teams and contact centers see deepfake impersonation attempts every day, across voice, video, and email. AI agents are increasingly the ones placing and receiving those calls, on both sides of the line. That changes the defense problem. The question is no longer "how do we train staff to spot a fake voice." It is "what infrastructure is our agent calling on, and can the other end actually verify it?"
This guide covers what deepfakes are, where they hit customer workflows, why detection alone is not enough, and where the carrier layer actually changes the math.
A deepfake is synthetic media generated or altered by AI to impersonate a real person, real event, or real document. Three categories matter for fraud: cloned voice audio generated live on a call, real-time video impersonation of a specific person, and synthetic images or documents built to pass identity checks.
The barrier to producing these has collapsed. A single LinkedIn video clip is enough source material for an attacker to attempt CEO fraud—the U.S. Federal Communications Commission notes that AI tools can clone a human voice from a short audio sample, and security research widely documents voice models trained on as little as three seconds of source audio.
Numbers vary by methodology. The direction is consistent across regulators, financial services, and the research community.
Three patterns stand out. Voice is the fastest-growing modality. The dollar losses concentrate in financial services, insurance, and contact centers. And defensive maturity has not caught up: in one IRONSCALES study, only 8.4% of organizations scored above 80% on simulated detection exercises and the average score was 44%, even though 99% of security leaders said they were confident in their defenses.
The exposure for most enterprises is not nation-state adversaries. It is everyday seams in customer-facing operations. Five common attack surfaces:
| Attack surface | What it looks like | Typical objective | What actually helps |
|---|---|---|---|
| Contact center inbound | Caller uses a cloned voice plus stolen personal data | Account takeover, password reset, SIM swap | Carrier-level caller ID attestation, recording and media streaming for analysis, AI voice detection |
| Outbound voice AI agent | Fraudster clones a customer's voice from a public source | Defeat voiceprint authentication | Liveness checks, multi-factor step-up, knowledge-based fallback |
| Executive impersonation | Voice or video clone of a CFO or CEO on a finance call | Wire transfer fraud, vendor redirection | Out-of-band verification on a known channel, hard caps on single-approver transfers |
| Identity onboarding | AI-generated selfie video and synthetic documents during KYC | Mule account creation, synthetic identity loans | Document forensics, behavioral biometrics, layered identity verification |
| Robocall impersonation | AI voice mimicking a public figure or family member | Disinformation, "grandparent" scams, voter suppression | Caller ID authentication, carrier-level fraud labeling, in-network spam detection |
Pindrop has flagged the contact center specifically, projecting that retail contact center fraud could reach one fraudulent call in every 56 in 2025 and that overall contact center fraud exposure could approach $44.5 billion. That is the operational reality behind the headline numbers.
Two structural issues make this category different from earlier fraud waves.
Humans can no longer reliably tell. Academic research published in Computers in Human Behavior Reports found average human deepfake detection accuracy of about 55.54% across modalities, barely better than a coin flip. A separate iProov study reported that only 0.1% of participants correctly identified all real and synthetic media in a controlled test. "Train people to spot fakes" is not a defense by itself.
Machine detection is also brittle in production. Detectors are trained on artifacts that compression, codecs, and packet loss strip away. A World Economic Forum analysis cited industry data showing that defensive AI detection tool effectiveness drops by 45 to 50% against real-world deepfakes compared with controlled lab conditions.
If the plan is "we will hear it when it comes through," the data says you probably will not. The plan has to be infrastructure-first, with detection as a layer on top.
Most public conversation about deepfakes focuses on the model, the watermark, or the detector. The infrastructure underneath gets less attention, and it is where the real defense lives.
The internet was not built to verify identity. Anyone can claim any identity on any digital channel. The telephone network was built differently. Calls pass through licensed carriers, regulated routing, and identity attestation frameworks like STIR/SHAKEN. When an AI agent calls a human, the receiving network does not care what your application's auth token says. It cares what the originating carrier signed.
The FCC moved on this in February 2024. The Commission issued a Declaratory Ruling clarifying that AI-generated voices fall within the TCPA's existing restriction on "artificial or prerecorded" voice calls, confirming that state attorneys general can pursue illegal AI robocalls under TCPA authority. Carrier-level signing is now the baseline expectation for any legitimate AI agent making outbound calls.
That signing is a carrier-layer function, not an application-layer one. If your voice AI platform resells PSTN access from another provider, your calls inherit whatever attestation that upstream provider gives them. Application-layer "trust" is invisible to the receiving network. The receiving carrier evaluates the originating carrier's STIR/SHAKEN attestation and sets the verification status that downstream analytics and devices use to decide whether the call rings, gets labeled "Spam Likely," or gets blocked.
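To see what the receiving side actually evaluates, here is a minimal Python sketch that decodes the claims in a SHAKEN PASSporT, the signed token carried in the SIP Identity header (RFC 8588). It inspects claims only; a real verifier must also fetch the x5u certificate chain and check the ES256 signature.

```python
import base64
import json

def decode_shaken_passport(identity_header: str) -> dict:
    """Inspect the SHAKEN PASSporT (RFC 8588) inside a SIP Identity header.

    The header looks like: "<JWT>;info=<x5u-url>;alg=ES256;ppt=shaken".
    This sketch reads claims only; production code must also verify
    the ES256 signature against the certificate named by x5u.
    """
    jwt = identity_header.split(";")[0].strip()
    _header_b64, payload_b64, _sig_b64 = jwt.split(".")

    # JWTs use unpadded base64url; restore padding before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))

    return {
        "attestation": claims.get("attest"),                 # "A", "B", or "C"
        "originating_number": claims.get("orig", {}).get("tn"),
        "destination_numbers": claims.get("dest", {}).get("tn"),
        "origination_id": claims.get("origid"),
        "signed_at": claims.get("iat"),
    }
```

An `attest` value of "A" means the signing carrier knows both the customer and their right to use the number; "B" and "C" signal progressively weaker knowledge, and downstream analytics treat them accordingly.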
For agent operators, the consequence is direct. If you want your AI agent's calls to be trusted by the human on the other end, the carrier layer has to do the verifying.
Most voice AI today is built as a Frankenstack: four to six vendors stitched together to handle a single call. Telephony from one provider. Speech-to-text from another. An LLM from a third. Text-to-speech from a fourth. Orchestration glue from a fifth. Each boundary adds 30 to 80 milliseconds of network overhead and a separate failure domain. When the agent breaks at 2am, the customer becomes the debugger, filing tickets with every vendor while each one points at the next.
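The arithmetic compounds quickly: with five vendors in the path, a single conversational turn crosses at least four boundaries, and at 30 to 80 milliseconds each that is roughly 120 to 320 milliseconds of pure transport overhead before any model does its work.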
For deepfake defense, the Frankenstack has a specific weakness. STIR/SHAKEN attestation lives at the telephony layer. The STT, LLM, and TTS vendors never see it. The orchestration layer cannot enforce it. If the receiving network downgrades the call to a B or C attestation because the originating provider is a reseller, no amount of clever prompt engineering at the application layer fixes it. The call gets flagged and answer rates drop, while an inbound deepfake is met by a defense fragmented across vendors, none of which sees the whole call.
This is why the carrier layer matters for AI agents in a way it did not matter for human-operated call centers. A human can recover from a "Spam Likely" label by calling back and identifying themselves. An agent cannot. If the call does not ring, the agent has no second move.
Effective deepfake defense combines four layers: verification that never relies on a single channel, carrier-signed identity on outbound calls, full observability of the call lifecycle, and agents designed as fraud controls. None of them is optional in production.
Treat voiceprints alone as insufficient authentication. For any sensitive action, step up to a second factor on a separate channel: a one-time passcode to a registered device, or a callback to a number on file. Treat any single-channel instruction, whether voice, video, or email, as unverified by default.
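As a sketch of what that step-up can look like in an agent's call flow, the Python below assumes injected callables for OTP delivery, digit capture, and human escalation; none of it is tied to a specific vendor API.

```python
import secrets

def authorize_sensitive_action(send_otp, collect_digits, escalate) -> bool:
    """Require a second factor on a separate channel before any
    sensitive action. The three callables are placeholders for your
    messaging stack, IVR digit capture, and human-escalation path."""
    otp = f"{secrets.randbelow(10**6):06d}"

    # Deliver the code to a pre-registered device, never to the live
    # call channel the (possibly cloned) voice is coming in on.
    send_otp(otp)

    entered = collect_digits(max_digits=6, timeout_s=60)
    if entered is None or not secrets.compare_digest(entered, otp):
        escalate(reason="step-up failed or timed out")
        return False
    return True
```

The important design choice is that the out-of-band channel, not the voice on the line, is what grants authorization.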
On the outbound side, confirm your calls earn A-level STIR/SHAKEN attestation. Calls signed A-level by an originating carrier—meaning the carrier has verified both the caller and the caller's right to use the number—carry the highest trust level the receiving network can grant. Calls routed through a reseller often receive a lower attestation, and a corresponding hit to answer rates, because the signing provider may not have direct knowledge of the caller or number assignment. Working with a Tier-1 carrier on a private IP network puts the signing in the right place.
You cannot detect what you cannot see. Programmable call recording, real-time media streaming, and structured call events are how fraud teams get the signal to score call risk, scan audio for synthetic markers, and hand suspicious calls to a human reviewer. This requires a voice stack that exposes the full call lifecycle programmatically, not a black-box CPaaS that hands you a transcript after the fact. Carrier-grade programmable Voice API and call control on a network you can observe is the foundation any deepfake detection layer sits on.
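As an illustration, here is a hedged Python sketch of the streaming side. It assumes a generic media-stream WebSocket delivering base64 mu-law frames (the event/payload shape is a common streaming convention, not any one vendor's exact schema) and a `score_fn` you supply from your detection model or vendor.

```python
import base64
import json

import websockets  # pip install websockets

ALERT_THRESHOLD = 0.85  # tune against your scorer's false-positive tolerance
SAMPLE_RATE = 8000      # PSTN-typical 8 kHz mu-law, 1 byte per sample

async def monitor_stream(stream_url: str, score_fn):
    """Buffer ~3-second windows of live call audio and score each
    window for synthetic-voice markers via the supplied score_fn."""
    buffer = bytearray()
    async with websockets.connect(stream_url) as ws:
        async for message in ws:
            frame = json.loads(message)
            if frame.get("event") != "media":
                continue  # skip start/stop and keepalive events
            buffer.extend(base64.b64decode(frame["media"]["payload"]))
            if len(buffer) >= SAMPLE_RATE * 3:  # ~3 s of audio
                score = score_fn(bytes(buffer))
                if score > ALERT_THRESHOLD:
                    print(f"possible synthetic voice, score={score:.2f}")
                buffer.clear()

# Run with: asyncio.run(monitor_stream(url, my_score_fn))
```

Scoring in rolling windows during the call, rather than on a post-call recording, is what turns detection into something a fraud team can act on before the money moves.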
Voice AI agents are a fraud target and a fraud control at the same time. Designed well, they enforce multi-factor verification on every call, never skip a step under social pressure, and escalate cleanly to a human when risk thresholds trip. Designed poorly, they become an attractive surface for prompt injection and impersonation. Patterns that hold up in production: enforce verification gates in server code rather than in the prompt, cap what the agent can do before step-up succeeds, and hand off to a human reviewer, with full context, the moment a risk score trips.
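The first of those patterns is worth making concrete. Below is a minimal Python sketch of gating model-requested tool calls in server code; the tool names, `session` attributes, and registry are illustrative assumptions, not any vendor's API.

```python
# Illustrative names only; adapt to your agent framework's tool layer.
SENSITIVE_TOOLS = {"reset_password", "change_payout_account", "initiate_wire"}
RISK_ESCALATION_THRESHOLD = 0.8

def dispatch_tool_call(session, registry, tool_name, args):
    """Gate every model-requested tool call in server code. Because
    the check lives outside the prompt, neither a pressured caller
    nor a prompt-injected model can talk the agent past it."""
    if tool_name in SENSITIVE_TOOLS and not session.step_up_verified:
        # Refuse regardless of what the model or the caller says.
        return {"error": "verification_required",
                "hint": "complete step-up, then retry"}
    if session.risk_score >= RISK_ESCALATION_THRESHOLD:
        session.transfer_to_human(context={"tool": tool_name, "args": args})
        return {"status": "escalated"}
    return registry[tool_name](**args)
```

The design choice to note: the LLM can ask for a sensitive tool, but only verified session state, set by the step-up flow, can unlock it.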
For teams building this, the Telnyx AI Voice Agent platform walks through model selection, multi-agent handoffs, and the webhook patterns that make these workflows concrete.
The Arup loss did not happen because the technology was unstoppable. It happened because a single employee was able to push 15 wire transfers across one day without a real out-of-band check. The fixes are mundane but effective: call back on a known number before moving money, require dual approval above a fixed threshold, cap what any single approver can release in a day, and treat urgency itself as a risk signal rather than a reason to skip steps.
These are process controls, not platform controls. They are also the ones that broke the Arup pattern in the firms that did not lose money.
For VPs of customer experience, contact center directors, CIOs, and risk leaders translating the data above into a plan: audit where your outbound calls are signed and what attestation they actually receive; retire voiceprint-only authentication for sensitive actions; instrument the call lifecycle so fraud teams can record, stream, and score audio in real time; and put the process controls that would have stopped the Arup pattern, out-of-band checks and approval caps, in place before the first incident rather than after.
None of these steps wait for a perfect detector. They require a voice stack you can program against and a fraud process that does not collapse under social engineering.
Deepfakes are not slowing down. Neither is the regulatory and customer expectation that you will catch them. The organizations that come out of this in good shape treat voice as programmable infrastructure they own and observe, not a black box that delivers minutes.
Telnyx runs telephony, STT, LLM routing, TTS, and Voice AI agents on the same carrier network. One platform. One vendor relationship. No inter-provider hops. A-level STIR/SHAKEN attestation, SOC 2 Type II, and GPUs co-located with our telecom points of presence. That combination is what makes carrier-signed identity, real-time deepfake detection, layered verification, and clean human handoff possible in production, not in a demo.
Ready to harden your voice workflows against synthetic media? Talk to our team about deploying Voice AI on infrastructure built for agents, or sign up for free and make your first call in under five minutes.