TL;DR: ServiceNow's EVA is the most rigorous independent voice agent benchmark available - but until now it only measured self-hosted Pipecat pipelines, the architecture most likely to accumulate latency through vendor hops. We extended EVA to measure hosted voice APIs under the same methodology, ran Telnyx AI Assistants through 150 live conversations, and landed above the current accuracy/experience frontier. Here's what that means and why the architecture matters.
Comparing hosted voice agent platforms has been impossible to do honestly. ElevenLabs publishes their numbers. OpenAI publishes theirs. Deepgram publishes theirs. No shared methodology, no cross-inclusion, no way for anyone outside those companies to verify anything.
EVA is different. ServiceNow built it to evaluate complete, multi-turn voice conversations end-to-end using bot-to-bot audio - two AI systems calling each other over live audio, scored on both task accuracy (EVA-A) and conversational quality (EVA-X) simultaneously. No human annotators. No component-level isolation. No self-reported results.
It's the closest thing the industry has to a neutral referee. The catch: the leaderboard only included homegrown Pipecat systems - not the hosted APIs most production voice agents actually run on.
Pipecat is a self-hosted voice agent framework. You assemble your own STT, LLM, and TTS, wire them through an orchestrator, and run calls through it. It's flexible - and it's also the architecture where latency compounds the fastest.
Every component in a Pipecat pipeline is typically a separate vendor running in a separate cloud. Audio leaves your orchestrator, crosses the public internet to an STT provider, comes back. Text leaves again to an LLM provider, comes back. The LLM output goes to a TTS provider, comes back. Each of those vendor boundaries adds 30–80ms of network overhead before any model runs.
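To make the compounding concrete, here is a back-of-the-envelope sketch of that arithmetic. It assumes the three vendor round trips (STT, LLM, TTS) happen one after another and that their overheads simply add; the function name and the zero-boundary comparison case are illustrative, not measurements.

```python
# Rough network-overhead estimate for a multi-vendor voice pipeline.
# Assumption: each vendor boundary costs 30-80 ms (the range quoted above)
# and the round trips are sequential, so the overheads add linearly.

def network_overhead_ms(boundaries: int,
                        per_boundary_ms: tuple[int, int] = (30, 80)) -> tuple[int, int]:
    """Return the (best case, worst case) added network overhead in ms."""
    low, high = per_boundary_ms
    return boundaries * low, boundaries * high

# Self-hosted pipeline: STT, LLM, and TTS are each a separate vendor boundary.
multi_vendor = network_overhead_ms(3)

# Hypothetical comparison: no inter-vendor boundaries between components.
# The client's own connection exists in both setups and is left out here.
single_endpoint = network_overhead_ms(0)

print(f"multi-vendor: {multi_vendor[0]}-{multi_vendor[1]} ms per turn")
print(f"single endpoint: {single_endpoint[0]}-{single_endpoint[1]} ms per turn")
```

Under those assumptions, three vendor boundaries add roughly 90–240 ms to every conversational turn before any model has done any work.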
Hosted voice APIs - Telnyx AI Assistants, ElevenLabs Conversational AI, OpenAI Realtime API, Deepgram Voice Agent, Google Gemini Live - collapse those hops. The STT, LLM, and TTS live behind a single endpoint. You connect once, over SIP or WebSocket, and the internal traffic between components never leaves the provider's infrastructure.
In Telnyx's case, that infrastructure is our own carrier network, with inference co-located at the same facilities where calls terminate. The network hops Pipecat pipelines fight aren't something we optimize - they're something our architecture doesn't introduce in the first place.
That's the architectural bet we wanted tested against a benchmark we didn't write.
EVA's evaluation pipeline was designed for full visibility into the agent - intermediate states, component timings, direct access to model outputs. Hosted APIs don't expose that surface area, which is why the leaderboard excluded them.
Our engineering team built an extension that lets EVA drive any hosted voice API the same way it drives a Pipecat pipeline - using the same scenarios, the same user simulator, the same scoring on both axes. The contribution is upstream at ServiceNow; any team can now add their hosted provider and get comparable results on the same leaderboard.
Then we ran the extension we built against the platform we built.
We ran Telnyx AI Assistants through EVA's 50 airline rebooking scenarios - three trials each, 150 conversations total. The stack: Kimi-K2.5 as the LLM, Deepgram for STT, Telnyx TTS.

The chart above is the result that matters. Every existing system on EVA's leaderboard sits on a curve where improving accuracy forces a tradeoff on experience, and vice versa. No agent dominates both axes. That tradeoff curve is the Pareto frontier - the best combinations of accuracy (EVA-A) and experience (EVA-X) any benchmarked system has achieved so far. Telnyx AI Assistants landed above it.
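For readers who want the frontier idea spelled out, here is a minimal sketch of how a Pareto frontier over two scores can be computed. The system names and numbers are hypothetical placeholders, not EVA leaderboard results, and the helper functions are illustrative rather than part of EVA.

```python
from typing import NamedTuple

class Score(NamedTuple):
    name: str
    accuracy: float    # stands in for an EVA-A style score
    experience: float  # stands in for an EVA-X style score

def dominates(a: Score, b: Score) -> bool:
    """a dominates b if it is at least as good on both axes and better on one."""
    return (a.accuracy >= b.accuracy and a.experience >= b.experience
            and (a.accuracy > b.accuracy or a.experience > b.experience))

def pareto_frontier(scores: list[Score]) -> list[Score]:
    """Keep only the systems that no other system dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]

def extends_frontier(candidate: Score, frontier: list[Score]) -> bool:
    """A new system extends the frontier if no frontier point dominates it."""
    return not any(dominates(f, candidate) for f in frontier)

# Hypothetical placeholder values, NOT real benchmark results.
systems = [
    Score("system-a", 0.80, 0.40),
    Score("system-b", 0.65, 0.60),
    Score("system-c", 0.50, 0.75),
    Score("system-d", 0.60, 0.55),  # dominated by system-b on both axes
]

frontier = pareto_frontier(systems)
print([s.name for s in frontier])
print(extends_frontier(Score("candidate", 0.70, 0.65), frontier))
```

In this toy data, `system-d` drops out because `system-b` beats it on both axes, while the other three each win on one axis and lose on the other. That is the tradeoff curve described above.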
One caveat worth naming: our benchmark setup routes audio through a client-side bridge that isn't present in production, which adds latency we don't see in real deployments. The accuracy and experience scores are still valid cross-provider comparisons, and the conversational quality that beat the frontier is representative. The latency numbers are not production-representative, so we're not publishing them here.
The full implementation is at github.com/ServiceNow/eva/issues/35. If you want to benchmark your own hosted voice stack against ours, that's the starting point.