Inference Benchmark: Which Latency Metric Should You Optimize For?

By Sonam Gupta, PhD

A head-to-head latency benchmark of Telnyx, Together.ai, and Fireworks.ai across 540 streamed requests on three frontier open-weight models.

The Takeaway

We ran 540 streamed chat completions across three inference providers (Telnyx, Together.ai, and Fireworks.ai) on three open-weight models (Kimi K2.6, GLM-5.1, and MiniMax-M2.7) from a single US-region host. Here's what matters:

  • The metric that matters depends on what you're building. For voice AI and real-time applications, time-to-first-token (TTFT) determines whether the experience works. For batch and agentic workloads, end-to-end (E2E) latency and throughput determine cost and speed. We benchmarked both.
  • First token isn't the finish line. Fireworks consistently delivers the fastest time-to-first-token on Kimi K2.6 and GLM-5.1, but Telnyx finishes faster on end-to-end latency for GLM-5.1 at long-output profiles and dominates MiniMax-M2.7 across every metric.
  • MiniMax-M2.7 runs 3-6x faster on Telnyx. On long-output workloads, Telnyx completes in 8-11 seconds. Together takes 36-50 seconds. Throughput: 125-170 tok/s vs 27-42 tok/s.
  • Together's performance is the most volatile. Together had 15 outlier cells where a single-run maximum exceeded 5x the median, including a 143-second mid-stream stall on GLM-5.1. Telnyx had 4. Fireworks had 3.
  • FP8 beats FP4 on throughput. Together runs FP4-quantized GLM and MiniMax. Telnyx runs FP8. Our FP8 delivers higher throughput than their FP4 on both models: higher precision and faster output.

TTFT vs E2E: Different Metrics for Different Workloads

Time-to-first-token (TTFT) is the most commonly cited inference benchmark. For some workloads, such as voice AI and real-time agents, it's the right metric. For others, such as batch processing and agentic chains, end-to-end latency (E2E) and throughput matter more. The question isn't which metric is better. It's which metric maps to what you're building.
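To make the three metrics concrete, here is a minimal Python sketch of how they could be derived from one streamed response. It is not the benchmark harness: `measure_stream`, `scripted_clock`, and the whitespace token counter are hypothetical stand-ins, and defining throughput as output tokens divided by E2E seconds is an assumption.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> dict:
    """Time one streamed response.

    `chunks` is any iterable that yields text as it arrives.
    Whitespace splitting is a crude stand-in for a real tokenizer.
    """
    start = clock()
    first_token_at = None
    tokens = 0
    for chunk in chunks:
        now = clock()
        if first_token_at is None and chunk:
            first_token_at = now  # first non-empty chunk marks TTFT
        tokens += count_tokens(chunk)
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no content")

    e2e_s = end - start
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "e2e_ms": e2e_s * 1000,
        "tokens": tokens,
        # Assumed definition: output tokens over total wall-clock time.
        "tok_per_s": tokens / e2e_s,
    }


def scripted_clock(times):
    """Deterministic clock for demos: returns the given times in order."""
    it = iter(times)
    return lambda: next(it)


if __name__ == "__main__":
    demo = measure_stream(
        ["Call me", " Ishmael.", " Some years"],
        clock=scripted_clock([0.0, 0.5, 0.6, 0.7, 0.7]),
    )
    print(demo)
```

In a real run, `chunks` would be the provider's streaming iterator and the default `time.perf_counter` clock would be used.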

Our benchmark found a consistent pattern: providers that win on TTFT don't always win on end-to-end latency (E2E).

The clearest example: GLM-5.1 at 10k input, 1k output.

| Provider | TTFT (p50) | E2E (p50) | Throughput |
|---|---|---|---|
| Fireworks | 1,672 ms | 40,156 ms | 31.9 tok/s |
| Together | 1,472 ms | 27,328 ms | 57.4 tok/s |
| Telnyx | 1,346 ms | 15,946 ms | 83.4 tok/s |

Fireworks delivers the first token in 1.7 seconds. But the full response takes over 40 seconds. Telnyx delivers the full response in under 16 seconds, 2.5x faster than Fireworks, 1.7x faster than Together.
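The speedup figures are plain ratios of the p50 E2E values for this profile. A short Python check, using only the numbers reported above (the `speedup` helper name is ours):

```python
# p50 E2E latency in ms for GLM-5.1 at 10k input, 1k output (from the article).
e2e_p50_ms = {"Fireworks": 40_156, "Together": 27_328, "Telnyx": 15_946}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate finishes than the baseline."""
    return baseline_ms / candidate_ms


ratios = {
    provider: round(speedup(ms, e2e_p50_ms["Telnyx"]), 1)
    for provider, ms in e2e_p50_ms.items()
    if provider != "Telnyx"
}
print(ratios)  # {'Fireworks': 2.5, 'Together': 1.7}
```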

If you're building for batch or agentic workloads, your users don't experience "first token." They experience the full answer. There, E2E is the metric that maps to user experience, and throughput is the metric that maps to cost per token at scale.

When evaluating inference providers, ask:

  • What's the E2E latency at my expected input/output sizes?
  • What's the effective throughput, not just the time to first token?
  • How stable is the distribution? A fast p50 with a 5x tail isn't "fast" in production.

Voice AI: Why TTFT Is the Metric That Matters

Voice AI is the clearest example of why TTFT matters. When a user speaks to an agent, every millisecond of first-token delay is dead air. The response doesn't stream in progressively like a chatbot's; the user is waiting for the agent to start talking.

That's why Kimi K2.6 is the model we recommend for voice and real-time applications. Its non-reasoning mode stays highly intelligent while delivering lower TTFT than GLM-5.1. If you're building voice AI, Kimi K2.6 on Telnyx is the right tool.

Model-by-Model Breakdown

MiniMax-M2.7 — Telnyx showcases end-to-end latency

This is where the gap is widest. Telnyx wins on E2E latency at every single profile, short and long output, small and large context.

Long-output workloads (1k output target):

| Profile | Telnyx E2E | Together E2E | Fireworks E2E | Telnyx Throughput | Together Throughput |
|---|---|---|---|---|---|
| 1k input, 1k output | 8,331 ms | 36,362 ms | 11,453 ms | 152 tok/s | 33 tok/s |
| 10k input, 1k output | 8,990 ms | 41,094 ms | 10,604 ms | 145 tok/s | 29 tok/s |
| 100k input, 1k output | 11,065 ms | 49,838 ms | 13,924 ms | 124 tok/s | 27 tok/s |

Telnyx completes MiniMax-M2.7 long-output requests about 4.4-4.6x faster than Together and 1.2-1.4x faster than Fireworks. At 100k input, Together takes about 50 seconds; Telnyx finishes in 11 seconds.

Short-output workloads: Same story. Telnyx E2E ranges from 1.2 to 2.3 seconds. Together ranges from 3 to 5.6 seconds, and Fireworks from 1.7 to 2.9 seconds.

The throughput gap: 125-170 tok/s on Telnyx vs 27-42 tok/s on Together. Together's FP4 quantization doesn't compensate: its throughput is a fraction of what Telnyx delivers with FP8.

Verdict: If you're running MiniMax-M2.7, the provider choice isn't close. Telnyx is faster, more consistent, and delivers 3-6x the throughput.


GLM-5.1 — Fastest Throughput on Telnyx

GLM-5.1 tells the "TTFT vs E2E" story best.

Fireworks is consistently the fastest to first token on GLM-5.1 at short contexts. But that early lead evaporates on longer outputs because Fireworks' effective throughput is dramatically lower.
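A first-order way to see why a TTFT lead evaporates is to model E2E as TTFT plus output tokens divided by throughput. The Python sketch below is an illustration under stated assumptions, not a description of any provider's internals: the model ignores reasoning tokens and queueing, the TTFT values are hypothetical, and the two rates borrow the 1k-in, 1k-out p50 throughput figures (36 and 94 tok/s) as if they were steady decode rates.

```python
def estimated_e2e_s(ttft_s: float, output_tokens: float, tok_per_s: float) -> float:
    """First-order model: time to first token plus steady generation time."""
    return ttft_s + output_tokens / tok_per_s


def breakeven_tokens(fast_start: tuple, fast_finish: tuple) -> float:
    """Output length beyond which the higher-throughput provider finishes first.

    Each argument is (ttft_s, tok_per_s). `fast_start` has the lower TTFT,
    `fast_finish` has the higher throughput.
    """
    ttft_a, rate_a = fast_start
    ttft_b, rate_b = fast_finish
    if rate_b <= rate_a:
        raise ValueError("fast_finish must have the higher throughput")
    # Solve ttft_a + n / rate_a == ttft_b + n / rate_b for n.
    return (ttft_b - ttft_a) / (1 / rate_a - 1 / rate_b)


# Hypothetical TTFTs (0.5 s vs 1.0 s); rates taken from the article's table.
quick_first_token = (0.5, 36.0)
quick_full_answer = (1.0, 94.0)

n = breakeven_tokens(quick_first_token, quick_full_answer)
print(f"higher-throughput provider wins beyond ~{n:.0f} output tokens")
```

Under these assumed numbers, a half-second TTFT lead is used up within a few dozen output tokens, which is why the lead matters for short replies and not for long ones.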

Throughput comparison (tok/s, p50):

| Profile | Telnyx | Together | Fireworks |
|---|---|---|---|
| 1k in, 100 out | 109 | 81 | 44 |
| 1k in, 1k out | 94 | 62 | 36 |
| 10k in, 100 out | 113 | 89 | 51 |
| 10k in, 1k out | 83 | 57 | 32 |
| 100k in, 100 out | 84 | 71 | 59 |
| 100k in, 1k out | 82 | 53 | 39 |

Telnyx delivers 82-113 tok/s on GLM-5.1 vs 32-59 tok/s on Fireworks. That's 2.1-2.6x the throughput at five of the six profiles and 1.4x at 100k in, 100 out. For workloads generating longer outputs, this compounds into large E2E differences:

  • 10k input, 1k output: Telnyx 15.9s vs Fireworks 40.2s
  • 100k input, 1k output: Telnyx 16.8s vs Fireworks 34.0s

Verdict: Fireworks may give you the first token faster, but Telnyx gives you the full answer faster, by a factor of 2-2.5x on production-length outputs.


Kimi K2.6 — Built for Voice and Real-Time

Kimi K2.6 is the most evenly matched. Fireworks leads TTFT consistently. E2E is closer:

| Profile | Telnyx E2E | Together E2E | Fireworks E2E |
|---|---|---|---|
| 1k in, 100 out | 1,754 ms | 1,901 ms | 1,242 ms |
| 1k in, 1k out | 10,212 ms | 28,304 ms | 11,026 ms |
| 10k in, 1k out | 10,878 ms | 14,458 ms | 9,582 ms |
| 100k in, 1k out | 13,741 ms | 23,960 ms | 12,602 ms |

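Fireworks has the edge on short-output E2E (1.2 s vs 1.8 s for Telnyx at 1k in, 100 out). On long outputs, the gap between Telnyx and Fireworks is small, roughly 8-14% in either direction, with Telnyx ahead at 1k input and Fireworks ahead at 10k and 100k. Together falls significantly behind.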

Throughput is competitive across all three providers on Kimi, with Telnyx and Fireworks trading the lead depending on profile.

Verdict: Kimi K2.6 is the model to reach for when you're building voice agents or real-time applications. Its non-reasoning mode is still highly intelligent, and it delivers lower TTFT than GLM-5.1, which is the metric that matters most when your users are waiting for an agent to speak. For voice AI, Kimi K2.6's lower TTFT, combined with Telnyx's regional availability and data sovereignty, makes Telnyx the clear choice.


The Reliability Question

Latency averages tell one story. Tail behavior tells another.

We flagged every cell where a single run exceeded 5x the cell's median:

| Provider | Outlier cells (max > 5x median) | Worst single event |
|---|---|---|
| Together | 15 | 143-second mid-stream stall (GLM-5.1) |
| Telnyx | 4 | 36.7s E2E on MiniMax 100k input (median: 2.3s) |
| Fireworks | 3 | 12.1s TTFT on Kimi 100k input (median: 1.2s) |
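The outlier rule is simple enough to state in code. The Python sketch below flags a cell when its slowest run exceeds 5x the cell median and counts flagged cells per provider. The function names and the per-run values are hypothetical; only the 36.7 s maximum and the 2.3 s median echo figures reported above, and which metric (TTFT or E2E) a cell holds is left to the caller.

```python
from collections import Counter
from statistics import median


def is_outlier_cell(runs_ms, factor: float = 5.0) -> bool:
    """True if the slowest run in the cell exceeds `factor` times the median."""
    return max(runs_ms) > factor * median(runs_ms)


def outlier_cells_by_provider(cells: dict, factor: float = 5.0) -> Counter:
    """Count flagged cells per provider.

    `cells` maps (provider, model, profile) to that cell's list of run times.
    """
    counts = Counter()
    for (provider, _model, _profile), runs_ms in cells.items():
        if is_outlier_cell(runs_ms, factor):
            counts[provider] += 1
    return counts


# Illustrative data: nine runs at the median and one slow run.
cells = {
    ("Telnyx", "MiniMax-M2.7", "100k in"): [2_300] * 9 + [36_700],
    ("Telnyx", "GLM-5.1", "1k in, 100 out"): [1_000, 1_100, 1_200, 4_000],
}
print(outlier_cells_by_provider(cells))
```

With only 10 runs per cell, one slow request is enough to flag the cell, so the counts are a coarse stability signal, not a tail-latency estimate.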

The Together GLM-5.1 event: On a 100k-input, 1k-output request, the stream produced 423 chunks over 194 seconds. A single 143-second gap between chunks 363 and 364 accounts for most of that time, and streaming resumed normally afterward. This wasn't a connection issue: data flowed on both sides of the gap. It was a mid-stream stall inside Together's infrastructure.
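A stall like this is easy to surface if the client records an arrival time for every chunk. Here is a minimal Python sketch, assuming per-chunk timestamps are available. The `largest_gap` helper is hypothetical, and the timeline is synthetic: it is built only to match the reported shape of the event (423 chunks, about 194 seconds, one 143-second gap after chunk 363), not taken from the raw data.

```python
def largest_gap(arrival_s):
    """Return (gap_seconds, chunk_number_before_gap), with chunks numbered from 1."""
    if len(arrival_s) < 2:
        raise ValueError("need at least two chunk timestamps")
    gaps = [later - earlier for earlier, later in zip(arrival_s, arrival_s[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return gaps[i], i + 1


def synthetic_timeline(chunks=423, total_s=194.0, stall_s=143.0, stall_after=363):
    """Evenly spaced chunks with one long pause after chunk `stall_after`."""
    normal_step = (total_s - stall_s) / (chunks - 2)  # all intervals except the stall
    times, t = [], 0.0
    for number in range(1, chunks + 1):
        times.append(t)
        t += stall_s if number == stall_after else normal_step
    return times


gap_s, before = largest_gap(synthetic_timeline())
print(f"largest gap: {gap_s:.0f}s between chunks {before} and {before + 1}")
```

Logging the largest inter-chunk gap alongside TTFT and E2E separates a slow-but-steady stream from one that froze partway through.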

For a chatbot, a 143-second pause is a broken experience. For an agent making sequential LLM calls, it's a cascading delay. For a voice AI pipeline, it's a dropped call.

Also notable: Together's FP4 quantization was expected to deliver throughput advantages over FP8. It didn't. On both GLM-5.1 and MiniMax-M2.7, Together's FP4 delivered lower throughput than Telnyx's FP8.


What This Means for Provider Choice

| If you care about... | Choose... | Why |
|---|---|---|
| MiniMax-M2.7 performance | Telnyx | 3-6x faster E2E, 3-6x throughput vs Together |
| GLM-5.1 throughput | Telnyx | Roughly 2x throughput vs Fireworks at most profiles |
| Voice AI and real-time | Telnyx | Kimi K2.6 has lowest TTFT on our platform + regional availability + data sovereignty |
| Production reliability | Telnyx or Fireworks | Together had 15 outlier cells vs 4 and 3 |
| Long-output workloads | Telnyx | TTFT advantage doesn't carry through to E2E on competitors |
| Regional availability | Telnyx | Serverless in US, EU, APAC (Dubai + São Paulo coming) |
| Data sovereignty | Telnyx | In-region compute by default; competitors are US-concentrated |
| Kimi K2.6 TTFT vs E2E | Fireworks | Fireworks leads raw TTFT, Telnyx is within 10-15% on E2E; voice AI ecosystem tilts toward Telnyx |

Methodology

  • 540 streamed chat completions across 3 providers × 3 models × 6 prompt profiles × 10 runs per cell (538 successful)
  • Models: Kimi K2.6, GLM-5.1, MiniMax-M2.7 (Telnyx FP8, Together FP4, Fireworks unverified)
  • Prompt profiles: Literary analysis tasks over Moby-Dick excerpts (public domain). Three input sizes (~1k, ~10k, ~100k tokens) × two output targets (~100, ~1k tokens)
  • Region: US. EU and APAC benchmarks coming soon.
  • Controls: Streaming mode, temperature 0.0, sequential round-robin, 5-request warm-up per provider/model pair
  • Metrics: TTFT (ms), E2E latency (ms), effective throughput (tok/s)
  • n = 10 is pilot-scale. p95/p99 are not reported. Results are directional, not definitive.
  • Reasoning enabled on all models by default. A reasoning-disabled follow-up is planned.
  • Full data and methodology available for reproducibility.
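As a sanity check on the request count, the benchmark matrix can be enumerated in a few lines of Python. This is a sketch of one plausible reading of "sequential round-robin" (each pass visits every cell once before any cell repeats), which is an assumption; the harness itself may order requests differently, and `build_schedule` is a hypothetical name.

```python
from itertools import product

PROVIDERS = ["Telnyx", "Together.ai", "Fireworks.ai"]
MODELS = ["Kimi K2.6", "GLM-5.1", "MiniMax-M2.7"]
INPUT_SIZES = ["~1k", "~10k", "~100k"]   # input tokens
OUTPUT_TARGETS = ["~100", "~1k"]         # output token targets
RUNS_PER_CELL = 10


def build_schedule():
    """One entry per request: (run_index, provider, model, input_size, output_target).

    Assumed ordering: every cell is visited once per pass, so the runs of
    any single cell are spread across the whole benchmark window.
    """
    cells = list(product(PROVIDERS, MODELS, INPUT_SIZES, OUTPUT_TARGETS))
    return [(run, *cell) for run in range(RUNS_PER_CELL) for cell in cells]


schedule = build_schedule()
print(len(schedule))  # 540
```

Interleaving cells this way keeps a transient slowdown at one provider from landing entirely on a single cell's 10 runs.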

This benchmark was conducted on April 23, 2026. Results reflect provider performance at that time. Inference infrastructure changes frequently, so we recommend running your own benchmarks for production decisions. Raw data and methodology are available on request.