Inference Benchmark: Which Latency Metric Should You Optimize For?

A head-to-head latency benchmark of three leading inference providers across 540 streamed requests.

By Sonam Gupta, PhD

The Takeaway

We ran 540 streamed chat completions across three inference providers (Telnyx, Together.ai, and Fireworks.ai) on three open-weight models (Kimi K2.6, GLM-5.1, and MiniMax-M2.7) from a single US-region host. Here's what matters:

  • The metric that matters depends on what you're building. For voice AI and real-time applications, time-to-first-token (TTFT) determines whether the experience works. For batch and agentic workloads, end-to-end (E2E) latency and throughput determine cost and speed. We benchmarked both.
  • First token isn't the finish line. Fireworks consistently delivers the fastest time-to-first-token on Kimi K2.6 and GLM-5.1, but Telnyx finishes faster on end-to-end latency for GLM-5.1 at long-output profiles and dominates MiniMax-M2.7 across every metric.
  • MiniMax-M2.7 runs 3-6x faster on Telnyx than on Together. On long-output workloads, Telnyx completes in 8-11 seconds; Together takes 36-50 seconds. Throughput is 125-170 tok/s on Telnyx vs 27-42 tok/s on Together.
  • Together's performance is the most volatile. It had 15 outlier cells where the single-run maximum exceeded 5x the median, including a 143-second mid-stream stall on GLM-5.1. Telnyx had 4. Fireworks had 3.
  • FP8 beats FP4 on throughput. Together runs FP4-quantized GLM and MiniMax. Telnyx runs FP8. Our FP8 delivers higher throughput than their FP4 on both models: higher precision and faster output.

TTFT vs E2E: Different Metrics for Different Workloads

Time-to-first-token (TTFT) is the most commonly cited inference benchmark. For some workloads, such as voice AI and real-time agents, it's the right metric. For others, such as batch processing and agentic chains, end-to-end latency (E2E) and throughput matter more. The question isn't which metric is better. It's which metric maps to what you're building.

Our benchmark found a consistent pattern: providers that win on TTFT don't always win on end-to-end latency (E2E).

The clearest example: GLM-5.1 at 10k input, 1k output.

| Provider | TTFT (p50) | E2E (p50) | Throughput |
|---|---|---|---|
| Fireworks | 1,672 ms | 40,156 ms | 31.9 tok/s |
| Together | 1,472 ms | 27,328 ms | 57.4 tok/s |
| Telnyx | 1,346 ms | 15,946 ms | 83.4 tok/s |

Time to first token differs by only about 0.3 seconds across the three providers (1.3 to 1.7 seconds). The full response is a different story: Fireworks takes over 40 seconds, while Telnyx finishes in under 16 seconds, 2.5x faster than Fireworks and 1.7x faster than Together.
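The speedup figures follow directly from the p50 E2E column of the table above. A minimal sketch of that arithmetic (the helper name `speedup` is ours, not part of any benchmark tooling):

```python
# p50 end-to-end latency in ms for GLM-5.1 at 10k input / 1k output,
# copied from the table above.
e2e_p50_ms = {"Fireworks": 40_156, "Together": 27_328, "Telnyx": 15_946}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate finishes than the baseline."""
    return baseline_ms / candidate_ms


for provider in ("Fireworks", "Together"):
    ratio = speedup(e2e_p50_ms[provider], e2e_p50_ms["Telnyx"])
    print(f"Telnyx vs {provider}: {ratio:.1f}x faster E2E")
# prints:
# Telnyx vs Fireworks: 2.5x faster E2E
# Telnyx vs Together: 1.7x faster E2E
```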

If you're building a batch or agentic product, your users don't experience "first token." They experience the full answer. For those workloads, E2E is the metric that maps to user experience, and throughput is the metric that maps to cost-per-token at scale.

When evaluating inference providers, ask:

  • What's the E2E latency at my expected input/output sizes?
  • What's the effective throughput, not just the time to first token?
  • How stable is the distribution? A fast p50 with a 5x tail isn't "fast" in production.
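All three questions can be answered from the arrival times of streamed chunks. The sketch below times any iterator of text chunks; it is an illustration, not the harness used for this benchmark. The whitespace token counter is a stand-in for a real tokenizer, and "effective throughput" is assumed here to mean output tokens divided by total E2E seconds, a definition the article does not spell out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamTiming:
    ttft_ms: float           # request start -> first non-empty chunk
    e2e_ms: float            # request start -> last non-empty chunk
    throughput_tok_s: float  # output tokens / E2E seconds (assumed definition)


def measure_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a stream of text chunks and record latency metrics."""
    start = clock()
    first = None
    last = start
    tokens = 0
    for chunk in chunks:
        now = clock()
        if not chunk:  # skip keep-alive or empty deltas
            continue
        if first is None:
            first = now
        last = now
        tokens += count_tokens(chunk)
    if first is None:
        raise ValueError("stream produced no content")
    e2e_s = last - start
    return StreamTiming(
        ttft_ms=(first - start) * 1000,
        e2e_ms=e2e_s * 1000,
        throughput_tok_s=tokens / e2e_s if e2e_s > 0 else float("inf"),
    )
```

In practice `chunks` would be the text deltas from a provider's streaming response. Passing a fake `clock` makes the function testable without a network call.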

Voice AI: Why TTFT Is the Metric That Matters

Voice AI is the clearest example of why TTFT matters. When a user speaks to an agent, every millisecond of first-token delay is dead air. The response doesn't stream in progressively as it does in a chatbot; the user is waiting for the agent to start talking.

That's why Kimi K2.6 is the model we recommend for voice and real-time applications. Its non-reasoning mode stays highly intelligent while delivering lower TTFT than GLM-5.1. If you're building voice AI, Kimi K2.6 on Telnyx is the right tool.

Model-by-Model Breakdown

MiniMax-M2.7 — Telnyx showcases end-to-end latency

This is where the gap is widest. Telnyx wins on E2E latency at every single profile, short and long output, small and large context.

Long-output workloads (1k output target):

| Profile | Telnyx E2E | Together E2E | Fireworks E2E | Telnyx Throughput | Together Throughput |
|---|---|---|---|---|---|
| 1k input, 1k output | 8,331 ms | 36,362 ms | 11,453 ms | 152 tok/s | 33 tok/s |
| 10k input, 1k output | 8,990 ms | 41,094 ms | 10,604 ms | 145 tok/s | 29 tok/s |
| 100k input, 1k output | 11,065 ms | 49,838 ms | 13,924 ms | 124 tok/s | 27 tok/s |

Telnyx completes MiniMax-M2.7 long-output requests roughly 4.5x faster than Together and 1.2-1.4x faster than Fireworks. At 100k input, Together takes nearly 50 seconds; Telnyx finishes in 11 seconds.

Short-output workloads: Same story. Telnyx E2E ranges from 1.2 to 2.3 seconds, Together from 3 to 5.6 seconds, and Fireworks from 1.7 to 2.9 seconds.

The throughput gap: 125-170 tok/s on Telnyx vs 27-42 tok/s on Together. Together's FP4 quantization doesn't compensate: its throughput is a fraction of what Telnyx delivers with FP8.

Verdict: If you're running MiniMax-M2.7, the provider choice isn't close. Telnyx is faster, more consistent, and delivers 3-6x the throughput of Together.


GLM-5.1 — Fastest Throughput on Telnyx

GLM-5.1 tells the "TTFT vs E2E" story best.

Fireworks is consistently the fastest to first token on GLM-5.1 at short contexts. But that early lead evaporates on longer outputs because Fireworks' effective throughput is dramatically lower.

Throughput comparison (tok/s, p50):

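| Profile | Telnyx | Together | Fireworks |
|---|---|---|---|
| 1k in, 100 out | 109 | 81 | 44 |
| 1k in, 1k out | 94 | 62 | 36 |
| 10k in, 100 out | 113 | 89 | 51 |
| 10k in, 1k out | 83 | 57 | 32 |
| 100k in, 100 out | 84 | 71 | 59 |
| 100k in, 1k out | 82 | 53 | 39 |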

Telnyx delivers 82-113 tok/s on GLM-5.1 vs 32-59 tok/s on Fireworks. That's at least 2x the throughput at five of the six profiles (1.4x at 100k in, 100 out). For workloads generating longer outputs, this compounds into massive E2E differences:

  • 10k input, 1k output: Telnyx 15.9s vs Fireworks 40.2s
  • 100k input, 1k output: Telnyx 16.8s vs Fireworks 34.0s
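One way to see why a first-token lead evaporates is a simple linear model: E2E ≈ TTFT + output_tokens / throughput. The sketch below solves for the output length at which the higher-throughput provider overtakes the faster-starting one. The model is a simplification we are assuming, not the benchmark's own method. The throughput figures come from the 1k in, 1k out row above, while the two TTFT values are hypothetical placeholders chosen only for illustration.

```python
from typing import Optional


def break_even_tokens(
    ttft_a_s: float, rate_a: float, ttft_b_s: float, rate_b: float
) -> Optional[float]:
    """Output length beyond which provider B finishes before provider A.

    Assumes E2E = TTFT + tokens / rate for both providers.
    Returns 0.0 if B is never behind, or None if B never overtakes A.
    """
    head_start = ttft_b_s - ttft_a_s          # seconds A is ahead at first token
    per_token_gain = 1 / rate_a - 1 / rate_b  # seconds B gains back per token
    if head_start <= 0 and per_token_gain >= 0:
        return 0.0
    if per_token_gain <= 0:
        return None
    return head_start / per_token_gain


# Throughput (tok/s) from the table: Fireworks 36, Telnyx 94 at 1k in, 1k out.
# TTFT values (0.5 s and 1.0 s) are HYPOTHETICAL, not measured results.
tokens = break_even_tokens(ttft_a_s=0.5, rate_a=36, ttft_b_s=1.0, rate_b=94)
print(f"B overtakes A after about {tokens:.0f} output tokens")
```

Under these assumptions, even a half-second head start is recovered within a few dozen output tokens, which is why throughput dominates E2E on 1k-token outputs.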

Verdict: Fireworks may give you the first token faster, but Telnyx gives you the full answer faster, by a factor of 2-2.5x on production-length outputs.


Kimi K2.6 — Built for Voice and Real-Time

Kimi K2.6 is the most evenly matched. Fireworks leads TTFT consistently. E2E is closer:

| Profile | Telnyx E2E | Together E2E | Fireworks E2E |
|---|---|---|---|
| 1k in, 100 out | 1,754 ms | 1,901 ms | 1,242 ms |
| 1k in, 1k out | 10,212 ms | 28,304 ms | 11,026 ms |
| 10k in, 1k out | 10,878 ms | 14,458 ms | 9,582 ms |
| 100k in, 1k out | 13,741 ms | 23,960 ms | 12,602 ms |

Fireworks leads on short-output E2E (1.2 seconds vs 1.8 seconds for Telnyx at 1k in, 100 out). But on long-output profiles, the gap between Telnyx and Fireworks is small (within 10-15%, with Telnyx ahead at 1k in, 1k out), while Together falls significantly behind.

Throughput is competitive across all three providers on Kimi, with Telnyx and Fireworks trading the lead depending on profile.

Verdict: Kimi K2.6 is the model to reach for when you're building voice agents or real-time applications. Its non-reasoning mode is still highly intelligent, and it delivers lower TTFT than GLM-5.1, which is the metric that matters most when your users are waiting for an agent to speak. For voice AI, that model-level TTFT advantage, combined with Telnyx's regional availability and data sovereignty, makes Telnyx the clear choice.


The Reliability Question

Latency averages tell one story. Tail behavior tells another.

We flagged every cell where a single run exceeded 5x the cell's median:

| Provider | Outlier cells (max > 5x median) | Worst single event |
|---|---|---|
| Together | 15 | 143-second mid-stream stall (GLM-5.1) |
| Telnyx | 4 | 36.7s E2E on MiniMax 100k input (median: 2.3s) |
| Fireworks | 3 | 12.1s TTFT on Kimi 100k input (median: 1.2s) |
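The flagging rule (a cell is an outlier when its slowest run exceeds 5x the cell's median) is easy to reproduce on your own measurements. A minimal sketch, assuming each cell is a list of per-run latencies in seconds; the cell keys and run values below are made-up illustration data, not the benchmark's raw results.

```python
from statistics import median
from typing import Dict, List, Tuple

Cell = Tuple[str, str, str]  # (provider, model, profile)


def outlier_cells(
    runs_by_cell: Dict[Cell, List[float]], factor: float = 5.0
) -> Dict[Cell, float]:
    """Return {cell: max/median ratio} for cells whose max exceeds factor x median."""
    flagged = {}
    for cell, runs in runs_by_cell.items():
        mid = median(runs)
        worst = max(runs)
        if worst > factor * mid:
            flagged[cell] = worst / mid
    return flagged


# Hypothetical latencies (seconds), 10 runs per cell.
example = {
    ("provider_a", "model_x", "100k in, 100 out"):
        [2.1, 2.2, 2.3, 2.3, 2.3, 2.3, 2.4, 2.4, 2.5, 36.7],
    ("provider_b", "model_x", "100k in, 100 out"):
        [1.0, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 1.6, 4.0],
}

for cell, ratio in outlier_cells(example).items():
    print(cell, f"max is {ratio:.1f}x the median")
```

Only the first cell is flagged: its slowest run is about 16x the median, while the second cell's slowest run is roughly 3x.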

The Together GLM-5.1 event: On a 100k-input, 1k-output request, the stream produced 423 chunks over 194 seconds. Of that time, 143 seconds was a single gap between chunks 363 and 364, after which streaming resumed normally. Data flowed on both sides of the gap, so the connection itself stayed up. The pattern points to a mid-stream stall inside Together's infrastructure.

For a chatbot, a 143-second pause is a broken experience. For an agent making sequential LLM calls, it's a cascading delay. For a voice AI pipeline, it's a dropped call.
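A stall like this is invisible in TTFT and only partly visible in E2E, but it stands out when you look at the gaps between consecutive chunks. A minimal sketch, assuming you have recorded one arrival timestamp (in seconds) per chunk; the timestamps below are hypothetical and merely shaped like the event described above.

```python
from typing import List, Tuple


def longest_gap(arrival_times_s: List[float]) -> Tuple[float, int]:
    """Return (gap_seconds, index_of_chunk_after_gap) for the largest pause."""
    if len(arrival_times_s) < 2:
        raise ValueError("need at least two chunk timestamps")
    gaps = [
        (later - earlier, i + 1)
        for i, (earlier, later) in enumerate(
            zip(arrival_times_s, arrival_times_s[1:])
        )
    ]
    return max(gaps)


# Hypothetical arrival times: steady streaming, one long pause, then recovery.
timestamps = [0.0, 0.25, 0.5, 143.5, 143.75]
gap_s, chunk_index = longest_gap(timestamps)
print(f"longest pause: {gap_s:.1f}s before chunk {chunk_index}")
```

Logging this value per request, alongside TTFT and E2E, makes it possible to tell a slow but steady stream from one that froze mid-response.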

Also notable: Together's FP4 quantization was expected to deliver throughput advantages over FP8. It didn't. On both GLM-5.1 and MiniMax-M2.7, Together's FP4 delivered lower throughput than Telnyx's FP8.


What This Means for Provider Choice

| If you care about... | Choose... | Why |
|---|---|---|
| MiniMax-M2.7 performance | Telnyx | 3-6x faster E2E, 3-6x throughput vs Together |
| GLM-5.1 throughput | Telnyx | Roughly 2x throughput advantage vs Fireworks at most profiles |
| Voice AI and real-time | Telnyx | Kimi K2.6 has lowest TTFT on our platform + regional availability + data sovereignty |
| Production reliability | Telnyx or Fireworks | Together had 15 outlier cells vs 4 and 3 |
| Long-output workloads | Telnyx | Competitors' TTFT advantage doesn't carry through to E2E |
| Regional availability | Telnyx | Serverless in US, EU, APAC (Dubai + São Paulo coming) |
| Data sovereignty | Telnyx | In-region compute by default; competitors are US-concentrated |
| Kimi K2.6 TTFT vs E2E | Fireworks | Fireworks leads raw TTFT, Telnyx is within 10-15% on E2E; voice AI ecosystem tilts toward Telnyx |

Methodology

  • 540 streamed chat completions across 3 providers × 3 models × 6 prompt profiles × 10 runs per cell (538 successful)
  • Models: Kimi K2.6, GLM-5.1, MiniMax-M2.7 (Telnyx FP8, Together FP4, Fireworks unverified)
  • Prompt profiles: Literary analysis tasks over Moby-Dick excerpts (public domain). Three input sizes (~1k, ~10k, ~100k tokens) × two output targets (~100, ~1k tokens)
  • Region: US. EU and APAC benchmarks coming soon.
  • Controls: Streaming mode, temperature 0.0, sequential round-robin, 5-request warm-up per provider/model pair
  • Metrics: TTFT (ms), E2E latency (ms), effective throughput (tok/s)
  • n = 10 is pilot-scale. p95/p99 are not reported. Results are directional, not definitive.
  • Reasoning enabled on all models by default. A reasoning-disabled follow-up is planned.
  • Full data and methodology available for reproducibility.
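The request count follows from the grid in the first bullet. The sketch below enumerates that grid and builds one possible round-robin schedule. The exact ordering used in the benchmark is not stated, so rotating providers in the innermost loop is an assumption, as are the short profile labels.

```python
from itertools import product

PROVIDERS = ["Telnyx", "Together.ai", "Fireworks.ai"]
MODELS = ["Kimi K2.6", "GLM-5.1", "MiniMax-M2.7"]
INPUT_SIZES = ["1k", "10k", "100k"]  # approximate input tokens
OUTPUT_TARGETS = ["100", "1k"]       # approximate output tokens
RUNS_PER_CELL = 10


def build_schedule():
    """List every request as (run, provider, model, input_size, output_target).

    Providers alternate in the innermost loop so that consecutive requests
    hit different providers (assumed interpretation of "round-robin").
    """
    profiles = list(product(INPUT_SIZES, OUTPUT_TARGETS))  # 6 prompt profiles
    schedule = []
    for run in range(RUNS_PER_CELL):
        for model, (input_size, output_target), provider in product(
            MODELS, profiles, PROVIDERS
        ):
            schedule.append((run, provider, model, input_size, output_target))
    return schedule


schedule = build_schedule()
print(len(schedule))  # 3 providers x 3 models x 6 profiles x 10 runs = 540
```

Interleaving providers this way spreads any time-of-day effects across all three rather than concentrating them on whichever provider happens to run last.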

This benchmark was conducted on April 23, 2026. Results reflect provider performance at that time. Inference infrastructure changes frequently, so we recommend running your own benchmarks for production decisions. Raw data and methodology are available on request.