Guides and Tutorials

How to choose an LLM API for production

A practical checklist for picking an LLM API you can ship: performance, cost, safety, and integration.

By Eli Mogul

The LLM API market has fragmented. Two years ago, most production workloads ran on a single vendor. Today, you're picking between OpenAI, Anthropic, Google, xAI, DeepSeek, and a growing list of open-source providers, each with different pricing, context windows, latency profiles, and data-handling policies. Get the choice wrong, and you're either overpaying for capability you don't need or stuck with an integration that can't hold up in production.

This guide walks through what actually matters when you're moving an LLM API from prototype to production: price per token, latency under real load, context window economics, safety controls, and how the API fits into the broader stack you're building, especially if voice is part of that stack.

The other shift worth noting: increasingly, the buyer evaluating these APIs isn't you. It's the agent you're building. Agents read SDK readmes, time first-call latency, and route to the next provider when something doesn't work. Agents make these choices at machine speed, against criteria the marketing page doesn't address.

How the LLM API market shifted in 2025

The market leaders look different than they did 18 months ago. According to Menlo Ventures' 2025 State of Generative AI in the Enterprise report, Anthropic now earns roughly 40% of enterprise LLM spend, up from 24% the year before and 12% in 2023. OpenAI's share fell to 27% from 50% in 2023, while Google climbed from 7% to 21%.

Spending has accelerated even faster than the rankings. Enterprise spending on LLM APIs rose from $3.5 billion in late 2024 to $8.4 billion in the first half of 2025, more than doubling in six months as workloads moved into full production.

For buyers, the practical takeaway is simple: the field is wider, the prices are dropping, and the cost of locking yourself into a single vendor is higher than it used to be. You want an architecture that lets you swap models without rewriting your stack.

What an LLM API actually delivers (and what it doesn't)

An LLM API gives you programmatic access to a hosted large language model. You send tokens in, you get tokens back, and you pay for both. The vendor handles the GPUs, the model weights, the scaling, and (in most cases) the safety filtering.

What an LLM API doesn't give you on its own:

  • A telephony network. If you want the model to take or place phone calls, you need PSTN connectivity, number provisioning, and STIR/SHAKEN attestation. None of that ships with an LLM API.
  • Real-time voice plumbing. Token streaming is fast, but voice AI also needs speech-to-text, text-to-speech, turn detection, and a media path that holds together at sub-second latency.
  • Data sovereignty by default. Most APIs run in a handful of regions. If you need EU or APAC residency, you have to configure it.
  • Failover. If your provider has an outage, your app goes down unless you've built routing across multiple models.

Vendor-managed APIs handle the inference. Everything around it (the orchestration, the audio, the network, the failover logic) is on you. For a deeper breakdown of when self-hosting is the right call versus an API, see Telnyx's analysis of why self-hosting LLMs fails in production.

The seven things to evaluate before you commit

1. Price per token, and the levers underneath it

Headline rates matter, but the real cost depends on caching, batch discounts, and how your workload distributes between input and output. Most providers now offer at least two of these:

  • Prompt caching, which stores reused prefixes (system prompts, document context, tool definitions) and serves cached input at roughly 10% of the standard rate.
  • Batch processing, which discounts asynchronous workloads by about 50%.
  • Tiered context pricing, where prompts above 200K tokens jump to a higher rate on some flagship models.

Here's how the current flagship and mid-tier models from the major providers compare, before discounts. Pricing is accurate as of May 2026; verify with each vendor before budgeting.

| Provider | Model | Input ($/1M tokens) | Output ($/1M tokens) | Context window |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-5.5 | $5.00 | $30.00 | 1M |
| OpenAI | GPT-5.4 | $2.50 | $15.00 | 1M |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 | 1M |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| Google | Gemini 3.1 Pro | $2.00 | $12.00 | 2M |
| Google | Gemini 3 Flash | $0.50 | $3.00 | 1M |
| DeepSeek | V4 Flash | $0.14 | $0.28 | 1M |

Sources: OpenAI API pricing, Anthropic pricing documentation, Google Cloud Vertex AI pricing, and DeepSeek API pricing. Rates shift frequently, so verify before you budget. DeepSeek has been a notable pressure point on the market, cutting API prices by more than half in September 2025 when it released its sparse-attention model.

Two pricing patterns are worth flagging. First, Google's Gemini 3.1 Pro doubles its input rate above the 200K-token threshold. Second, Anthropic now bundles 1M-token context at flat rates on Opus 4.7 and Sonnet 4.6, which makes long-context budgeting more predictable than it used to be.

One pattern the table doesn't capture: vendors that own their own infrastructure can price below resellers structurally, not promotionally. Telnyx TTS starts at $0.000003/char because the synthesis runs on owned compute rather than reselling someone else's GPUs with margin. The same economics apply to inference: the more layers of the stack a vendor owns, the lower the structural cost floor.
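To make the table concrete, here's a minimal sketch of a per-request cost estimator. The list prices are copied from the table above; the model keys, the flat 90% cached-input discount, and the 50% batch multiplier are simplifying assumptions for illustration, not any vendor's exact billing logic.

```python
# Rough per-request cost estimator using the list prices in the table above.
# The ~90% cached-input discount and ~50% batch discount are the ballpark
# figures described earlier; check each vendor's pricing page for exact terms.

PRICES = {  # $ per 1M tokens: (input, output)
    "gpt-5.5": (5.00, 30.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-3-flash": (0.50, 3.00),
    "deepseek-v4-flash": (0.14, 0.28),
}

def request_cost(model, input_tokens, output_tokens,
                 cached_fraction=0.0, batch=False):
    """Estimate the dollar cost of a single request."""
    in_rate, out_rate = PRICES[model]
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    cost = (uncached * in_rate + cached * in_rate * 0.10
            + output_tokens * out_rate) / 1_000_000
    return cost * 0.5 if batch else cost

# 8K-token prompt (75% reusable prefix), 800-token answer, 1M requests/month:
per_call = request_cost("claude-sonnet-4.6", 8_000, 800, cached_fraction=0.75)
print(f"${per_call:.5f} per call, ${per_call * 1_000_000:,.0f} per month")
```

Run your own input/output mix through a model like this before committing; the ranking of providers often changes once caching and workload shape are factored in.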

2. Latency, especially if voice is involved

For text chat, a response time of a second or two is fine. For voice, it's a deal-breaker. Human conversations naturally flow with pauses of 200 to 500 milliseconds between speakers, and when AI systems exceed that window, the interaction starts to feel broken. Past about 800ms, users repeat themselves or hang up.

Voice AI latency is a budget problem. The total time between when a caller stops speaking and when they hear a response includes endpoint detection, speech-to-text, the LLM call, text-to-speech, and the network path on both ends. If your LLM inference alone takes 600ms because of geographic distance to the model's GPUs, you've already blown the budget.

This is where infrastructure topology matters. The further your audio has to travel to reach the inference layer, the more latency you accumulate, and you can't optimize that away in software: light in fiber covers roughly 1,000km every 5ms, so architecture wins. Telnyx solves this by colocating GPU infrastructure directly adjacent to its global telecom Points of Presence (PoPs), an architectural choice made specifically for real-time voice AI. You can explore the full stack on the Telnyx Inference API page.
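As a back-of-the-envelope check, the sketch below adds up a single turn's latency budget. The per-stage numbers are illustrative assumptions, not measurements of any specific provider; the point is how little of the 800ms the LLM call actually gets.

```python
# Illustrative voice-to-voice latency budget for one conversational turn.
# Stage numbers are assumptions, not benchmarks of any provider.
BUDGET_MS = 800

stages_ms = {
    "endpoint_detection": 100,      # deciding the caller has finished speaking
    "speech_to_text": 150,
    "llm_time_to_first_token": 300,
    "text_to_speech_first_audio": 100,
    "network_round_trips": 100,     # grows with distance to the inference layer
}

total = sum(stages_ms.values())
print(f"voice-to-voice: {total}ms (budget {BUDGET_MS}ms, "
      f"headroom {BUDGET_MS - total}ms)")
```

If geographic distance alone adds a few hundred milliseconds of network round trips, there is no prompt engineering that wins it back.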

3. Context window and what it actually costs

Long context is useful, but it isn't free. Flagship models now stretch to 1M or 2M tokens, which is enough to drop entire codebases or document libraries into a single prompt. Two things to watch:

Some providers charge a higher rate above a threshold (typically 200K tokens). Anthropic now bundles 1M context at flat rates on Opus 4.7 and Sonnet 4.6, which makes long-context budgeting easier. On other providers, the long-context multiplier can quietly double your bill.

If you're doing retrieval-augmented generation (RAG), aggressive retrieval can push prompts past that threshold without anyone noticing the rate change. Build the cost model before you ship.
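A quick way to catch this before it ships is to model the tier jump explicitly. The sketch below assumes a hypothetical provider that doubles its input rate above 200K tokens, per the pattern described above; the rates, threshold, and the worst-case "whole prompt gets repriced" behavior are placeholders, not any vendor's published schedule.

```python
# Hypothetical tiered input pricing: base rate up to the threshold, 2x above it.
BASE_RATE = 2.00        # $ per 1M input tokens (placeholder)
THRESHOLD = 200_000     # tokens
LONG_CONTEXT_MULT = 2.0

def input_cost(prompt_tokens):
    if prompt_tokens <= THRESHOLD:
        return prompt_tokens * BASE_RATE / 1_000_000
    # Assume the higher rate applies to the entire prompt once you cross
    # the line -- the worst case, and the one to budget for.
    return prompt_tokens * BASE_RATE * LONG_CONTEXT_MULT / 1_000_000

for tokens in (150_000, 199_000, 201_000, 400_000):
    print(f"{tokens:>7} tokens -> ${input_cost(tokens):.3f}")
```

Two thousand extra retrieved tokens around the threshold can double the cost of the request; cap retrieval depth accordingly.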

4. Safety controls, logging, and auditability

For regulated workflows (financial services, healthcare, contact centers under PCI scope), the model's capability matters less than the controls around it. Look for:

  • Per-request logging that you can export.
  • Configurable safety filters with categories you can tune.
  • Data-handling commitments, especially around whether your inputs are used for training.
  • Audit trails that survive a compliance review (a logging sketch follows this list).
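To make the first and last items concrete, here's a minimal sketch of a per-request audit record written as JSON Lines. The fields and file path are illustrative assumptions; a real deployment would add retention policies, access controls, and export into whatever system your auditors actually use.

```python
import hashlib
import json
import time
from pathlib import Path

AUDIT_LOG = Path("llm_audit.jsonl")  # illustrative path

def log_request(model, prompt, response_text, input_tokens, output_tokens):
    """Append one audit record per LLM call; hash the prompt rather than
    storing raw content so the log itself isn't a data-handling liability."""
    record = {
        "ts": time.time(),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "response_chars": len(response_text),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```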

If voice is in scope, the compliance picture extends past the model. STIR/SHAKEN attestation, call recording, telecom data residency, and originating-carrier accountability all live at the carrier layer, not in the LLM API. Telnyx is a licensed carrier in 40+ countries and, as the originating carrier on US calls, signs eligible traffic at A-level STIR/SHAKEN, meaning we verify both the caller and the caller's right to use the number. Voice agent compliance is bundled with the network, not sold as a $50K/year add-on.

The major vendors all publish their safety and data policies now, but the details vary, and the defaults aren't always what you want. Read the documentation; don't trust the marketing page.

5. SLAs and rate limits

Production traffic is bursty. A successful product launch can 10x your token volume overnight. Check:

  • Stated SLA for uptime, and what credits look like if they miss it.
  • Rate limits at your tier, and how quickly you can graduate to a higher tier.
  • Whether long-context requests have separate rate limits (they often do).
  • How the provider handles capacity during peak periods (some queue, some return errors).

For mission-critical applications, the answer is usually to route across two providers, with one as primary and one as failover.
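Here's a minimal sketch of that failover pattern using two OpenAI-compatible endpoints. The base URLs, model names, and environment variables are placeholders; a production version would add retries with backoff and health checks rather than failing over on any exception.

```python
import os
from openai import OpenAI  # pip install openai

# Two OpenAI-compatible providers; URLs and model names are placeholders.
PROVIDERS = [
    {"name": "primary",
     "client": OpenAI(api_key=os.environ["PRIMARY_API_KEY"],
                      base_url="https://api.primary.example/v1"),
     "model": "primary-flagship"},
    {"name": "failover",
     "client": OpenAI(api_key=os.environ["FAILOVER_API_KEY"],
                      base_url="https://api.failover.example/v1"),
     "model": "failover-mid-tier"},
]

def complete(messages):
    """Try the primary provider; fall back to the secondary on any failure."""
    last_error = None
    for provider in PROVIDERS:
        try:
            resp = provider["client"].chat.completions.create(
                model=provider["model"], messages=messages, timeout=10)
            return provider["name"], resp.choices[0].message.content
        except Exception as exc:  # rate limit, outage, timeout, etc.
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```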

6. Open-source flexibility

Proprietary models from OpenAI and Anthropic dominate the leaderboards, but open-source models are closing the gap and cost a fraction of frontier APIs. DeepSeek's V3 paper describes a 671-billion-parameter open-weight model that achieved frontier-class results on a fraction of the training compute of comparable Western labs.

The case for open-source is straightforward: no per-token lock-in, the ability to fine-tune on your own data, and the option to run inference where your data already lives. The case against is operational: GPUs are expensive, MLOps is hard, and most teams underestimate the work involved. The broader trend is that proprietary models keep getting cheaper too, with Reuters reporting GPT-4o mini at $0.15/$0.60 per million tokens when it launched, which compresses the savings gap.

Cloudflare's Workers AI runs open-source inference at edge cities, which sounds like the right architecture for low-latency voice. But the agent also needs to make a phone call, send an SMS, or verify identity. Workers AI sits next to compute, not next to a carrier network. For voice agents, edge compute matters less than carrier-edge compute: GPUs colocated with the telephony switch, not in a separate cloud region.

A middle path is to use an open-source LLM through a managed API. That's the model behind the Telnyx LLM Library, which gives you access to 20+ open-source models through a single integration, hosted on infrastructure built for low-latency inference. For a deeper look at the model options, see the guide to the 6 best open-source LLMs in 2026.

7. Integration surface and vendor lock-in

How easily can you swap models? The OpenAI API format has become a de facto standard, and most serious vendors now offer OpenAI-compatible endpoints. That matters because it means your application code doesn't need to change when you want to test a new model. xAI's Grok API, for example, is explicitly designed to be compatible with the OpenAI and Anthropic SDKs, so migrating is a matter of generating a new API key and changing a URL.
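In practice, "OpenAI-compatible" means the application code stays put and only the base URL, API key, and model id change. A minimal sketch, with placeholder endpoints and model names rather than any vendor's documented values:

```python
import os
from openai import OpenAI  # pip install openai

def make_client(base_url, api_key_env):
    # Same application code, different provider: only these two values change.
    return OpenAI(base_url=base_url, api_key=os.environ[api_key_env])

# Placeholders; substitute each vendor's documented endpoint and model id.
client = make_client("https://api.vendor-a.example/v1", "VENDOR_A_API_KEY")
resp = client.chat.completions.create(
    model="vendor-a-model",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(resp.choices[0].message.content)
```

The same pattern underpins the failover sketch earlier: if every provider in your rotation speaks this interface, swapping or routing between them is configuration, not a rewrite.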

There's a related question worth asking: how easily can you swap the entire stack? An LLM API is one component in a voice AI pipeline that typically includes a telephony vendor, an STT provider, a TTS vendor, and an orchestration layer. Each one is a vendor boundary, a margin layer, and a separate dashboard to debug. The OpenAI-compatible standard helps with model portability, but it does nothing about the four other vendors you're managing.

Telnyx supports model portability directly: you can run Telnyx Voice AI Assistants with any OpenAI-compatible LLM, whether that's a frontier proprietary model, an open-source model from the LLM Library, or your own self-hosted endpoint. The difference is that the telephony, the inference, the call control, and the orchestration all live on the same platform underneath. One API key, one vendor relationship, one team to call when something breaks.

Cost models that scale (and ones that don't)

A few patterns to budget around:

Cache aggressively. Prompt caching is the single biggest cost lever for most production workloads. Combining prompt caching (about 90% savings on cached input) and batch API (50% off) can compress effective costs by an order of magnitude on cache-heavy workloads.
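The arithmetic behind that claim, under the idealized assumption that every input token hits the cache and the workload tolerates batch latency:

```python
# Idealized compression on input tokens: full cache hit plus batch discount.
cache_multiplier = 0.10   # cached input billed at ~10% of the standard rate
batch_multiplier = 0.50   # batch API at ~50% off
effective = cache_multiplier * batch_multiplier
print(f"effective input rate: {effective:.0%} of list ({1/effective:.0f}x cheaper)")
```

Real workloads land somewhere below that 20x ceiling, since output tokens aren't cached and cache hit rates are never 100%.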

Right-size the model. Not every request needs the flagship. Route classification, extraction, and summarization to smaller models, and reserve the expensive tier for the requests that genuinely need it. Done well, this cuts spend by 30 to 50% with no quality loss on the routed tasks.
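A minimal sketch of that routing pattern; the task labels and model names are illustrative placeholders, not a prescribed taxonomy.

```python
# Route cheap, well-bounded tasks to a small model; keep the flagship
# for open-ended reasoning. Model names are placeholders.
MODEL_FOR_TASK = {
    "classification": "small-fast-model",
    "extraction": "small-fast-model",
    "summarization": "mid-tier-model",
    "open_ended": "flagship-model",
}

def pick_model(task_type: str) -> str:
    # Default to the flagship for anything unrecognized.
    return MODEL_FOR_TASK.get(task_type, "flagship-model")

assert pick_model("classification") == "small-fast-model"
assert pick_model("multi_step_planning") == "flagship-model"
```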

Watch the output-to-input ratio. Output tokens cost 3x to 8x more than input across the major providers. Verbose responses are expensive. Trim them in the prompt.

Account for reasoning tokens. Reasoning models generate internal chain-of-thought tokens that are billed at output rates even though you never see them. Budget for 2x to 5x the visible output volume on complex tasks.

How voice changes the LLM API calculus

If you're building a chatbot, almost any frontier API works. If you're building a Voice AI Agent that handles real phone calls, the constraints tighten.

Latency drives this. Sub-800ms voice-to-voice is the threshold for natural conversation, and the LLM call is only one piece of that budget.

The audio plumbing has to exist. PSTN connectivity, SIP, call control, STIR/SHAKEN, all of it. An LLM API alone gives you tokens, not phone calls.

The concurrency profile is different. Voice traffic is unpredictable and bursty, and the infrastructure has to scale without dropping calls.

There's a fourth constraint that only shows up in production: when the agent goes down at 2am, who actually fixes the call? In a stack assembled from a telephony vendor plus STT plus LLM plus TTS plus orchestrator, the telephony vendor blames the LLM provider, the LLM provider blames the TTS, the TTS vendor blames orchestration, and the customer becomes the debugger. The architecture that eliminates this is the same one that eliminates network hops: one infrastructure, one team, one trace.

This is where most generic LLM APIs hit their limits in voice deployments. Telnyx is the only platform that combines carrier-grade telephony infrastructure with low-latency AI in a single stack. You provision numbers, configure SIP trunks, and run AI Agents on the same platform, without assembling three or four vendors.

What's worth doing before you commit

Before you sign anything beyond a free tier, run the following:

  • Build a representative workload. Pick five to ten prompts that reflect your real production traffic and time them on each candidate API (a benchmark sketch follows this list).
  • Measure end-to-end latency from the same region your users live in.
  • Read the data policy. Specifically the clauses on training, retention, and regional residency.
  • Confirm rate limits at the tier you're paying for, not at the demo tier.
  • Model the cost at 10x and 100x your current volume. Pricing tiers shift, and what's cheap at 1M tokens per month may be painful at 100M.
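A minimal harness for the first two items. `call_api` is an assumed stand-in for whichever vendor SDK you're evaluating, and the numbers only mean something if you run it from the region your users are in, not your laptop.

```python
import statistics
import time

def benchmark(call_api, prompts, runs=3):
    """Time a candidate API on representative prompts.

    call_api: a function taking a prompt string and returning response text --
    wrap whichever vendor SDK you're evaluating.
    """
    latencies = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            call_api(prompt)
            latencies.append(time.perf_counter() - start)
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": statistics.quantiles(latencies, n=20)[18] * 1000,
        "max_ms": max(latencies) * 1000,
    }
```

Compare p95, not just the median; the tail is what your angriest users experience.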

If voice is in scope, run the same evaluation against a stack that already has the telephony layer in place. The savings on integration time are often larger than the savings on token price. As VentureBeat noted in its coverage of the ongoing API price war, token costs are still falling; the durable differentiation isn't the per-token rate but the surrounding infrastructure. Most stacks get worse as you scale globally because each region adds vendor hops. Architectures that own the underlying network get better as they expand.

Build voice AI on infrastructure designed for it

[Figure: AI Agent Infrastructure chart]

The LLM API is one layer in a much larger system. AI agent infrastructure has three layers: global communications (carrier network, identity, compliance), agent platform (intelligence, orchestration, voice AI), and edge compute (inference, storage, low-latency response). Most vendors give you one. The model providers give you inference. The telephony vendors give you connectivity. Assembling the three layers across four or five vendors creates the Frankenstack, and it fails in three ways: latency compounds across vendor boundaries, reliability degrades (five vendors each at 99.9% uptime compound to roughly 99.5%), and when something breaks at 2am, no one owns the call.

Telnyx unifies all three layers on one infrastructure. Voice is the wedge, not the ceiling. The same platform handles SMS, email, and async agent operations as your stack grows. See the LLM Library or talk to our team about what your AI agent infrastructure should look like.
