
Run Telnyx Voice AI Assistants with Any OpenAI-Compatible LLM

By Abhishek Sharma

Most voice AI platforms lock you into their inference stack. If you already run models on Bedrock or your own GPUs, integrating them usually means rebuilding the whole telephony layer from scratch.

We removed that constraint. You can now run Telnyx Voice AI Assistants on any OpenAI-compatible model endpoint without losing carrier-grade voice performance.

What’s now supported

You can point your AI Assistant at:

  • AWS Bedrock - Use Anthropic, Meta, Cohere, or Mistral models under your existing AWS contract
  • Azure OpenAI Service - Route to Microsoft-hosted GPT models where regional or compliance rules apply
  • Self-hosted inference servers - Run vLLM, sglang, or TGI on your own GPUs
  • Custom fine-tuned models - Use models trained on proprietary data such as support logs, product catalogs, or internal documentation
  • Any OpenAI-compatible endpoint - Platforms like Baseten, Replicate, or together.ai work out of the box

The only requirement: your endpoint must implement the OpenAI Chat Completions API. If it accepts /v1/chat/completions requests, it will work.
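
For a quick compatibility check, any standard OpenAI client can talk to such an endpoint directly. The sketch below uses placeholder values for the base URL, API key, and model name; substitute your own deployment's details:

```python
# Minimal compatibility check: an endpoint that accepts this request
# should work as a custom LLM for a Telnyx AI Assistant.
# The base URL, API key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # your OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="my-fine-tuned-model",  # whatever model your server exposes
    messages=[
        {"role": "system", "content": "You are a voice support agent."},
        {"role": "user", "content": "Where is my order?"},
    ],
)

print(response.choices[0].message.content)
```

If this request succeeds, the same base URL, API key, and model name are what you'll enter when configuring the assistant.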

How it works

In the Mission Control Portal, create or edit an AI Assistant. Under the Agent tab, enable Use Custom LLM and provide:

  1. Base URL - The public endpoint for your inference server
  2. API Key - Stored securely as an Integration Secret
  3. Model Name - Auto-populated if your endpoint supports /models, otherwise entered manually

Telnyx validates the connection before saving. Once configured, your assistant routes LLM calls to your endpoint instead of Telnyx’s GPU clusters. Voice synthesis, speech recognition, and call control still run on Telnyx infrastructure.
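
If you want to run a similar check yourself before saving, the /models listing that drives the auto-populated Model Name field is easy to probe. The sketch below reuses the same placeholder base URL and key:

```python
# Pre-flight check before configuring the assistant: confirm the endpoint
# exposes /v1/models (used to auto-populate the Model Name field).
# Base URL and API key are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",
    api_key="YOUR_API_KEY",
)

for model in client.models.list():
    print(model.id)
```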

You can test immediately in the portal or deploy to production numbers.

The latency trade-off

Our default architecture delivers sub-second round-trip time by running inference on GPUs co-located with our telephony core. Using an external LLM adds network hops.

If your inference server is near a Telnyx PoP (for example, us-east-1), expect an additional 20-50 ms. Endpoints that are farther away or under variable load may add 100-300 ms per turn. That's still faster than most stitched-together voice stacks, but it's no longer guaranteed sub-200 ms.
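
To see where your own endpoint falls in that range, you can time a short completion against it. The sketch below (placeholder URL, key, and model) measures the full network-plus-generation round trip, so treat it as an upper bound on the added hop:

```python
# Rough per-turn latency estimate for a custom LLM endpoint.
# Times one short chat completion; compare the result with the
# 20-50 ms and 100-300 ms ranges discussed above.
# Base URL, API key, and model name are placeholders.
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
client.chat.completions.create(
    model="my-fine-tuned-model",
    messages=[{"role": "user", "content": "Say 'ok'."}],
    max_tokens=5,
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Round trip to LLM endpoint: {elapsed_ms:.0f} ms")
```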

When response speed is critical, use Telnyx’s built-in inference.

When model control, compliance, or cost take priority, bring your own endpoint.

When this matters

This option fits when:

  • You have existing Bedrock, Azure, or GCP credits and want to use them for inference
  • Inference must stay within a specific region for compliance
  • Your models are fine-tuned on proprietary data that can’t leave your environment
  • You’re running high-volume workloads on self-managed GPU infrastructure

If none of those apply, Telnyx’s built-in LLM library (Llama, Mistral, Gemini, GPT-4) remains the fastest path.

Availability

Custom LLM support is available now for all Telnyx AI Assistant users.

Documentation: developers.telnyx.com/docs/inference/ai-assistants/custom-llm

Whether you’re optimizing for latency, cost, or compliance, you can now choose exactly where your model runs while everything else stays on Telnyx.
