We built Telnyx Inference as a platform where developers can easily harness the power of AI with fast, contextualized inference.

In April 2023, Telnyx held its first internal hackathon, Open AprIl. All the projects showcased Telnyx employees’ creativity and drive to innovate, and several showed promise for further development.
While we were excited about so many of the initiatives we saw during Open AprIl, the entire event made us realize we needed to double down on AI as a company. Today, Telnyx VP of Engineering James Whedbee takes us through how Open AprIl sparked the idea for Telnyx’s new Inference product.
During Open AprIl, two things became clear about Telnyx’s approach to projects using artificial intelligence (AI):
1. There was a layer of abstraction missing. As an Open AprIl coordinator, I learned a lot by watching our teams reinvent many of the same building blocks to build their applications on top of OpenAI, adding time to each project.
2. We weren’t in control. Our applications were wholly dependent on OpenAI in terms of availability, latency, new features, and cost.
With that in mind, we started to build Telnyx Inference to provide a platform where developers can stay on the bleeding edge, at scale, to build their AI experiences with both proprietary and open-source models.
Building on top of our GPU infrastructure and Cloud Storage product, our AI team has released several APIs to support retrieval-augmented generation (RAG) pipelines at scale. Here's what Telnyx customers—and our engineering teams—can now do.
Telnyx Cloud Storage has a simple S3-compatible interface for uploading thousands of documents. Built for RAG pipelines at scale, we can offer AI-enabled document storage and retrieval at a tenth of the price of OpenAI's recently released Assistants API.
Our GPU network will then run the latest open-source embedding models over your documents, and will provide your unique context to our similarity search and inference APIs within seconds.
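As a rough sketch, that workflow can be driven from a few lines of Python. The storage endpoint, credentials, bucket name, and embedding API path below are placeholders and assumptions rather than exact values; check the Telnyx API documentation before relying on them.

```python
import boto3
import requests

# Upload a document to Telnyx Cloud Storage over its S3-compatible interface.
# The endpoint URL and credentials are placeholders; use the values from your
# Telnyx portal.
s3 = boto3.client(
    "s3",
    endpoint_url="https://us-central-1.telnyxstorage.com",  # placeholder endpoint
    aws_access_key_id="YOUR_STORAGE_KEY_ID",
    aws_secret_access_key="YOUR_STORAGE_KEY_SECRET",
)
s3.upload_file("support-handbook.pdf", "my-rag-bucket", "support-handbook.pdf")

# Ask the embeddings API to index the bucket so its contents become available
# to similarity search and inference. Path and payload shape are illustrative.
resp = requests.post(
    "https://api.telnyx.com/v2/ai/embeddings",
    headers={"Authorization": "Bearer YOUR_TELNYX_API_KEY"},
    json={"bucket_name": "my-rag-bucket"},
)
resp.raise_for_status()
print(resp.json())
```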
Another thing that excited our AI squad was the speed of development of new open-source models. Every single week, there’s something new and exciting to try. We wanted to build a platform that could be easily extended to take advantage of the staggering breadth of open-source models in the market.
While our public API currently provides out-of-the-box support for some of the most popular models like Llama 2 and Mistral, our team is hard at work to provide support for the full range of open-source models to run on Telnyx GPU.
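To give a sense of how little code this takes, here's a hedged sketch of a chat completion against one of those models. It assumes an OpenAI-style request and response shape; the exact URL, model identifier, and payload fields may differ from what's in the docs.

```python
import requests

# A minimal chat completion against a hosted open-source model.
# URL, model name, and payload shape are assumptions based on common
# OpenAI-compatible APIs; consult the Telnyx docs for exact values.
response = requests.post(
    "https://api.telnyx.com/v2/ai/chat/completions",
    headers={"Authorization": "Bearer YOUR_TELNYX_API_KEY"},
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "messages": [
            {"role": "user", "content": "Summarize retrieval-augmented generation in one sentence."}
        ],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```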
As I said, we built the Telnyx Inference product to address the real-world problems our teams were facing building their own AI applications. By using our own Inference APIs, we can support these use cases with higher efficiency and lower costs, while accelerating the development of our product for others.
Our NOC engineering team has been busy building a chatbot to enhance our customer support experience with AI. The API we built to support a RAG pipeline was directly informed by our team's needs.
Anyone who's worked with embedding documents knows there are an overwhelming number of choices to make, from which embedding model to use to how documents are chunked and which similarity metric is applied, all of which could impact your chatbot's performance.
To help folks get started quickly, our embedding and inference API makes reasonable default choices for everything. All you have to do is upload files to a Telnyx Storage bucket, embed them, and start talking to a language model. When you add new files, they’ll be embedded automatically.
This process works for a first pass, but if you’re as serious about realizing the full potential of your chatbot as our NOC engineering team is, you’ll want control over these decisions. Between our embedding and similarity search APIs, we’ve exposed full control to our customers.
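As a sketch of what that control looks like, here's a hypothetical similarity search request. The path, parameter names, and response shape are illustrative rather than exact.

```python
import requests

# Query the embedded bucket directly to see which document chunks would be
# retrieved for a given question. Path and parameter names are illustrative.
resp = requests.post(
    "https://api.telnyx.com/v2/ai/embeddings/similarity-search",
    headers={"Authorization": "Bearer YOUR_TELNYX_API_KEY"},
    json={
        "bucket_name": "my-rag-bucket",
        "query": "How do I port a phone number to Telnyx?",
        "num_of_docs": 3,  # retrieve the top 3 matching chunks
    },
)
resp.raise_for_status()
for match in resp.json().get("data", []):
    print(match)
```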
Another request from our team was to implement a custom Intercom loader to improve search results. They noticed the default loaders provided by open-source libraries were insufficient, ultimately creating their own implementation to achieve their desired results. A version of that implementation is now available to all customers via our custom Intercom loader.
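To give a sense of what a custom loader involves, here's a simplified sketch, not the Telnyx implementation: it pulls articles from Intercom's REST API, strips the HTML, and returns plain-text documents ready for embedding.

```python
import re
import requests

def load_intercom_articles(access_token: str) -> list[dict]:
    """Fetch Intercom help-center articles as plain-text documents.

    A simplified illustration of a custom loader, not the Telnyx implementation.
    """
    docs = []
    url = "https://api.intercom.io/articles"
    headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"}
    while url:
        page = requests.get(url, headers=headers).json()
        for article in page.get("data", []):
            # Strip HTML tags so only readable text is embedded.
            text = re.sub(r"<[^>]+>", " ", article.get("body") or "")
            docs.append({"title": article.get("title", ""), "text": " ".join(text.split())})
        # Follow pagination if the API returns a next-page URL; stop otherwise.
        next_page = page.get("pages", {}).get("next")
        url = next_page if isinstance(next_page, str) else None
    return docs
```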
As a leading communications provider, we're excited to support conversational AI tools in addition to our in-house chatbot.
During Open AprIl, Telnyx Senior Software Engineer Enzo Piacenza built a conversational AI bot that needed very low latency to convincingly mimic human speech. This project prompted our AI squad to build the necessary APIs to support real-time voice interactions. The team worked to run quantized models on our GPUs to create a bot that could produce authentic, human-like speech that's both high quality and blazing fast.
A quantized model is a language model intentionally made to be a little less "precise" to make it smaller and faster. For this reason, quantized models are a popular way to run LLMs on storage- and energy-constrained consumer hardware.
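To make the trade-off concrete, here's a tiny, self-contained sketch of the core idea: map high-precision weights onto a small integer range and accept a bit of rounding error in exchange for a smaller, faster model. Real quantization schemes are far more sophisticated, but the principle is the same.

```python
import numpy as np

# Symmetric int8 quantization of a weight tensor: store 8-bit integers plus
# one float scale instead of 32-bit floats, then dequantize at compute time.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # map the largest weight to +/-127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale     # approximate reconstruction

print("max rounding error:", np.abs(weights - dequantized).max())
print("size reduction: 4 bytes -> 1 byte per weight")
```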
While those are important considerations for conversational AI at Telnyx, they’re not the key constraint. What makes quantized models so powerful for this use case is their latency optimization, with little reduction in quality.
The quantized Mistral 7B Instruct model running on our GPUs generates 120 tokens per second for a batch size of one—making it viable for use in conversational AI tools like Enzo’s.
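A quick back-of-the-envelope calculation shows why that throughput matters for voice, assuming a typical spoken reply of around 30 tokens:

```python
# Rough latency budget for a spoken reply, assuming ~30 tokens per sentence.
tokens_per_second = 120
reply_tokens = 30

generation_time = reply_tokens / tokens_per_second
print(f"{generation_time:.2f} s to generate the reply")  # 0.25 s
```

A quarter of a second to generate a full sentence leaves headroom for speech synthesis and network overhead while still feeling conversational.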
Things are moving fast in the AI space, and it can be hard to keep up. We’ve only begun to scratch the surface of what’s possible with AI—especially when paired with intuitive APIs for storage and voice.
If you’ve been building your own AI projects this year, we would love to hear from you. Share your project with our Developer Community.
We can’t wait to see what you build on Telnyx.
What is Telnyx Inference? Telnyx Inference refers to running trained AI models inside communications flows to interpret and act on real-time inputs. It powers tasks like speech recognition, intent detection, summarization, and automated responses across voice and messaging.
How does Telnyx Inference work? Audio or text is captured, normalized, and fed to speech-to-text and language models, which return structured intents or text that applications use to route, respond, or trigger tools. For voice, the system streams tokens for low-latency replies and uses text-to-speech to speak back, while call control updates the flow in real time.
What does inference mean in large language models? Inference is the process a trained model uses to generate outputs from new inputs, typically by predicting tokens step by step. It is distinct from training, which adjusts model weights, and it emphasizes latency, quality, and cost during serving.
Can models like ChatGPT act as the inference engine in Telnyx workflows? Yes, many teams wire proprietary or open-source LLMs into their pipelines and treat them as the runtime engine for understanding and response. If you need the model to act on incoming images or video, you can route media through a programmable MMS API so the engine receives a consistent payload to process.
Does message format, like SMS or MMS, change how inference performs? Yes, channel choice affects payload size, media types, and how much the model must parse, which directly influences latency and cost. Larger media and richer content in SMS vs MMS scenarios often require pre-processing and may slow responses compared with short text.
How should I handle images, audio, or video when using inference on messaging? Use MMS messaging for rich media so your system can ingest attachments with the right metadata and hand them to a multimodal model. Apply guardrails like file-size checks, media type validation, and background transcription or captioning to keep responses accurate and fast.
What is the difference between rules of inference and model inference in AI? Rules of inference are formal logic steps for deriving conclusions from premises, while model inference is the statistical process of generating outputs from a trained model. In practice you can combine symbolic rules for validation with LLM outputs, but the mechanisms and guarantees differ.
You can test out Telnyx Inference, currently in beta, by signing up for a free Telnyx account.
Similarity search and custom loaders are available now in our API documentation.