The building blocks for custom, scalable AI Inference
We built Telnyx Inference as a platform where developers can easily harness the power of AI with fast, contextualized inference.
By James Whedbee
In April 2023, Telnyx held its first internal hackathon, Open AprIl. All the projects showcased Telnyx employees’ creativity and drive to innovate, and several showed promise for further development.
While we were excited about so many of the initiatives we saw during Open AprIl, the entire event made us realize we needed to double down on AI as a company. Today, Telnyx VP of Engineering James Whedbee takes us through how Open AprIl sparked the idea for Telnyx’s new Inference product.
You can test out Telnyx Inference, currently in beta, by signing up for a free Telnyx account.
Iterating from our Open AprIl project to create Inference
During Open AprIl, two things became clear about Telnyx’s approach to projects using artificial intelligence (AI):
1. There was a layer of abstraction missing. As an Open AprIl coordinator, I learned a lot by watching our teams reinvent the same building blocks for their applications on top of OpenAI, which added time to every project.
2. We weren’t in control. Our applications were wholly dependent on OpenAI in terms of availability, latency, new features, and cost.
With that in mind, we started to build Telnyx Inference to provide a platform where developers can stay on the bleeding edge, at scale, to build their AI experiences with both proprietary and open-source models.
Building on top of our GPU infrastructure and Cloud Storage product, our AI team has released several APIs to support retrieval-augmented generation (RAG) pipelines at scale. Telnyx customers—and our engineering teams—can now:
- Upload documents through Telnyx Cloud Storage
- Vectorize documents using state-of-the-art open-source embedding models
- Perform similarity searches
- Interact with state-of-the-art open-source LLMs and embedded content
AI-ready storage at scale
Telnyx Cloud Storage has a simple S3-compatible interface for uploading thousands of documents. Built for RAG pipelines at scale, we can offer AI-enabled document storage and retrieval at a tenth the price of OpenAI’s recently released Assistants API.
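Because the interface is S3-compatible, any standard S3 client can push documents into a bucket. Here’s a minimal sketch using boto3; the endpoint URL, bucket name, and credentials are placeholders you’d swap for your own Telnyx values from the portal.

```python
import mimetypes
from pathlib import Path


def guess_content_type(path):
    """Best-effort content type for a document, defaulting to binary."""
    ctype, _ = mimetypes.guess_type(path)
    return ctype or "application/octet-stream"


def upload_documents(bucket, paths, endpoint_url, access_key, secret_key):
    """Upload local files to an S3-compatible bucket via boto3."""
    import boto3  # imported lazily so the helper above works without it

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # placeholder: your Telnyx storage endpoint
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    for path in paths:
        key = Path(path).name  # object key: here, just the filename
        s3.upload_file(
            path, bucket, key,
            ExtraArgs={"ContentType": guess_content_type(path)},
        )
```

Any other S3 tooling (the AWS CLI, rclone, SDKs in other languages) should work the same way against the bucket.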
Our GPU network will then run the latest open-source embedding models over your documents, and will provide your unique context to our similarity search and inference APIs within seconds.
Staying on the bleeding edge
Another thing that excited our AI squad was the speed of development of new open-source models. Every single week, there’s something new and exciting to try. We wanted to build a platform that could be easily extended to take advantage of the staggering breadth of open-source models in the market.
While our public API currently provides out-of-the-box support for some of the most popular models like Llama 2 and Mistral, our team is hard at work to provide support for the full range of open-source models to run on Telnyx GPU.
Drinking our own champagne
As I said, we built the Telnyx Inference product to address the real-world problems our teams were facing building their own AI applications. By using our own Inference APIs, we can support these use cases with higher efficiency and lower costs, while accelerating the development of our product for others.
Optimizing a RAG pipeline for an AI chatbot
Our NOC engineering team has been busy building a customer support chatbot to enhance our customer support experience with AI. The API we built to support a RAG pipeline was directly informed by our team’s needs.
Granular control of embeddings
Anyone who’s worked with embedding documents knows there are an overwhelming number of choices to make, all of which could impact your chatbot’s performance:
- Which embedding model do I use?
- How do I split my documents?
- How much overlap should the chunks have?
- What metadata do I include?
- What distance algorithm should I use?
- Should I rerank?
- How many results should I include in the prompt?
- What should the format of the prompt be?
To help folks get started quickly, our embedding and inference API makes reasonable default choices for everything. All you have to do is upload files to a Telnyx Storage bucket, embed them, and start talking to a language model. When you add new files, they’ll be embedded automatically.
This process works for a first pass, but if you’re as serious about realizing the full potential of your chatbot as our NOC engineering team is, you’ll want control over these decisions. Between our embedding and similarity search APIs, we’ve exposed full control to our customers.
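To make a couple of those choices concrete, here is a minimal character-based chunker showing how chunk size and overlap interact. The numbers are illustrative defaults for the sketch, not Telnyx’s actual defaults.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into fixed-size character chunks with overlap, so a
    sentence cut at one boundary still appears intact in a neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Larger chunks preserve more context per embedding but blur topical focus; more overlap reduces boundary loss at the cost of storing (and searching) more vectors.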
Another request from our team was to implement a custom Intercom loader to improve search results. They noticed the default loaders provided by open-source libraries were insufficient, ultimately creating their own implementation to achieve their desired results. A version of that implementation is now available to all customers via our custom Intercom loader.
Similarity search and custom loaders are available now in our API documentation.
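Conceptually, a similarity search scores every chunk embedding against the query embedding and returns the closest matches. The pure-Python sketch below uses cosine similarity, one of the common distance choices mentioned above; the production implementation will differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The indices returned by `top_k` map back to the original chunks, whose text is then formatted into the LLM prompt.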
Optimizing LLMs for low-latency conversational AI
As a leading communications provider, we’re very excited to support conversational AI tools in addition to our in-house chatbot.
During Open AprIl, Telnyx Senior Software Engineer Enzo Piacenza built a conversational AI bot that needed very low latency to mimic natural human speech. This project prompted our AI squad to build the APIs necessary to support real-time voice interactions. The team worked to run quantized models on our GPUs, creating a bot that responds with authentic, human-like speech that’s both high quality and blazing fast.
A quantized model is a language model whose weights are stored at reduced numerical precision, trading a little accuracy for a smaller memory footprint and faster inference. For this reason, quantized models are a popular way to run LLMs on storage- and energy-constrained consumer hardware.
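The basic idea can be sketched with naive symmetric int8 quantization of a weight vector; real LLM quantization schemes are considerably more sophisticated, but the size/precision trade-off is the same.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]
```

Storing each weight as one byte instead of four shrinks the model roughly 4x, and the smaller weights move through memory faster, which is where the speedup comes from.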
While those are important considerations for conversational AI at Telnyx, they’re not the key constraint. What makes quantized models so powerful for this use case is their low latency, achieved with little reduction in output quality.
The quantized Mistral 7B Instruct model running on our GPUs generates 120 tokens per second for a batch size of one—making it viable for use in conversational AI tools like Enzo’s.
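To see why that throughput matters for conversation, a quick back-of-the-envelope calculation (the 120 tokens/second figure is from above; the reply length is an assumption for illustration):

```python
def reply_seconds(num_tokens, tokens_per_second=120.0):
    """Wall-clock time to generate a reply at a given throughput."""
    return num_tokens / tokens_per_second


# An assumed short spoken reply of ~30 tokens completes in a quarter
# second, comfortably inside the pause lengths of natural conversation.
print(round(reply_seconds(30), 2))  # → 0.25
```

With streaming, the first audible words arrive even sooner, since text-to-speech can start before the full reply is generated.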
What will you build with Telnyx Inference?
Things are moving fast in the AI space, and it can be hard to keep up. We’ve only begun to scratch the surface of what’s possible with AI—especially when paired with intuitive APIs for storage and voice.
If you’ve been building your own AI projects this year, we would love to hear from you. Share your project with our Developer Community.
We can’t wait to see what you build on Telnyx.