Learn how to build a retrieval-augmented generation (RAG) API with Telnyx AI Inference. Use Telnyx embeddings to vectorize a knowledge base, retrieve relevant documents, and send grounded context to Telnyx chat completions for accurate answers.
A generic chatbot can answer anything. That is also the problem.
If you are building support tooling, internal search, onboarding assistants, or product help, you usually do not want an answer from the whole internet. You want an answer from your documents.
That is the job of retrieval-augmented generation, or RAG. Retrieve the relevant context first. Then ask the model to answer from that context.
This example builds a small RAG API with Telnyx AI Inference. It uses Telnyx embeddings to vectorize a sample knowledge base, retrieves relevant documents for each question, and sends the retrieved context to Telnyx chat completions for a grounded answer.
The Flask app exposes three endpoints:
POST /rag/ask - retrieve relevant docs and answer a questionGET /documents - inspect the sample knowledge baseGET /health - check configured models and document countThe sample knowledge base includes documents about API key authentication, rate limits, webhook troubleshooting, verification message delivery, and billing support.
This is intentionally small. The point is to make the full RAG loop visible in one file.
Set your API key and optional model choices:
Inspect the knowledge base:
Ask a question:
The response includes an answer, the chat model, the embedding model, and source titles with similarity scores.
The app creates embeddings with POST /v2/ai/embeddings:
Each document becomes a list of numbers. The user question also becomes a list of numbers. Then the app compares the question vector to each document vector with cosine similarity.
The retrieval function ranks documents and returns the top matches:
This example stores embeddings in memory. That keeps the code easy to read. In production, you would store vectors in a database or search system and retrieve by similarity there.
After retrieval, the app sends the selected documents to Telnyx chat completions:
That instruction is the guardrail. The assistant is not supposed to improvise. It answers from the context or says it does not know.
RAG makes AI features easier to trust because the answer is tied to source material. Users can see which documents were used. Developers can update the knowledge base without retraining a model. Product teams can start with a small corpus and expand over time.
This example is not trying to be a full production search stack. It is the smallest useful version of the loop:
Once that loop works, production upgrades are straightforward.
Chunk long documents before embedding. Smaller chunks usually retrieve better than entire pages.
Persist embeddings outside the process. In-memory storage is fine for a demo, but production systems should use a vector database, search engine, or database extension.
Return source titles or URLs with the answer. Source visibility helps users trust the result and helps developers debug retrieval quality.
Add authentication before exposing the API. The sample is intentionally local and minimal.
The example is available in the Telnyx code examples repo:
https://github.com/team-telnyx/telnyx-code-examples/tree/main/build-rag-with-telnyx-inference-python
It includes the Flask app, quickstart, API reference, guide, requirements file, and environment template.
Related articles