Conversational AI

How to Build a RAG Application with Telnyx Inference

Learn how to build a retrieval-augmented generation (RAG) API with Telnyx AI Inference. Use Telnyx embeddings to vectorize a knowledge base, retrieve relevant documents, and send grounded context to Telnyx chat completions for accurate answers.

By Sonam Gupta

A generic chatbot can answer anything. That is also the problem.

If you are building support tooling, internal search, onboarding assistants, or product help, you usually do not want an answer from the whole internet. You want an answer from your documents.

That is the job of retrieval-augmented generation, or RAG. Retrieve the relevant context first. Then ask the model to answer from that context.

This example builds a small RAG API with Telnyx AI Inference. It uses Telnyx embeddings to vectorize a sample knowledge base, retrieves relevant documents for each question, and sends the retrieved context to Telnyx chat completions for a grounded answer.

What the App Does

The Flask app exposes three endpoints:

  • POST /rag/ask - retrieve relevant docs and answer a question
  • GET /documents - inspect the sample knowledge base
  • GET /health - check configured models and document count

The sample knowledge base includes documents about API key authentication, rate limits, webhook troubleshooting, verification message delivery, and billing support.

The Architecture

User question
      |
      v
Embed question with Telnyx
      |
      v
Compare against document embeddings
      |
      v
Send retrieved context to Telnyx AI
      |
      v
Grounded answer + source titles

This is intentionally small. The point is to make the full RAG loop visible in one file.

Clone and Run

git clone https://github.com/team-telnyx/telnyx-code-examples.git
cd telnyx-code-examples/build-rag-with-telnyx-inference-python
cp .env.example .env
pip install -r requirements.txt
python app.py

Set your API key and optional model choices:

TELNYX_API_KEY=your_telnyx_api_key
AI_MODEL=moonshotai/Kimi-K2.6
EMBEDDING_MODEL=thenlper/gte-large
HOST=127.0.0.1
PORT=5000

Inspect the knowledge base:

curl http://localhost:5000/documents

Ask a question:

curl -X POST http://localhost:5000/rag/ask \
  -H "Content-Type: application/json" \
  -d '{ "question": "Production signup broke after rotating an API key. Logs show 401 errors. What should we check first?" }'

The response includes an answer, the chat model, the embedding model, and source titles with similarity scores.

Embeddings: Turning Documents into Searchable Vectors

The app creates embeddings with POST /v2/ai/embeddings:

response = requests.post(
    f"{API_BASE}/embeddings",
    headers=_headers(),
    json={"model": EMBEDDING_MODEL, "input": inputs},
    timeout=60,
)

Each document becomes a list of numbers. The user question also becomes a list of numbers. Then the app compares the question vector to each document vector with cosine similarity.

Retrieval: Pick the Most Relevant Context

The retrieval function ranks documents and returns the top matches:

def retrieve(query: str, top_k: int = 3) -> list[dict]:
    query_embedding = create_embeddings(query)[0]
    document_embeddings = ensure_document_embeddings()

    scored = []
    for doc, embedding in zip(DOCUMENTS, document_embeddings):
        scored.append({
            "title": doc["title"],
            "text": doc["text"],
            "score": cosine_similarity(query_embedding, embedding),
        })

    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_k]

This example stores embeddings in memory. That keeps the code easy to read. In production, you would store vectors in a database or search system and retrieve by similarity there.

Generation: Answer Only from Context

After retrieval, the app sends the selected documents to Telnyx chat completions:

"content": (
    "You are a helpful support assistant. Answer the user's question "
    "using only the provided context. If the context does not contain "
    "the answer, say you do not know."
)

That instruction is the guardrail. The assistant is not supposed to improvise. It answers from the context or says it does not know.

Why This Pattern Works

RAG makes AI features easier to trust because the answer is tied to source material. Users can see which documents were used. Developers can update the knowledge base without retraining a model. Product teams can start with a small corpus and expand over time.

This example is not trying to be a full production search stack. It is the smallest useful version of the loop:

  1. Embed docs
  2. Embed question
  3. Retrieve sources
  4. Answer from sources

Once that loop works, production upgrades are straightforward.

Production Notes

Chunk long documents before embedding. Smaller chunks usually retrieve better than entire pages.

Persist embeddings outside the process. In-memory storage is fine for a demo, but production systems should use a vector database, search engine, or database extension.

Return source titles or URLs with the answer. Source visibility helps users trust the result and helps developers debug retrieval quality.

Add authentication before exposing the API. The sample is intentionally local and minimal.

Get the Code

The example is available in the Telnyx code examples repo:

https://github.com/team-telnyx/telnyx-code-examples/tree/main/build-rag-with-telnyx-inference-python

It includes the Flask app, quickstart, API reference, guide, requirements file, and environment template.

Share on Social