From documents to answers: a step-by-step guide to building a production RAG pipeline with Telnyx Cloud Storage

Retrieval-augmented generation (RAG) has become the standard pattern for grounding large language models in private data. Instead of relying on training data alone, a RAG pipeline retrieves relevant context from your own documents and injects it into the prompt, giving you accurate, domain-specific answers without fine-tuning.
Most tutorials focus on retrieval logic: chunking strategies, embedding models, vector databases. What they skip is where your documents actually live. The storage layer isn't a footnote. It determines your pipeline's cost, latency, and operational complexity at every retrieval operation, and those effects compound at scale.
This tutorial builds a working RAG pipeline using Telnyx Cloud Storage as the document backbone. The LLM layer uses Claude as a concrete example, but the same architecture works with OpenAI, Mistral, Llama, or any hosted or self-hosted model. The storage and retrieval layers don't change when you switch providers; the model is a replaceable component.
Prerequisites
Accounts and credentials
You'll need a Telnyx Cloud Storage account. The free tier includes 10 GiB of storage and 1M write and 10M read operations per month, with no credit card required. If you haven't set up a bucket before, the quick start guide walks through the portal in a few minutes. You'll also need an API key for whichever LLM you're using; the code examples below use Anthropic Claude, but any provider works.
Install dependencies
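A minimal dependency set for the examples below (the exact packages are an assumption for this walkthrough: boto3 for the S3-compatible API, requests for the Telnyx REST calls, anthropic for the Claude examples; swap in openai if that's your provider):

```shell
pip install boto3 requests anthropic
```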
Set environment variables
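The variable names and the endpoint hostname below are this tutorial's conventions, not fixed requirements; copy the real values from your Telnyx portal and the API endpoints reference:

```shell
# Telnyx Cloud Storage (S3-compatible) endpoint and credentials.
# The hostname is a placeholder; pick your region from the endpoints reference.
export TELNYX_STORAGE_ENDPOINT="https://<region>.telnyxstorage.com"
export TELNYX_STORAGE_KEY_ID="<your-access-key-id>"
export TELNYX_STORAGE_SECRET="<your-secret-access-key>"

# Telnyx API key for the Embed and Similarity Search calls
export TELNYX_API_KEY="<your-telnyx-api-key>"

# LLM provider key (Anthropic in the examples below)
export ANTHROPIC_API_KEY="<your-anthropic-key>"
```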
For available regional endpoints, see the API endpoints reference. Pick the region closest to where your inference workload runs.
Pipeline Architecture
The pipeline flows through three stages. Storage sits at the center of all three: it's where documents live, where embeddings originate, and where every retrieval query fetches its data.
| Stage | What happens | Key component |
|---|---|---|
| 1 — Store | Upload documents to an S3-compatible bucket | Telnyx Cloud Storage |
| 2 — Embed | Embed stored documents server-side; index the vectors locally | Telnyx Embed API + ChromaDB |
| 3 — Generate | Query with similarity search, retrieve relevant chunks, pass to the LLM | Any LLM API or self-hosted model |
Stage 1: Store Your Documents
Telnyx Cloud Storage is fully S3-compatible, so your existing boto3 code works as-is. The only change from a standard S3 setup is the endpoint URL.
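A sketch of the upload step, assuming the environment variables from the prerequisites and an example bucket name (`rag-docs`):

```python
import os

def object_key(path: str) -> str:
    # Derive the object key from the local filename
    return os.path.basename(path)

def upload_documents(bucket: str, paths: list[str]) -> list[str]:
    """Upload local files to a Telnyx bucket; returns the object keys."""
    import boto3  # deferred so the pure helper above has no dependencies

    s3 = boto3.client(
        "s3",
        # The only Telnyx-specific line: point boto3 at the Telnyx endpoint
        endpoint_url=os.environ["TELNYX_STORAGE_ENDPOINT"],
        aws_access_key_id=os.environ["TELNYX_STORAGE_KEY_ID"],
        aws_secret_access_key=os.environ["TELNYX_STORAGE_SECRET"],
    )
    keys = []
    for path in paths:
        key = object_key(path)
        s3.upload_file(path, bucket, key)
        keys.append(key)
    return keys

# Example: upload_documents("rag-docs", ["docs/handbook.pdf", "docs/faq.md"])
```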
Any file type works: PDF, DOCX, TXT, Markdown. The Telnyx Embed API in Stage 2 handles text extraction automatically. If your documents are already in an AWS S3 bucket, the S3 Migration API transfers everything to Telnyx without you paying AWS egress charges (currently US buckets only); it's a configuration change, not a rewrite. Review the S3 compatibility reference for any API differences relevant to your workload.
Stage 2: Generate Embeddings and Build the Index
This is where Telnyx Cloud Storage diverges from a standard object store. Rather than downloading each document, running it through a separate embedding model, and managing a sync pipeline between services, you call the Telnyx Embed API directly against files already in storage. Embeddings are generated server-side from the stored object, with no data duplication and no orchestration layer to maintain.
A typical self-managed RAG stack for this step chains object storage, an external embedding model, and a separate vector database, with sync logic between them. Telnyx collapses storage and embedding into a single API call: store the file, embed the bucket via the API, and query with similarity search, all from the same service. When a document changes, one request regenerates its vector against the current file in place.
Embed your bucket
Trigger embedding and verify
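A sketch of both steps. The endpoint paths and response fields here are assumptions based on a task-style flow (one POST to embed the bucket, then a check that the job finished); verify them against the current Telnyx Embed API reference:

```python
import os

TELNYX_AI = "https://api.telnyx.com/v2/ai"  # base path assumed; verify in the docs

def auth_headers(api_key: str) -> dict:
    # Bearer-token header used by the Telnyx AI endpoints
    return {"Authorization": f"Bearer {api_key}"}

def embed_bucket(bucket: str) -> dict:
    """Ask Telnyx to embed every object in the bucket, server-side."""
    import requests  # deferred so the pure helper above stays dependency-free

    resp = requests.post(
        f"{TELNYX_AI}/embeddings",
        headers=auth_headers(os.environ["TELNYX_API_KEY"]),
        json={"bucket_name": bucket},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains a task identifier you can poll

def embedded_buckets() -> dict:
    """List embedded buckets to verify the job completed."""
    import requests

    resp = requests.get(
        f"{TELNYX_AI}/embeddings/buckets",  # assumed listing endpoint
        headers=auth_headers(os.environ["TELNYX_API_KEY"]),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```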
Stage 3: Retrieve Context and Generate Answers
With documents stored and indexed, the query path is: call the Telnyx Similarity Search API against your embedded bucket, retrieve the most relevant chunks, and pass everything to your LLM. The retrieval logic is identical regardless of which model you use at the end.
Search your embedded documents
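A sketch of `retrieve_context()`, the function the rest of the article refers to. The similarity-search path and the field names (`bucket_name`, `query`, `num_of_docs`, `document_chunk`) are assumptions to check against the Similarity Search API reference:

```python
import os

TELNYX_AI = "https://api.telnyx.com/v2/ai"

def format_context(chunks: list[str]) -> str:
    # Join retrieved chunks with visible separators for the prompt
    return "\n\n---\n\n".join(chunks)

def retrieve_context(bucket: str, query: str, num_docs: int = 3) -> str:
    import requests  # deferred so format_context stays dependency-free

    resp = requests.post(
        f"{TELNYX_AI}/embeddings/similarity-search",
        headers={"Authorization": f"Bearer {os.environ['TELNYX_API_KEY']}"},
        json={"bucket_name": bucket, "query": query, "num_of_docs": num_docs},
        timeout=30,
    )
    resp.raise_for_status()
    # Response shape is an assumption: a list of matches carrying chunk text
    matches = resp.json()["data"]
    return format_context([m["document_chunk"] for m in matches])
```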
Generate an answer
The example below uses Claude. To switch providers, replace the anthropic_client.messages.create block with your provider's equivalent. Everything above it (storage, embedding, retrieval) stays the same.
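A sketch using the Anthropic SDK; the model name is a placeholder, and the prompt format is just one reasonable grounding pattern:

```python
import os

def build_prompt(question: str, context: str) -> str:
    # Grounding pattern: retrieved context first, question last
    return (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def generate_answer(question: str, context: str) -> str:
    import anthropic  # deferred; the client reads ANTHROPIC_API_KEY

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; any current Claude model works
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(question, context)}],
    )
    return msg.content[0].text

# Example:
# context = retrieve_context("rag-docs", "What is the refund policy?")
# print(generate_answer("What is the refund policy?", context))
```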
Switching to a different LLM
To use OpenAI instead of Claude:
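For example (the model name is a placeholder, and OPENAI_API_KEY must be set in the environment):

```python
def rag_messages(question: str, context: str) -> list[dict]:
    # Same grounding pattern as the Claude example, as a chat message list
    content = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return [{"role": "user", "content": content}]

def generate_answer_openai(question: str, context: str) -> str:
    from openai import OpenAI  # deferred; the client reads OPENAI_API_KEY

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model works
        messages=rag_messages(question, context),
    )
    return resp.choices[0].message.content
```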
For self-hosted models running via Ollama or vLLM, point the client at your local endpoint. The retrieve_context() function doesn't change.
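Both servers expose an OpenAI-compatible endpoint, so only the client construction changes. The ports below are the usual defaults (an assumption; adjust to your deployment):

```python
def local_base_url(host: str = "localhost", port: int = 11434) -> str:
    # Ollama defaults to port 11434; vLLM's OpenAI-compatible server to 8000
    return f"http://{host}:{port}/v1"

def local_client(host: str = "localhost", port: int = 11434):
    from openai import OpenAI  # deferred import

    # api_key is required by the SDK but ignored by local servers
    return OpenAI(base_url=local_base_url(host, port), api_key="not-needed")
```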
Scaling Considerations
As your document corpus grows, three factors determine whether the pipeline scales gracefully or becomes a cost and performance problem.
Egress costs
RAG pipelines are read-heavy. Every question your application answers triggers at least one document fetch, often several. Major cloud providers charge $0.08–$0.12 per GB on every byte transferred out of storage. At production scale that isn't a rounding error; it becomes one of the largest line items in your infrastructure budget.
Egress pricing comparison
| Provider | Egress per GB | Notes |
|---|---|---|
| AWS S3 | $0.09 | Standard pricing |
| Google Cloud Storage | $0.08–0.12 | Region-dependent |
| Azure Blob Storage | $0.087 | Standard tier |
| Telnyx Cloud Storage | $0.00 | No egress fees |
To put that in concrete terms: a RAG application handling 50,000 queries per day, each fetching an average of 500 KB of context, moves roughly 750 GB per month out of storage. At $0.09/GB that's $67.50/month in egress charges alone, before compute, capacity, or API costs. At $0.00/GB it's nothing, and that number doesn't grow as query volume scales.
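The arithmetic above, spelled out (assuming 30 days per month and 1 GB = 1,000,000 KB):

```python
queries_per_day = 50_000
kb_per_query = 500
days_per_month = 30

gb_per_month = queries_per_day * kb_per_query * days_per_month / 1_000_000  # KB -> GB
egress_cost = gb_per_month * 0.09  # at $0.09/GB

print(gb_per_month)  # 750.0
print(egress_cost)   # 67.5
```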
Telnyx charges $0.006/GiB for storage in the US, with separate pricing for EU (eu-central-1) and Australia regions. Regardless of region, egress is always $0.00. See the full pricing page for current rates by region.
Embedding freshness
Documents change. Product documentation gets updated, knowledge bases evolve, new files are added. In a decoupled architecture (separate storage, embedding service, and vector DB), keeping vectors in sync requires orchestration logic, scheduled jobs, and drift monitoring. With Telnyx's native embedding, updating a document's vector is a single API call to re-embed the bucket or a click in the portal. No sync job. No stale embeddings. No orchestration layer to maintain.
Regional latency
Where your storage lives relative to your compute affects every retrieval round-trip. Telnyx runs storage on a private backbone network, not public internet transit, across US, EU, and APAC regions. This gives lower and more predictable latency than public cloud egress paths. Use the API endpoints reference to pick the region co-located with your inference workload.
Why Storage Choice Matters for RAG
The storage layer in a RAG pipeline is not passive infrastructure. It's an active component that shapes cost, complexity, and iteration speed at every retrieval operation.
Cost predictability
Hyperscaler egress pricing turns retrieval-heavy workloads into unpredictable expenses. As query volume grows, cloud bills grow in lockstep, because every retrieval is a billable data transfer. Zero-egress pricing makes retrieval costs flat regardless of how much data your pipeline reads. That's the difference between costs that scale with the value your product delivers and costs that scale with every single query.
Pipeline complexity
A typical self-managed RAG stack requires: an object store, an embedding service, a vector database, and orchestration logic to keep them in sync. That's four components to provision, monitor, and maintain. Telnyx's native embedding collapses two of those into one: you store a document, and embedding is an API call away on the same service. Fewer integration points mean fewer failure modes and faster debugging when something goes wrong.
Iteration speed
Tuning a RAG pipeline involves frequent experimentation: adjusting chunk sizes, testing different embedding models, expanding or pruning the document corpus. Each iteration requires re-embedding affected documents. When that's a single API call per file rather than a pipeline rebuild, experimentation cycles shrink from hours to minutes. The friction in your pipeline determines how fast you can improve retrieval quality.
Next Steps
The pipeline above is functional and production-ready at modest scale. Here's where to take it next:
Add chunking - Long documents benefit from being split into smaller chunks before embedding; each chunk gets its own vector, and retrieval becomes more precise. LangChain's text splitters handle this well and integrate cleanly with the indexing loop in Stage 2.
Tune retrieval quality - Experiment with the n_results parameter and cosine distance thresholds. ChromaDB's where clause also lets you filter by document metadata before semantic search, which is useful for multi-tenant or access-controlled knowledge bases.
Handle document updates incrementally - When a file changes, call embed_document() again and upsert the new vector with collection.upsert(). ChromaDB overwrites the old entry by ID. The Stage 2 indexing loop already skips existing IDs, so you can run it on a schedule safely.
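As a dependency-free sketch of the chunking idea (fixed-size windows with character overlap; LangChain's splitters are smarter about sentence and paragraph boundaries):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Each chunk would then get its own ID and vector in the Stage 2 indexing loop.
```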
Scale your vector store - ChromaDB is ideal for development and moderate corpus sizes. For production at scale, Qdrant (open source, Apache 2.0, 6ms p50 latency at 1M vectors) and Pinecone (fully managed, no infrastructure to operate) are the leading options. The retrieve_context() function is the only thing that changes when you swap vector stores.
Explore the full Telnyx AI stack - Telnyx's AI Inference blog post covers how their team built a production RAG chatbot on the same infrastructure, including the embedding and inference API decisions they made along the way.
Ready to build your RAG pipeline?
High-performance object storage with zero egress fees and native AI embedding: free to start, no credit card required.
Explore Telnyx Cloud Storage →