From documents to answers: a step-by-step guide to building a production RAG pipeline with Telnyx Cloud Storage

Retrieval-augmented generation (RAG) has become the standard pattern for grounding large language models in private data. Instead of relying on training data alone, a RAG pipeline retrieves relevant context from your own documents and injects it into the prompt, giving you accurate, domain-specific answers without fine-tuning.
Most tutorials focus on retrieval logic: chunking strategies, embedding models, vector databases. What they skip is where your documents actually live. The storage layer isn't a footnote. It determines your pipeline's cost, latency, and operational complexity at every retrieval operation, and those effects compound at scale.
This tutorial builds a working RAG pipeline using Telnyx Cloud Storage as the document backbone. The LLM layer uses Claude as a concrete example, but the same architecture works with OpenAI, Mistral, Llama, or any hosted or self-hosted model. The storage and retrieval layers don't change when you switch providers; the model is a replaceable component.
Prerequisites
Accounts and credentials
You'll need a Telnyx Cloud Storage account. The free tier includes 10 GiB of storage and 1M write and 10M read operations per month, with no credit card required. If you haven't set up a bucket before, the quick start guide walks through the portal in a few minutes. You'll also need an API key for whichever LLM you're using; the code examples below use Anthropic Claude, but any provider works.
Install dependencies
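A minimal dependency set for the examples below (the exact packages are an assumption for this walkthrough: boto3 for the S3-compatible API, requests for the Telnyx REST calls, anthropic for the Claude examples; swap in openai if that's your provider):

```shell
pip install boto3 requests anthropic
```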
Set environment variables
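The variable names and the endpoint hostname below are this tutorial's conventions, not fixed requirements; copy the real values from your Telnyx portal and the API endpoints reference:

```shell
# Telnyx Cloud Storage (S3-compatible) endpoint and credentials.
# The hostname is a placeholder; pick your region from the endpoints reference.
export TELNYX_STORAGE_ENDPOINT="https://<region>.telnyxstorage.com"
export TELNYX_STORAGE_KEY_ID="<your-access-key-id>"
export TELNYX_STORAGE_SECRET="<your-secret-access-key>"

# Telnyx API key for the Embed and Similarity Search calls
export TELNYX_API_KEY="<your-telnyx-api-key>"

# LLM provider key (Anthropic in the examples below)
export ANTHROPIC_API_KEY="<your-anthropic-key>"
```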
For available regional endpoints, see the API endpoints reference. Pick the region closest to where your inference workload runs.
Pipeline Architecture
The pipeline flows through three stages. Storage sits at the center of all three: it's where documents live, where embeddings originate, and where every retrieval query fetches its data.
| Stage | What happens | Key component |
|---|---|---|
| 1 — Store | Upload documents to an S3-compatible bucket | Telnyx Cloud Storage |
| 2 — Embed | Embed stored documents server-side; index the vectors locally | Telnyx Embed API + ChromaDB |
| 3 — Generate | Query with similarity search, retrieve relevant chunks, pass to the LLM | Any LLM API or self-hosted model |
Stage 1: Store Your Documents
Telnyx Cloud Storage is fully S3-compatible, so your existing boto3 code works as-is. The only change from a standard S3 setup is the endpoint URL.
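A sketch of the upload step, assuming the environment variables from the prerequisites and an example bucket name (`rag-docs`):

```python
import os

def object_key(path: str) -> str:
    # Derive the object key from the local filename
    return os.path.basename(path)

def upload_documents(bucket: str, paths: list[str]) -> list[str]:
    """Upload local files to a Telnyx bucket; returns the object keys."""
    import boto3  # deferred so the pure helper above has no dependencies

    s3 = boto3.client(
        "s3",
        # The only Telnyx-specific line: point boto3 at the Telnyx endpoint
        endpoint_url=os.environ["TELNYX_STORAGE_ENDPOINT"],
        aws_access_key_id=os.environ["TELNYX_STORAGE_KEY_ID"],
        aws_secret_access_key=os.environ["TELNYX_STORAGE_SECRET"],
    )
    keys = []
    for path in paths:
        key = object_key(path)
        s3.upload_file(path, bucket, key)
        keys.append(key)
    return keys

# Example: upload_documents("rag-docs", ["docs/handbook.pdf", "docs/faq.md"])
```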
Any file type works: PDF, DOCX, TXT, Markdown. The Telnyx Embed API in Stage 2 handles text extraction automatically. If your documents are already in an AWS S3 bucket, the S3 Migration API transfers everything to Telnyx without you paying AWS egress charges (currently US buckets only); it's a configuration change, not a rewrite. Review the S3 compatibility reference for any API differences relevant to your workload.
Stage 2: Generate Embeddings and Build the Index
This is where Telnyx Cloud Storage diverges from a standard object store. Rather than downloading each document, running it through a separate embedding model, and managing a sync pipeline between services, you call the Telnyx Embed API directly against files already in storage. Embeddings are generated server-side from the stored object, with no data duplication and no orchestration layer to maintain.
A typical self-managed RAG stack for this step chains object storage, an external embedding model, and a separate vector database, with sync logic between them. Telnyx collapses storage and embedding into a single API call: store the file, embed the bucket via the API, and query with similarity search, all from the same service. When a document changes, one request regenerates its vector against the current file in place.
Embed your bucket
Trigger embedding and verify
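A sketch of both steps. The endpoint paths and response fields here are assumptions based on a task-style flow (one POST to embed the bucket, then a check that the job finished); verify them against the current Telnyx Embed API reference:

```python
import os

TELNYX_AI = "https://api.telnyx.com/v2/ai"  # base path assumed; verify in the docs

def auth_headers(api_key: str) -> dict:
    # Bearer-token header used by the Telnyx AI endpoints
    return {"Authorization": f"Bearer {api_key}"}

def embed_bucket(bucket: str) -> dict:
    """Ask Telnyx to embed every object in the bucket, server-side."""
    import requests  # deferred so the pure helper above stays dependency-free

    resp = requests.post(
        f"{TELNYX_AI}/embeddings",
        headers=auth_headers(os.environ["TELNYX_API_KEY"]),
        json={"bucket_name": bucket},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains a task identifier you can poll

def embedded_buckets() -> dict:
    """List embedded buckets to verify the job completed."""
    import requests

    resp = requests.get(
        f"{TELNYX_AI}/embeddings/buckets",  # assumed listing endpoint
        headers=auth_headers(os.environ["TELNYX_API_KEY"]),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```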
Stage 3: Retrieve Context and Generate Answers
With documents stored and indexed, the query path is: call the Telnyx Similarity Search API against your embedded bucket, retrieve the most relevant chunks, and pass everything to your LLM. The retrieval logic is identical regardless of which model you use at the end.
Search your embedded documents
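A sketch of `retrieve_context()`, the function the rest of the article refers to. The similarity-search path and the field names (`bucket_name`, `query`, `num_of_docs`, `document_chunk`) are assumptions to check against the Similarity Search API reference:

```python
import os

TELNYX_AI = "https://api.telnyx.com/v2/ai"

def format_context(chunks: list[str]) -> str:
    # Join retrieved chunks with visible separators for the prompt
    return "\n\n---\n\n".join(chunks)

def retrieve_context(bucket: str, query: str, num_docs: int = 3) -> str:
    import requests  # deferred so format_context stays dependency-free

    resp = requests.post(
        f"{TELNYX_AI}/embeddings/similarity-search",
        headers={"Authorization": f"Bearer {os.environ['TELNYX_API_KEY']}"},
        json={"bucket_name": bucket, "query": query, "num_of_docs": num_docs},
        timeout=30,
    )
    resp.raise_for_status()
    # Response shape is an assumption: a list of matches carrying chunk text
    matches = resp.json()["data"]
    return format_context([m["document_chunk"] for m in matches])
```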
Generate an answer
The example below uses Claude. To switch providers, replace the anthropic_client.messages.create block with your provider's equivalent. Everything above it (storage, embedding, retrieval) stays the same.
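A sketch using the Anthropic SDK; the model name is a placeholder, and the prompt format is just one reasonable grounding pattern:

```python
import os

def build_prompt(question: str, context: str) -> str:
    # Grounding pattern: retrieved context first, question last
    return (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def generate_answer(question: str, context: str) -> str:
    import anthropic  # deferred; the client reads ANTHROPIC_API_KEY

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; any current Claude model works
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(question, context)}],
    )
    return msg.content[0].text

# Example:
# context = retrieve_context("rag-docs", "What is the refund policy?")
# print(generate_answer("What is the refund policy?", context))
```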
Switching to a different LLM
To use OpenAI instead of Claude:
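For example (the model name is a placeholder, and OPENAI_API_KEY must be set in the environment):

```python
def rag_messages(question: str, context: str) -> list[dict]:
    # Same grounding pattern as the Claude example, as a chat message list
    content = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return [{"role": "user", "content": content}]

def generate_answer_openai(question: str, context: str) -> str:
    from openai import OpenAI  # deferred; the client reads OPENAI_API_KEY

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model works
        messages=rag_messages(question, context),
    )
    return resp.choices[0].message.content
```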
For self-hosted models running via Ollama or vLLM, point the client at your local endpoint. The retrieve_context() function doesn't change.
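Both servers expose an OpenAI-compatible endpoint, so only the client construction changes. The ports below are the usual defaults (an assumption; adjust to your deployment):

```python
def local_base_url(host: str = "localhost", port: int = 11434) -> str:
    # Ollama defaults to port 11434; vLLM's OpenAI-compatible server to 8000
    return f"http://{host}:{port}/v1"

def local_client(host: str = "localhost", port: int = 11434):
    from openai import OpenAI  # deferred import

    # api_key is required by the SDK but ignored by local servers
    return OpenAI(base_url=local_base_url(host, port), api_key="not-needed")
```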
Scaling Considerations
As your document corpus grows, three factors determine whether the pipeline scales gracefully or becomes a cost and performance problem.
Egress costs
RAG pipelines are read-heavy. Every question your application answers triggers at least one document fetch, often several. Major cloud providers charge $0.08–$0.12 per GB on every byte transferred out of storage. At production scale that isn't a rounding error; it becomes one of the largest line items in your infrastructure budget.
Egress pricing comparison
| Provider | Egress per GB | Notes |
|---|---|---|
| AWS S3 | $0.09 | Standard pricing |
| Google Cloud Storage | $0.08–0.12 | Region-dependent |
| Azure Blob Storage | $0.087 | Standard tier |
| Telnyx Cloud Storage | $0.00 | No egress fees |
To put that in concrete terms: a RAG application handling 50,000 queries per day, each fetching an average of 500 KB of context, moves roughly 750 GB per month out of storage. At $0.09/GB that's $67.50/month in egress charges alone, before compute, capacity, or API costs. At $0.00/GB it's nothing, and that number doesn't grow as query volume scales.
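The arithmetic above, spelled out (assuming 30 days per month and 1 GB = 1,000,000 KB):

```python
queries_per_day = 50_000
kb_per_query = 500
days_per_month = 30

gb_per_month = queries_per_day * kb_per_query * days_per_month / 1_000_000  # KB -> GB
egress_cost = gb_per_month * 0.09  # at $0.09/GB

print(gb_per_month)  # 750.0
print(egress_cost)   # 67.5
```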
Telnyx charges $0.006/GiB for storage in the US, with separate pricing for EU (eu-central-1) and Australia regions. Regardless of region, egress is always $0.00. See the full pricing page for current rates by region.
Embedding freshness
Documents change. Product documentation gets updated, knowledge bases evolve, new files are added. In a decoupled architecture (separate storage, embedding service, and vector DB), keeping vectors in sync requires orchestration logic, scheduled jobs, and drift monitoring. With Telnyx's native embedding, updating a document's vector is a single API call to re-embed the bucket or a click in the portal. No sync job. No stale embeddings. No orchestration layer to maintain.
Regional latency
Where your storage lives relative to your compute affects every retrieval round-trip. Telnyx runs storage on a private backbone network, not public internet transit, across US, EU, and APAC regions. This gives lower and more predictable latency than public cloud egress paths. Use the API endpoints reference to pick the region co-located with your inference workload.
Why Storage Choice Matters for RAG
The storage layer in a RAG pipeline is not passive infrastructure. It's an active component that shapes cost, complexity, and iteration speed at every retrieval operation.
Cost predictability
Hyperscaler egress pricing turns retrieval-heavy workloads into unpredictable expenses. As query volume grows, cloud bills grow in lockstep, because every retrieval is a billable data transfer. Zero-egress pricing makes retrieval costs flat regardless of how much data your pipeline reads. That's the difference between costs that scale with the value your product delivers and costs that scale with every single query.
Pipeline complexity
A typical self-managed RAG stack requires: an object store, an embedding service, a vector database, and orchestration logic to keep them in sync. That's four components to provision, monitor, and maintain. Telnyx's native embedding collapses two of those into one: you store a document, and embedding is an API call away on the same service. Fewer integration points mean fewer failure modes and faster debugging when something goes wrong.
Iteration speed
Tuning a RAG pipeline involves frequent experimentation: adjusting chunk sizes, testing different embedding models, expanding or pruning the document corpus. Each iteration requires re-embedding affected documents. When that's a single API call per file rather than a pipeline rebuild, experimentation cycles shrink from hours to minutes. The friction in your pipeline determines how fast you can improve retrieval quality.
Next Steps
The pipeline above is functional and production-ready at modest scale. Here's where to take it next:
Add chunking - Long documents benefit from being split into smaller chunks before embedding; each chunk gets its own vector, and retrieval becomes more precise. LangChain's text splitters handle this well and integrate cleanly with the indexing loop in Stage 2.
Tune retrieval quality - Experiment with the n_results parameter and cosine distance thresholds. ChromaDB's where clause also lets you filter by document metadata before semantic search, which is useful for multi-tenant or access-controlled knowledge bases.
Handle document updates incrementally - When a file changes, call embed_document() again and upsert the new vector with collection.upsert(). ChromaDB overwrites the old entry by ID. The Stage 2 indexing loop already skips existing IDs, so you can run it on a schedule safely.
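As a dependency-free sketch of the chunking idea (fixed-size windows with character overlap; LangChain's splitters are smarter about sentence and paragraph boundaries):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Each chunk would then get its own ID and vector in the Stage 2 indexing loop.
```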
Scale your vector store - ChromaDB is ideal for development and moderate corpus sizes. For production at scale, Qdrant (open source, Apache 2.0, 6ms p50 latency at 1M vectors) and Pinecone (fully managed, no infrastructure to operate) are the leading options. The retrieve_context() function is the only thing that changes when you swap vector stores.
Explore the full Telnyx AI stack - Telnyx's AI Inference blog post covers how their team built a production RAG chatbot on the same infrastructure, including the embedding and inference API decisions they made along the way.
Ready to build your RAG pipeline?
High-performance object storage with zero egress fees and native AI embedding: free to start, no credit card required.
Explore Telnyx Cloud Storage →