Conversational AI

How to Build an AI Audio Translator with Telnyx STT, Inference, and TTS

Learn how to build an AI audio translator with Telnyx STT, AI Inference, and TTS. Transcribe source audio, translate the transcript, and generate target-language speech in one Flask pipeline.

By Sonam Gupta

Audio content is hard to localize manually. A podcast clip, customer interview, lecture, meeting recording, or product walkthrough has to be transcribed, translated, reviewed, and turned back into spoken audio before it is useful in another language.

This example shows the core workflow in one small Flask app:

  1. Upload an audio file
  2. Transcribe it with Telnyx Speech-to-Text
  3. Translate the transcript with Telnyx AI Inference
  4. Generate target-language speech with Telnyx Text-to-Speech
  5. Return a job result with transcripts and audio metadata

The full example is open source in telnyx-code-examples under ai-content-translator-python.

What the App Does

The app exposes four routes:

  • POST /translate - upload audio and start the STT -> translation -> TTS pipeline
  • GET /translate/<job_id> - retrieve the full translation job
  • GET /languages - list supported target languages
  • GET /health - check service status

The app supports English, Spanish, French, German, Portuguese, Japanese, Korean, Chinese, Arabic, Hindi, and Italian language codes. The sample stores translation jobs in memory so the flow stays easy to inspect.

The Architecture

Audio upload
     |
     v
Telnyx STT transcribes source audio
     |
     v
Telnyx AI Inference translates transcript
     |
     v
Telnyx TTS generates target-language speech
     |
     v
Job result with transcripts and audio segment metadata

This is not a full dubbing studio. It is the smallest useful version of the pipeline, designed so developers can see each step clearly.

Clone and Run

git clone https://github.com/team-telnyx/telnyx-code-examples.git
cd telnyx-code-examples/ai-content-translator-python
cp .env.example .env
pip install -r requirements.txt
python app.py

Set your API key and optional model choices:

TELNYX_API_KEY=your_telnyx_api_key
AI_MODEL=moonshotai/Kimi-K2.6
TTS_MODEL=telnyx/tts
STT_MODEL=telnyx/asr
HOST=127.0.0.1

Check supported languages:

curl http://localhost:5000/languages

Upload an audio file:

curl -X POST http://localhost:5000/translate \
  -F [email protected] \
  -F source=en \
  -F target=ja

The response includes a job_id, status, source and target languages, transcript lengths, audio segment count, and preview text.

The Core Pipeline

The app has three helper functions that mirror the workflow.

transcribe() sends the uploaded audio bytes to Telnyx STT:

resp = requests.post(
    f"{API}/ai/transcribe",
    headers={"Authorization": f"Bearer {TELNYX_API_KEY}"},
    files={"file": ("audio.mp3", audio_bytes, "audio/mpeg")},
    data={"model": STT_MODEL, "language": language, "timestamps": True},
    timeout=120,
)

inference() sends a translation prompt to Telnyx AI Inference:

resp = requests.post(
    f"{API}/ai/chat/completions",
    headers=HEADERS,
    json={
        "model": AI_MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.3,
    },
    timeout=60,
)

tts_generate() turns the translated text into audio:

resp = requests.post(
    f"{API}/ai/generate",
    headers=HEADERS,
    json={
        "model": TTS_MODEL,
        "voice": voice,
        "text": text,
        "language": language,
        "output_format": "mp3",
    },
    timeout=30,
)

Each step is visible, debuggable, and replaceable.

Why This Pattern Works

Many audio translation workflows become complicated because each step lives in a different service. You transcribe with one provider, send text to another model provider, then generate audio with a third voice provider.

This example keeps the pipeline inside Telnyx AI APIs. That reduces integration overhead and makes the workflow easier to reason about.

It also uses job IDs instead of trying to return every internal detail in the first response. That gives the app a shape that can grow into a production workflow with persistent storage, status polling, retries, and downloadable audio files.

Production Notes

Persist jobs in a database instead of memory. The sample uses an in-memory dictionary because it is easy to read.

Store generated audio in object storage and return signed download URLs. The sample tracks audio segment sizes, but production users will want actual files.

Add authentication before exposing the Flask API. Uploaded audio may contain sensitive content.

Chunk long transcripts intentionally. The sample chunks TTS text at 1,000 characters, but production systems should split on sentence boundaries.

Add human review when tone, legal language, or brand voice matters.

Share on Social