# Telnyx Voice: TTS — Full Documentation

> Complete page content for TTS (Voice section) of the Telnyx developer docs (https://developers.telnyx.com).
> Root index: https://developers.telnyx.com/llms.txt · Lightweight index for this subsection: https://telnyx-openapi-ng.s3.us-east-1.amazonaws.com/llms/voice/tts.txt

### Overview

> Source: https://developers.telnyx.com/docs/voice/tts/overview.md

## 1. Choose your interface

- Real-time streaming. Send text, receive audio chunks as they're synthesized.
- HTTP POST. Get audio back as binary, base64, or async URL. OpenAI SDK compatible.
- TTS during live calls via Call Control `speak` or TeXML.

## 2. Choose a pre-built voice

- Natural, NaturalHD, Ultra, Kokoro, Qwen3TTS, xAI Grok.
- AWS Polly, Azure, ElevenLabs, Minimax, MurfAI, Rime, Resemble, Inworld.

## 3. Or create your own

Clone and design custom voices. Available on select providers: Qwen3TTS, Minimax, ElevenLabs, Resemble.

---

## WebSocket Streaming

### Lifecycle

> Source: https://developers.telnyx.com/docs/voice/tts/websocket-streaming.md

Real-time text-to-speech over a persistent WebSocket connection. Send text, receive audio.

## Endpoint

```
wss://api.telnyx.com/v2/text-to-speech/speech
```

## Connection Lifecycle

### 1. Handshake

There are two ways to establish a connection:

#### Direct WebSocket connection

You can connect directly to the WebSocket endpoint by passing all configuration as query parameters in the `wss://` URL:

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra
```

Most WebSocket clients and libraries support this natively: open a WebSocket connection to the URL and begin the message flow. No separate HTTP request is needed.

#### HTTP upgrade

Alternatively, initiate the connection as an HTTP GET request that upgrades to a WebSocket via the standard `101 Switching Protocols` handshake.
This is what happens under the hood when a WebSocket client connects, and may be relevant if you need fine-grained control over the upgrade (e.g., setting custom headers in environments where the WebSocket library doesn't expose them directly).

#### Initialization frame

Regardless of how the connection is established, send an initialization frame before any text:

```json
{"text": " "}
```

The initialization frame may include `voice_settings` to configure provider-specific parameters:

```json
{
  "text": " ",
  "voice_settings": {
    "voice_speed": 1.2
  }
}
```

All configuration (query parameters and voice settings) is locked before synthesis begins. See [Configuration](/docs/voice/tts/websocket-streaming/configuration) for both surfaces and the full parameter reference.

### 2. Streaming

Once initialized, text and audio flow concurrently, with no request/response pairing. Text is buffered and synthesized at sentence boundaries.

**Client → Server**

| Frame type | Content |
|-----------|---------|
| text | `{"text": "Hello."}`: text to synthesize |
| text | `{"text": "...", "flush": true}`: force immediate synthesis of buffered text |
| text | `{"force": true}`: interrupt current synthesis (barge-in), restart worker |
| text | `{"text": ""}`: end of sequence, flush remaining buffer and close |

**Server → Client**

| Message | Description |
|---------|-------------|
| Audio chunk | `{"audio":"","text":"Hello.","isFinal":false}` |
| Streamed chunk | `{"audio":"","text":null,"isFinal":false}` (most providers) |
| Final frame | `{"audio":null,"text":"","isFinal":true}`: synthesis complete |
| Error | `{"error":"..."}`: connection closes after |

See [Messages](/docs/voice/tts/websocket-streaming/messages) for the complete wire protocol reference.
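The client frames in the table above can be built with a few small helpers. This is a minimal sketch, not an official SDK: the function names (`init_frame`, `text_frame`, `interrupt_frame`, `end_frame`) are hypothetical, and only the JSON shapes come from this page.

```python
import json
from typing import Any, Optional


def init_frame(voice_settings: Optional[dict[str, Any]] = None) -> str:
    """Initialization frame: a single space, plus optional voice_settings."""
    frame: dict[str, Any] = {"text": " "}
    if voice_settings:
        frame["voice_settings"] = voice_settings
    return json.dumps(frame)


def text_frame(text: str, flush: bool = False) -> str:
    """Text to synthesize; flush=True forces synthesis of buffered partials."""
    if text == "":
        # An empty string is the end-of-sequence signal, so refuse it here.
        raise ValueError("empty text ends the session; use end_frame() instead")
    frame: dict[str, Any] = {"text": text}
    if flush:
        frame["flush"] = True
    return json.dumps(frame)


def interrupt_frame() -> str:
    """Barge-in: stop the current synthesis and restart the worker."""
    return json.dumps({"force": True})


def end_frame() -> str:
    """End of sequence: flush the remaining buffer and close."""
    return json.dumps({"text": ""})
```

Keeping `text_frame` from emitting an empty string is a design choice of this sketch: it prevents an empty LLM token from closing the session by accident.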
```
Client → Server   {"text":" "}                      (handshake)
Client → Server   {"text":"Hello, welcome."}
Client → Server   {"text":" How are you?"}          (sentence boundary detected)
Client ← Server   {"audio":"","isFinal":false}      (streamed chunks)
Client ← Server   {"audio":"","isFinal":false}
Client ← Server   {"audio":null,"isFinal":true}     (synthesis complete)
Client → Server   {"text":""}                       (end of sequence)
Client ← Server   remaining audio + final frame
                  connection closes
```

**Text buffering:** Text accumulates until the server detects a sentence boundary (period, question mark, exclamation mark). Short fragments without punctuation wait for more text. Send `"flush": true` to force synthesis of buffered partials.

**Text preprocessing:** Markdown formatting is automatically stripped before synthesis (headers, bold, italics, code blocks, links, lists, emoji). This is useful when synthesizing LLM output. Pronunciation dictionary replacements are applied if `pronunciation_dict_id` is set.

**Streamed vs. concatenated delivery:** Most providers (Telnyx Natural/NaturalHD/Qwen3TTS, Rime, Minimax, Resemble, Inworld) stream audio in separate frames where `text` is `null`. AWS Polly and Azure return audio in the text-bearing chunk instead. See [Messages](/docs/voice/tts/websocket-streaming/messages) for details.

### 3. Teardown

Send `{"text": ""}` (empty string) to flush remaining buffered text and close gracefully. The server finishes synthesis, sends any remaining audio and a final frame, then closes the WebSocket.

```
Client → Server   {"text":""}
Client ← Server   final audio chunks
Client ← Server   {"audio":null,"text":"","isFinal":true}
Client ← Server   [connection closed]
```

Dropping the connection without the empty-text frame works but may lose buffered text. The connection also closes on server error or inactivity timeout.

See [Examples](/docs/voice/tts/websocket-streaming/examples) for complete code samples.
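To reason about when buffered text will be synthesized, the sentence-boundary rule described under Streaming can be approximated on the client. The sketch below is an illustration only: the server's actual detection logic is not documented beyond "period, question mark, exclamation", and both `split_at_boundaries` and the regular expression are assumptions of this example.

```python
import re

# Assumed rule: '.', '!' or '?' followed by whitespace or end of input.
_BOUNDARY = re.compile(r"[.!?](?=\s|$)")


def split_at_boundaries(buffer: str) -> tuple[list[str], str]:
    """Return (complete sentences, leftover text still waiting for a boundary)."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(buffer):
        sentence = buffer[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    return sentences, buffer[start:]
```

For example, `split_at_boundaries("Hello, welcome. How are")` yields one complete sentence and leaves `" How are"` pending. A non-empty leftover at the end of a turn is the case where sending `"flush": true` or the empty-text frame matters.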
---

### Configuration

> Source: https://developers.telnyx.com/docs/voice/tts/websocket-streaming/configuration.md

## Two Configuration Surfaces

| Surface | When | What | Mutable? |
|---------|------|------|----------|
| Query parameters | WebSocket URL | Voice selection, audio format, sample rate, connection options | No, locked at connect |
| `voice_settings` | Init frame (`{"text": " "}`) | Provider-specific tuning (speed, pitch, format, etc.) | No, locked at init |

Both are one-shot. After the init frame, no configuration can change for the session. To change settings, open a new connection.

## Query Parameters

Set on the URL at connect time. Immutable for the session.

### Voice Selection

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `voice` | string | — | Voice identifier in `Provider.Model.VoiceId` format. |

The `voice_id` segment (third part of the `voice` string) refers to different things depending on the provider and model:

| Type | Example | How you get it |
|------|---------|----------------|
| **Pre-built voice** | `Telnyx.NaturalHD.astra` | Browse via the [Voices API](https://developers.telnyx.com/api-reference/text-to-speech-commands/list-available-voices) or [Voice Design Lab](https://portal.telnyx.com/#/app/ai/voice-design-lab). Shipped by the provider and available to everyone. |
| **Your cloned voice** | `Telnyx.Qwen3TTS.my-ceo-clone` | Create in the [Voice Design Lab](https://portal.telnyx.com/#/app/ai/voice-design-lab). Scoped to your organization: only your API key can use it. Available for Qwen3TTS and Minimax. |
| **BYOK provider voice** | `elevenlabs.v3.Adam` | A voice ID from your own ElevenLabs or Resemble account. You bring your own API key; Telnyx relays the request. |

The Voices API (`GET /v2/ai/tts/voices`) returns all voices available to your account, pre-built and cloned, with each voice's compound `id` ready to use as the `voice` parameter.
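Because every query parameter is locked at connect time, it can help to assemble and validate the URL in one place. The following is a rough sketch under stated assumptions: `split_voice` and `build_tts_url` are hypothetical helper names, the three-part check mirrors the `Provider.Model.VoiceId` format above, and the keyword arguments are simply passed through as the query parameters documented on this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/text-to-speech/speech"


def split_voice(voice: str) -> tuple[str, str, str]:
    """Split 'Provider.Model.VoiceId' into its three segments."""
    parts = voice.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Provider.Model.VoiceId, got {voice!r}")
    provider, model, voice_id = parts
    return provider, model, voice_id


def build_tts_url(voice: str, **params) -> str:
    """Build the connect URL; None values are skipped, booleans are lowercased."""
    split_voice(voice)  # fail early instead of waiting for a 400 at handshake
    query = {"voice": voice}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = value
    return f"{BASE_URL}?{urlencode(query)}"
```

Calling `build_tts_url("Telnyx.NaturalHD.astra", audio_format="linear16", disable_cache=True)` produces a URL of the same shape as the example on this page.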
### Connection Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `language` | string | — | BCP-47 language code. Passed to the provider as `language_code`. Only used by providers that accept it (AWS Polly, Azure, ElevenLabs, Inworld). |
| `text_type` | string | `text` | Text type hint: `text` or `ssml`. Only AWS Polly and Azure use this. |
| `audio_format` | string | `mp3` | Output audio format: `mp3`, `linear16`, `wav`, `mulaw`, `alaw`, `ogg_vorbis`. Not all formats are supported by every provider; see each provider's dedicated page. |
| `sample_rate` | integer | provider default | Output sample rate in Hz. Accepted values vary by provider; see each provider's dedicated page. |
| `disable_cache` | boolean | `false` | Bypass the audio cache and always synthesize fresh. |

### Example

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra&audio_format=linear16&disable_cache=true
```

## Voice Settings

Provider-specific tuning (speed, pitch, format, emotion, etc.) is not set via query parameters. It is passed once in the `voice_settings` object on the [initialization frame](/docs/voice/tts/websocket-streaming#2-streaming):

```json
{
  "text": " ",
  "voice_settings": {
    "voice_speed": 1.2,
    "emotion": "happy"
  }
}
```

Voice settings are applied when the synthesis worker starts and cannot be changed mid-session.

**There are no common voice_settings fields.** Every field is provider-specific: the available fields, defaults, and accepted values differ per provider. Unrecognized fields are silently ignored. See your selected provider's page under [Providers](/docs/voice/tts/providers/telnyx/index) for the exact fields.

---

### Messages

> Source: https://developers.telnyx.com/docs/voice/tts/websocket-streaming/messages.md

## Client → Server

All client messages are JSON text frames.

### Text Frame

- `text`: Text to synthesize. `" "` (single space) for handshake. `""` (empty string) for end-of-sequence.
- `voice_settings`: Provider-specific voice configuration. Only used in the handshake frame (`{"text": " "}`). See [Voice Settings](/docs/voice/tts/websocket-streaming/parameters/voice-settings).
- `flush`: When `true`, immediately synthesizes all buffered text without waiting for a sentence boundary. Default: `false`.
- `force`: When `true`, stops the current synthesis worker and starts a new one. The original handshake is replayed automatically. Use for barge-in/interruption.

### Message Sequence

**1. Handshake** (required first message):

```json
{"text": " "}
```

With optional voice settings:

```json
{
  "text": " ",
  "voice_settings": {
    "voice_speed": 1.2
  }
}
```

**2. Text** (one or more):

```json
{"text": "Hello, welcome to Telnyx."}
```

**3. Flush** (optional, forces synthesis of buffered partial sentences):

```json
{"text": "incomplete fragment", "flush": true}
```

**4. Interrupt** (optional, restarts synthesis):

```json
{"force": true}
```

**5. End of sequence**:

```json
{"text": ""}
```

---

## Server → Client

All server messages are JSON text frames.

### Audio Chunk

Returned when synthesis produces audio for a complete sentence.

```json
{
  "audio": "",
  "text": "Hello, welcome to Telnyx.",
  "isFinal": false,
  "cached": false,
  "timeToFirstAudioFrameMs": 245
}
```

- `audio`: Base64-encoded audio data. `null` when the provider uses streamed delivery, in which case audio arrives in separate streamed chunk frames instead. See note below.
- `text`: The text segment this audio corresponds to. `null` for streamed audio chunks.
- `isFinal`: `false` for audio chunks.
- `cached`: `true` if audio was served from cache.
- `timeToFirstAudioFrameMs`: Time in milliseconds from speech request to first audio frame. Only present on the first chunk of each synthesis.

### Streamed Audio Chunk

For providers that stream audio incrementally (Telnyx Natural, NaturalHD, Qwen3TTS, Rime, Minimax, Resemble, Inworld), audio arrives in separate frames:

```json
{
  "audio": "",
  "text": null,
  "isFinal": false,
  "cached": false
}
```

These frames carry audio only (`text` is always `null`).
The concatenated audio chunk for these providers has `audio: null`; only the streamed chunks carry audio bytes.

For AWS Polly and Azure, audio is returned in the `audio` field of the regular audio chunk frame. For all other providers, ignore the `audio` field on the text-bearing chunk and collect audio from the streamed frames.

### Final Frame

Signals that synthesis is complete for the current text input:

```json
{
  "audio": null,
  "text": "",
  "isFinal": true
}
```

The connection remains open after a final frame, so you can send more text or close.

### Error Frame

```json
{
  "error": "Provider error message"
}
```

The connection closes shortly after an error frame.

---

### Errors

> Source: https://developers.telnyx.com/docs/voice/tts/websocket-streaming/errors.md

## HTTP Errors (Handshake)

These occur during the WebSocket upgrade request, before the connection is established.

| Code | Cause |
|------|-------|
| 400 | Invalid parameters: unsupported provider, missing required fields, or invalid voice format |
| 401 | Missing or invalid API key |
| 403 | Ultra model restricted on public WebSocket endpoint. Use [REST API](/docs/voice/tts/rest-api) for Ultra. |
| 403 | Cloned voice restricted: organization requires identity verification for cloned voices (Qwen3TTS, Minimax clones) |

## WebSocket Errors (Runtime)

After the connection is established, errors arrive as JSON frames:

```json
{
  "error": "Error in audio response"
}
```

The connection closes shortly after an error frame.
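The server frames described in Messages (audio chunk, streamed chunk, final frame, error frame) can be handled by one small loop. This is a sketch rather than a reference client: `collect_audio` and `TTSError` are hypothetical names, and the input is assumed to be an iterable of JSON strings already read from the socket.

```python
import base64
import json
from typing import Iterable


class TTSError(Exception):
    """Raised when the server sends an error frame."""


def collect_audio(frames: Iterable[str]) -> bytes:
    """Decode audio from server frames until the first final frame."""
    audio = bytearray()
    for raw in frames:
        message = json.loads(raw)
        if message.get("error"):
            # The connection closes shortly after an error frame.
            raise TTSError(message["error"])
        if message.get("audio"):
            # Covers both delivery modes: streamed chunks (text is null)
            # and text-bearing chunks that carry audio (AWS Polly, Azure).
            audio.extend(base64.b64decode(message["audio"]))
        if message.get("isFinal"):
            break
    return bytes(audio)
```

Decoding every non-empty `audio` field works for both delivery modes because, per the Messages page, the text-bearing chunk for streamed providers has `audio: null`.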
### Error Messages

| Error | Cause |
|-------|-------|
| `"Error in audio response"` | The TTS provider returned an error during synthesis |
| `"Error in remaining audio response"` | Provider error while synthesizing buffered text during connection close |

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Connection rejected (400) | Invalid voice format | Use `Provider.Model.VoiceId` format (e.g., `Telnyx.NaturalHD.astra`) |
| Connection rejected (401) | Missing auth | Pass an `Authorization: Bearer YOUR_API_KEY` header during the WebSocket upgrade |
| No audio after connecting | Missing handshake | Send `{"text": " "}` as first frame |
| `audio` field is `null` | Expected behavior | For streamed providers (Telnyx, Rime, Minimax, Resemble, Inworld), audio arrives in separate streamed frames |
| Text sent but no response | Sentence buffering | Text is buffered until a sentence boundary. Send more text, use `flush: true`, or end with punctuation |
| Ultra not working on WebSocket | Intentional restriction | Ultra is REST-only. Use `POST /v2/text-to-speech/speech` |
| Cloned voice rejected | Identity verification required | Complete L2 verification in the [Telnyx Portal](https://portal.telnyx.com) |

---

### Examples

> Source: https://developers.telnyx.com/docs/voice/tts/websocket-streaming/examples.md

## Basic Streaming

```python Python
import asyncio
import json
import base64
import websockets

async def tts_stream():
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    # On websockets 14+, pass additional_headers=headers instead.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # 1. Handshake
        await ws.send(json.dumps({"text": " "}))

        # 2. Send text
        await ws.send(json.dumps({"text": "Hello from Telnyx text-to-speech."}))

        # 3. Signal end of input
        await ws.send(json.dumps({"text": ""}))

        # 4. Collect audio
        audio_chunks = []
        async for message in ws:
            data = json.loads(message)
            if data.get("error"):
                print(f"Error: {data['error']}")
                break
            if data.get("audio"):
                audio_chunks.append(base64.b64decode(data["audio"]))
            if data.get("isFinal"):
                break

    # Save audio
    with open("output.mp3", "wb") as f:
        for chunk in audio_chunks:
            f.write(chunk)

asyncio.run(tts_stream())
```

```javascript JavaScript
const WebSocket = require('ws');
const fs = require('fs');

const url = 'wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra';
const ws = new WebSocket(url, {
  headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
});

const audioChunks = [];

ws.on('open', () => {
  // 1. Handshake
  ws.send(JSON.stringify({ text: ' ' }));

  // 2. Send text
  ws.send(JSON.stringify({ text: 'Hello from Telnyx text-to-speech.' }));

  // 3. Signal end of input
  ws.send(JSON.stringify({ text: '' }));
});

ws.on('message', (raw) => {
  const data = JSON.parse(raw);
  if (data.error) {
    console.error('Error:', data.error);
    ws.close();
    return;
  }
  if (data.audio) {
    audioChunks.push(Buffer.from(data.audio, 'base64'));
  }
  if (data.isFinal) {
    fs.writeFileSync('output.mp3', Buffer.concat(audioChunks));
    ws.close();
  }
});
```

## Conversational (Barge-In)

Send multiple text segments and interrupt mid-synthesis:

```python
import asyncio
import json
import base64
import websockets

async def conversational_tts():
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    async with websockets.connect(url, extra_headers=headers) as ws:
        # Handshake with voice settings
        await ws.send(json.dumps({
            "text": " ",
            "voice_settings": {"voice_speed": 1.1}
        }))

        # Send first sentence
        await ws.send(json.dumps({"text": "Welcome to the demo."}))

        # Read until the first sentence's final frame, then interrupt
        async for message in ws:
            data = json.loads(message)
            if data.get("isFinal"):
                break

        # Interrupt and send new text (Python booleans: True, not true)
        await ws.send(json.dumps({"force": True}))
        await ws.send(json.dumps({"text": "Actually, let me start over."}))

        # Collect remaining audio...
        await ws.send(json.dumps({"text": ""}))
        async for message in ws:
            data = json.loads(message)
            if data.get("isFinal"):
                break

asyncio.run(conversational_tts())
```

## LLM Token Streaming

Stream tokens from an LLM directly to TTS. The server buffers text and synthesizes at sentence boundaries:

```python
import asyncio
import json
import websockets

async def llm_to_tts(llm_token_stream):
    url = "wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    async with websockets.connect(url, extra_headers=headers) as ws:
        await ws.send(json.dumps({"text": " "}))

        # Stream LLM tokens directly; TTS handles sentence buffering
        for token in llm_token_stream:
            await ws.send(json.dumps({"text": token}))

        # Done: flush remaining text
        await ws.send(json.dumps({"text": ""}))
```

Markdown in LLM output is automatically stripped before synthesis: headers, bold, italics, code blocks, and links are converted to plain text.

---

## REST API

### Overview

> Source: https://developers.telnyx.com/docs/voice/tts/rest-api.md

## How It Works

You send text in, audio streams back over the same HTTP connection. No polling, no callbacks.

The response uses HTTP chunked transfer encoding, so audio chunks arrive as they're synthesized. Your client can begin playback immediately without waiting for the full file. The connection stays open until synthesis completes or 30 seconds pass with no new chunks.

This makes REST suitable for real-time playback, not just batch file generation. For multi-turn conversational use cases where you're continuously feeding text, use [WebSocket Streaming](/docs/voice/tts/websocket-streaming) instead.

---

## Text Preprocessing

Before synthesis, text passes through two stages:

1. **Markdown stripping**: headers, bold, italics, code blocks, links, lists, and emoji are converted to plain text.
2. **Pronunciation dictionary**: if `pronunciation_dict_id` is set, custom word replacements are applied.

---

## API Reference

The full OpenAPI spec for these endpoints is available in the auto-generated [API Reference](/docs/voice/tts/rest-api/api-reference). Note: the OAS is currently being cleaned up, so some fields and provider-specific schemas may be incomplete.

---

### Request

> Source: https://developers.telnyx.com/docs/voice/tts/rest-api/request.md

## Endpoint

```
POST https://api.telnyx.com/v2/text-to-speech
```

## Example

```bash
curl --request POST \
  --url https://api.telnyx.com/v2/text-to-speech \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello from Telnyx text-to-speech.",
    "voice": "Telnyx.NaturalHD.astra"
  }'
```

## Request Body

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `text` | string | Yes | — | Text to synthesize. Markdown is automatically stripped. |
| `voice` | string | Yes | — | Dot-separated voice identifier. Format: `Provider.Model.VoiceId` (e.g., `Telnyx.NaturalHD.astra`) or `Provider.VoiceId` when the provider has a single model. |
| `output_type` | string | No | `binary_output` | Response format: `binary_output`, `base64_output`, or `audio_id`. |
| `language` | string | No | — | BCP-47 language code (e.g., `en-US`). Supported by AWS Polly, Azure, ElevenLabs, and Inworld. Ignored by other providers. |
| `text_type` | string | No | `text` | `text` or `ssml`. SSML is supported by AWS Polly and Azure. Ultra has its own [SSML emotion syntax](/docs/voice/tts/providers/telnyx/ultra#ssml-emotions). |
| `voice_settings` | object | No | — | Provider-specific tuning (speed, pitch, format, emotion). Fields vary by provider; see individual [provider pages](/docs/voice/tts/providers/telnyx). |
| `pronunciation_dict_id` | string | No | — | UUID of a custom pronunciation dictionary. Word replacements are applied before synthesis. |
| `disable_cache` | boolean | No | `false` | Bypass the audio cache and always synthesize fresh. |

---

### Response

> Source: https://developers.telnyx.com/docs/voice/tts/rest-api/response.md

The `output_type` request field controls what comes back.

## Streaming Audio (default)

With `output_type: "binary_output"` (or omitted), the response is raw audio over HTTP chunked transfer encoding:

```
HTTP/1.1 200 OK
Content-Type: audio/mpeg
Transfer-Encoding: chunked

...
```

Start reading the body immediately; don't buffer the full response.

## Base64

With `output_type: "base64_output"`, the full audio is returned as a JSON payload after synthesis completes:

```json
{"base64_audio": ""}
```

No streaming: the entire file must synthesize before the response is sent.

## Async (audio_id)

With `output_type: "audio_id"`, synthesis runs in the background. You get a URL back immediately:

```json
{"audio_url": "https://api.telnyx.com/v2/text-to-speech/speech/"}
```

Retrieve the audio later with `GET /v2/text-to-speech/speech/:audio_id`. If the audio is still synthesizing, the GET response itself streams chunks as they become available.

---

### Examples

> Source: https://developers.telnyx.com/docs/voice/tts/rest-api/examples.md

## OpenAI SDK Compatibility

The REST endpoint is a drop-in replacement for the OpenAI Audio API:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TELNYX_API_KEY",
    base_url="https://api.telnyx.com/v2"
)

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="astra",
    input="Hello from Telnyx."
)

response.stream_to_file("output.mp3")
```

---

### API Reference

> Source: https://developers.telnyx.com/docs/voice/tts/rest-api/api-reference.md

See the full API reference: [Generate Speech from Text](/api-reference/text-to-speech-commands/generate-speech-from-text).

---

## Providers

### Overview

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx.md

## Selecting Telnyx

Telnyx is the **default provider**.
If you don't specify a provider, you get Telnyx.

**WebSocket:**

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra
```

**REST:**

```bash
curl --request POST \
  --url https://api.telnyx.com/v2/text-to-speech \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello.",
    "voice": "Telnyx.NaturalHD.astra"
  }'
```

---

## Models

| Model | Latency | Quality | Languages | Voice Source | WebSocket | REST |
|-------|---------|---------|-----------|-------------|-----------|------|
| [Natural](/docs/voice/tts/providers/telnyx/natural) | Low | Good | English | Pre-built (Rime Mist) | Yes | Yes |
| [NaturalHD](/docs/voice/tts/providers/telnyx/naturalhd) | Low | Better | 9 languages | Pre-built (Rime Arcana) | Yes | Yes |
| [KokoroTTS](/docs/voice/tts/providers/telnyx/kokoro) | Lowest | Good | 5 languages | Pre-built | Yes | Yes |
| [Qwen3TTS](/docs/voice/tts/providers/telnyx/qwen3) | Medium | High | 11 languages | Cloned (Voice Design Lab) | Yes | Yes |
| [Ultra](/docs/voice/tts/providers/telnyx/ultra) | Lowest | Highest | 36 languages | Pre-built | **No** | Yes |
| [Grok](/docs/voice/tts/providers/telnyx/grok) | Higher | High | 20+ languages | Pre-built | Voice AI | Yes |

**Ultra** is REST-only, not available over public WebSocket. Grok is available for Voice AI Assistants and direct REST TTS calls.

---

Browse all available voices via the [Voices API](https://developers.telnyx.com/api-reference/text-to-speech-commands/list-available-voices) or the [Voice Design Lab](https://portal.telnyx.com/#/app/ai/voice-design-lab).

---

### Natural

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx/natural.md

**Voice format:** `Telnyx.Natural.<voice_id>`

Pre-built English voices backed by Rime Mist.
## Voice Samples

| Voice | Gender |
|-------|--------|
| `Telnyx.Natural.allison` | Female |
| `Telnyx.Natural.brook` | Female |

---

## WebSocket

### Query Parameters

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.Natural.allison&audio_format=mp3
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_format` | string | `mp3` | `mp3`, `linear16`. |
| `sample_rate` | integer | `24000` | 8000, 16000, 22050, 24000, 44100, 48000, 96000. |

### Voice Settings

Send in the init frame (`{"text": " "}`):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `voice_speed` | float | `1.0` | Speech rate multiplier. |

```json
{
  "text": " ",
  "voice_settings": {
    "voice_speed": 1.2
  }
}
```

## REST API

### Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `voice_speed` | float | `1.0` | Speech rate multiplier. |
| `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. |

### Response

Default (`binary_output`): chunked audio bytes with `Content-Type: audio/mpeg` (or `audio/wav`, `audio/pcm`).

With `output_type: "base64_output"`: JSON with base64-encoded audio.

With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval.

---

### NaturalHD

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx/naturalhd.md

**Voice format:** `Telnyx.NaturalHD.<voice_id>`

Pre-built voices backed by Rime Arcana. 9 languages: en, fr, de, es, ar, hi, ja, he, pt.
## Voice Samples

| Voice | Language | Gender |
|-------|----------|--------|
| `Telnyx.NaturalHD.astra` | en-US | Female |
| `Telnyx.NaturalHD.albion` | en-US | Male |
| `Telnyx.NaturalHD.amarante` | fr-FR | Female |
| `Telnyx.NaturalHD.luna` | en-US | Female |

---

## WebSocket

### Query Parameters

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.NaturalHD.astra&audio_format=mp3
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_format` | string | `mp3` | `mp3`, `linear16`. |
| `sample_rate` | integer | `24000` | 8000, 16000, 22050, 24000, 44100, 48000, 96000. |

### Voice Settings

Send in the init frame (`{"text": " "}`):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `voice_speed` | float | `1.0` | Speech rate multiplier. |

```json
{
  "text": " ",
  "voice_settings": {
    "voice_speed": 0.9
  }
}
```

## REST API

### Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `voice_speed` | float | `1.0` | Speech rate multiplier. |
| `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. |

### Response

Default (`binary_output`): chunked audio bytes with `Content-Type: audio/mpeg` (or `audio/wav`, `audio/pcm`).

With `output_type: "base64_output"`: JSON with base64-encoded audio.

With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval.

---

### KokoroTTS

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx/kokoro.md

**Voice format:** `Telnyx.KokoroTTS.<voice_id>`

Lightweight, lowest-latency model. 5 languages: en, es, fr, it, pt.
## Voice Samples

| Voice | Language | Gender |
|-------|----------|--------|
| `Telnyx.KokoroTTS.af_heart` | en-US | Female |
| `Telnyx.KokoroTTS.am_adam` | en-US | Male |
| `Telnyx.KokoroTTS.bf_emma` | en-GB | Female |

---

## WebSocket

### Query Parameters

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.KokoroTTS.af_heart
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_format` | string | `mp3` | `mp3`, `linear16`. |
| `sample_rate` | integer | `24000` | 24000. |

### Voice Settings

None. All synthesis parameters are fixed. The init frame only needs `{"text": " "}`.

## REST API

### Fields

No model-specific fields. Audio format is always MP3.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. |

### Response

Default (`binary_output`): chunked audio bytes with `Content-Type: audio/mpeg`.

With `output_type: "base64_output"`: JSON with base64-encoded audio.

With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval.

---

### Qwen3TTS

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx/qwen3.md

**Voice format:** `Telnyx.Qwen3TTS.<voice_id>`

Voice cloning model. 11 languages: en, zh, fr, de, it, ja, ko, pt, ru, es, ar.

The `voice_id` is the name of a clone you created in the [Voice Design Lab](https://portal.telnyx.com/#/app/ai/voice-design-lab). Clones are scoped to your organization.

## Voice Samples

| Voice | Gender |
|-------|--------|
| `Telnyx.Qwen3TTS.Delta` | Female |
| `Telnyx.Qwen3TTS.Whiskey` | Male |

---

## WebSocket

### Query Parameters

```
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.Qwen3TTS.Delta
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_format` | string | `mp3` | `mp3`, `linear16`. |
| `sample_rate` | integer | `24000` | 24000. |

### Voice Settings

Send in the init frame (`{"text": " "}`):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `language_boost` | string | — | Target language hint: `Auto`, `English`, `Chinese`, `French`, `German`, `Italian`, `Japanese`, `Korean`, `Portuguese`, `Russian`, `Spanish`, or ISO codes. |
| `force_xvector` | boolean | `false` | Force x-vector voice embedding. |

```json
{
  "text": " ",
  "voice_settings": {
    "language_boost": "English"
  }
}
```

## REST API

### Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `language_boost` | string | — | Target language hint. |
| `force_xvector` | boolean | `false` | Force x-vector voice embedding. |
| `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. |

### Response

Default (`binary_output`): chunked PCM audio bytes. Always 24kHz signed 16-bit LE mono.

With `output_type: "base64_output"`: JSON with base64-encoded PCM.

With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval.

---

### Ultra

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx/ultra.md

**Voice format:** `Telnyx.Ultra.<voice_id>`

Sub-100ms latency. 36 languages.

**REST only:** Ultra is not available over public WebSocket.

## Voice Samples

| Voice | Language | Gender |
|-------|----------|--------|
| `Telnyx.Ultra.Asher` | en | Male |
| `Telnyx.Ultra.Callie` | en | Female |
| `Telnyx.Ultra.Clara` | en-US | Female |

---

## SSML Emotions

Ultra supports inline SSML emotion tags. Place the tag before the text:

```
Great news — your order shipped early!
```

**Primary emotions:** `angry`, `excited`, `content`, `sad`, `scared`.

**Additional:** `happy`, `enthusiastic`, `curious`, `calm`, `grateful`, `affectionate`, `sarcastic`, `surprised`, `confident`, `hesitant`, `apologetic`, `determined`, `frustrated`, `disappointed`.
Omitting the tag gives neutral delivery. Use tags sparingly, because Ultra already interprets emotional subtext from the text itself.

## Nonverbal Cues

Insert `[laughter]` inline for natural laughing:

```
That's hilarious! [laughter] Anyway, let me check your account.
```

---

## Language Support

Set `language_boost` to improve pronunciation for the target language:

Arabic, Bengali, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Gujarati, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Marathi, Māori, Norwegian, Polish, Portuguese, Punjabi, Romanian, Russian, Slovak, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Vietnamese.

---

## REST API

### Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `voice_speed` | float | `1.0` | Speech rate multiplier. |
| `language_boost` | string | — | Target language hint. |
| `volume` | float | — | Output volume. |
| `emotion` | string | — | `neutral`, `happy`, `sad`, `angry`, `fearful`, `disgusted`, `surprised`. |
| `sampling_rate` | integer | — | Output sample rate in Hz. |
| `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. |

### Response

Default (`binary_output`): chunked audio bytes with `Content-Type: audio/mpeg`.

With `output_type: "base64_output"`: JSON with base64-encoded audio.

With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval.

## See also

xAI Grok is the second TTS provider supporting Expressive Mode. For Grok voice options, see [Grok Voices](/docs/voice/tts/providers/telnyx/grok). Note: Grok voices have higher latency than Ultra.

---

### Grok

> Source: https://developers.telnyx.com/docs/voice/tts/providers/telnyx/grok.md

xAI Grok voices are expressive text-to-speech voices for Voice AI Assistants. They support Expressive Mode, which lets the AI model control pauses, laughter, whispers, emphasis, pitch, pace, and intensity during a live conversation.
**Higher latency**: Grok voices have higher latency than Ultra. For latency-sensitive applications that need sub-100ms time to first byte, use [Ultra](/docs/voice/tts/providers/telnyx/ultra). ## What makes Grok voices different | Feature | Ultra | Grok | |---------|-------|------| | **Expressive Mode** | SSML emotion tags and `[laughter]` | xAI speech tags for pauses, vocal sounds, and delivery style | | **Voice format** | `Telnyx.Ultra.` | `xAI.` | | **Voices** | Multiple Ultra voices | `ara`, `eve`, `leo`, `rex`, `sal` | | **Language handling** | Language hinting with `language_boost` | `auto` language detection or explicit language code | | **Streaming output** | REST only | Voice AI media streaming | ## Voice format For AI Assistants, Grok voices use the format: ``` xAI. ``` Examples: ``` xAI.eve xAI.ara xAI.leo xAI.rex xAI.sal ``` ## Voices | Voice | Voice ID | Use for | |-------|----------|---------| | Ara | `ara` | Warm, conversational assistant experiences | | Eve | `eve` | General-purpose voice assistant experiences | | Leo | `leo` | Confident, direct interactions | | Rex | `rex` | Characterful or energetic interactions | | Sal | `sal` | Distinctive conversational tone | ## Expressive Mode for AI Assistants When using Grok voices with [AI Assistants](/docs/inference/ai-assistants/no-code-voice-assistant), you can enable **Expressive Mode**. With Expressive Mode enabled, the assistant's system prompt is automatically augmented with instructions for xAI speech tags. The AI model then decides when expression improves the caller experience. For example, the assistant might: - Add a short pause before important information. - Use a softer delivery for sensitive support moments. - Laugh or chuckle naturally when the conversation calls for it. - Emphasize appointment times, confirmation numbers, or next steps. - Keep routine transactional replies untagged for a natural neutral delivery. Use expressive tags sparingly. 
The goal is natural delivery, not tagging every sentence. ### Enable in the portal 1. Go to your assistant in the [Telnyx Portal](https://portal.telnyx.com/#/app/ai/assistants). 2. Under **Voice Settings**, select an xAI Grok voice. 3. Toggle **Expressive Mode** on. 4. Save your assistant. ### Enable via API Set `expressive_mode: true` in your assistant's `voice_settings`: ```bash curl -X PATCH "https://api.telnyx.com/v2/ai/assistants/YOUR_ASSISTANT_ID" \ -H "Authorization: Bearer YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "voice_settings": { "voice": "xAI.eve", "expressive_mode": true } }' ``` ## xAI speech tag reference When Expressive Mode is enabled, the assistant can use these speech tags in responses. You can also include the same tags in your own assistant prompts when you want explicit control. ### Inline tags Place inline tags at the exact point where the vocal expression should happen. | Tag | Use for | |-----|---------| | `[pause]` | A short natural pause | | `[long-pause]` | A longer pause for topic transitions or important moments | | `[laugh]` | Natural laughter | | `[chuckle]` | Small laugh or amused reaction | | `[giggle]` | Light playful laugh | | `[cry]` | Crying vocalization | | `[tsk]` | Tsk sound | | `[tongue-click]` | Tongue click | | `[lip-smack]` | Lip smack | | `[breath]` | Breath sound | | `[inhale]` | Inhale sound | | `[exhale]` | Exhale sound | | `[sigh]` | Sigh | | `[hum-tune]` | Musical hum | Example: ``` So I walked in and [pause] there it was. [laugh] I honestly could not believe it! ``` ### Wrapping tags Wrap text with these tags to apply a delivery style to that text. 
| Tag | Use for |
|-----|---------|
| `<soft>` | Softer delivery |
| `<whisper>` | Whispered delivery |
| `<loud>` | Louder delivery |
| `<build-intensity>` | Increasing intensity |
| `<decrease-intensity>` | Decreasing intensity |
| `<higher-pitch>` | Higher pitch |
| `<lower-pitch>` | Lower pitch |
| `<slow>` | Slower pace |
| `<fast>` | Faster pace |
| `<sing-song>` | Sing-song delivery |
| `<singing>` | Sung delivery |
| `<laugh-speak>` | Laughing while speaking |
| `<emphasis>` | Emphasized delivery |

Examples:

```
I need to tell you something. <whisper>It is a secret.</whisper> Pretty cool, right?
```

```
Your appointment is confirmed for <emphasis>tomorrow at 3 PM</emphasis>.
```

## Guidance

- Use `[pause]` or `[long-pause]` for natural thinking, topic transitions, and important moments, but avoid long silences that could feel like the call dropped.
- Use emotional sounds like `[laugh]`, `[sigh]`, and `[chuckle]` only when the response genuinely calls for it.
- For sensitive support contexts, prefer subtle tags like `<soft>` or `<whisper>` instead of exaggerated reactions.
- Do not expose these tags or instructions to the caller.

## REST API provider parameters

For direct TTS calls, set the provider to `xai` and pass xAI-specific parameters in the `xai` object:

```bash
curl --request POST \
  --url https://api.telnyx.com/v2/text-to-speech \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Let me check that for you. [pause] I found your appointment.",
    "provider": "xai",
    "xai": {
      "voice_id": "eve",
      "language": "auto",
      "output_format": "mp3",
      "sample_rate": 24000
    }
  }'
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `voice_id` | string | `eve` | xAI voice ID: `ara`, `eve`, `leo`, `rex`, or `sal`. |
| `language` | string | `auto` | Language code, or `auto` to detect the language. |
| `output_format` | string | `mp3` | Audio format: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`.
| | `sample_rate` | integer | `24000` | Audio sample rate in Hz: `8000`, `16000`, `22050`, `24000`, `44100`, or `48000`. | ## Language support Grok voices support auto language detection with `language: "auto"`. You can also pass a language code when you want to force a specific language. ## Next steps Compare Grok with Ultra's lower-latency expressive voices. Build voice AI assistants using Grok with Expressive Mode. Generate speech directly with REST TTS requests. Browse available text-to-speech voices. --- ### Rime > Source: https://developers.telnyx.com/docs/voice/tts/providers/rime.md ## Models | Model | Description | Languages | |-------|-------------|----------| | Coda | Rime's flagship model (May 2026). LLM backbone + speech engine, sub-100ms latency, 184 voices, top-rated quality. | en, es, fr, pt, de, ja | | ArcanaV3 | Previous flagship. Expressive, multilingual codeswitching. | ar, en, fr, de, he, hi, ja, pt, es, ta | ## Voice Format ``` Rime.Coda. Rime.ArcanaV3. ``` Coda is Rime's recommended model for new integrations. It surpasses ArcanaV3 in naturalness, prosody, and artifact-free output. ## Voice Samples | Voice | Language | Gender | Sample | |-------|----------|--------|--------| | `Rime.ArcanaV3.albion` | en-US | Male | | | `Rime.ArcanaV3.arcade` | en-US | Male | | --- ## WebSocket ### Query Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `audio_format` | string | `mp3` | `mp3`, `linear16`. | | `sample_rate` | integer | `24000` | 8000, 16000, 22050, 24000, 44100, 48000, 96000. | ### Voice Settings | Field | Type | Default | Description | |-------|------|---------|-------------| | `voice_speed` | float | `1.0` | Speech rate multiplier. | ```json { "text": " ", "voice_settings": { "voice_speed": 0.9 } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `voice_speed` | float | `1.0` | Speech rate multiplier. 
| | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ### Response Default (`binary_output`): chunked audio bytes. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. --- ### Minimax > Source: https://developers.telnyx.com/docs/voice/tts/providers/minimax.md **Voice format:** `minimax..` `voice_id` can be a **system voice** (pre-built) or a **cloned voice** from the [Voice Design Lab](https://portal.telnyx.com/#/app/ai/voice-design-lab) (organization-scoped). ## Voice Samples | Voice | Gender | Sample | |-------|--------|--------| | `Minimax.speech-2.8-turbo.English_expressive_narrator` | Male | | | `Minimax.speech-2.8-turbo.English_radiant_girl` | Female | | --- ## WebSocket ### Query Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `audio_format` | string | `mp3` | `mp3`, `linear16`. | | `sample_rate` | integer | `24000` | 8000, 16000, 22050, 24000, 32000, 44100. | ### Voice Settings | Field | Type | Default | Description | |-------|------|---------|-------------| | `speed` | float | — | Playback speed multiplier. | | `vol` | float | — | Volume level. | | `pitch` | integer | — | Pitch adjustment. | | `language_boost` | string | — | Language emphasis for multilingual synthesis. | ```json { "text": " ", "voice_settings": { "speed": 1.1, "vol": 1.0, "pitch": 0 } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `speed` | float | — | Playback speed multiplier. | | `vol` | float | — | Volume level. | | `pitch` | integer | — | Pitch adjustment. | | `language_boost` | string | — | Language emphasis. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ### Response Default (`binary_output`): chunked audio bytes. 
With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. --- ### Resemble > Source: https://developers.telnyx.com/docs/voice/tts/providers/resemble.md **Voice format:** `resemble.Turbo.` Default model: `Turbo`. `voice_id` is a voice from **your own Resemble account**. ## Voice Samples | Voice | Language | Gender | Sample | |-------|----------|--------|--------| | `Resemble.Turbo.Aaron_en-US` | en-US | Male | | | `Resemble.Turbo.Amelia_en-US` | en-US | Female | | --- ## WebSocket ### Query Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `audio_format` | string | `mp3` | `mp3`, `wav`. | | `sample_rate` | integer | `48000` | 8000, 16000, 22050, 32000, 44100, 48000. | ### Voice Settings | Field | Type | Default | Description | |-------|------|---------|-------------| | `format` | string | `mp3` | `mp3` or `wav`. | | `precision` | string | `PCM_32` | `PCM_16`, `PCM_24`, `PCM_32`, `MULAW`. | | `sample_rate` | string | `48000` (mp3) / `16000` (wav) | `8000`, `16000`, `22050`, `32000`, `44100`, `48000`. Default depends on format. | ```json { "text": " ", "voice_settings": { "format": "wav", "precision": "PCM_16", "sample_rate": "22050" } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `format` | string | `mp3` | `mp3` or `wav`. | | `precision` | string | `PCM_32` | `PCM_16`, `PCM_24`, `PCM_32`, `MULAW`. | | `sample_rate` | string | `48000` / `16000` | Sample rate. Default depends on format. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ### Response Default (`binary_output`): chunked audio bytes. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. 
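The Resemble voice settings above have a format-dependent default for `sample_rate`, and the rate is passed as a string. A minimal sketch of building the initialization frame under those rules follows; the helper name `resemble_init_frame` and its validation behaviour are illustrative assumptions, not part of the Telnyx API, and the allowed values are copied from the Voice Settings table.

```python
import json

# Values copied from the Resemble Voice Settings table; sample_rate is a string there.
DEFAULT_SAMPLE_RATE = {"mp3": "48000", "wav": "16000"}
ALLOWED_RATES = {"8000", "16000", "22050", "32000", "44100", "48000"}
ALLOWED_PRECISION = {"PCM_16", "PCM_24", "PCM_32", "MULAW"}


def resemble_init_frame(fmt="mp3", precision="PCM_32", sample_rate=None):
    """Return the JSON init frame for a Resemble voice (hypothetical helper)."""
    if fmt not in DEFAULT_SAMPLE_RATE:
        raise ValueError(f"format must be 'mp3' or 'wav', got {fmt!r}")
    if precision not in ALLOWED_PRECISION:
        raise ValueError(f"unsupported precision: {precision!r}")
    # Fall back to the documented per-format default when no rate is given.
    rate = DEFAULT_SAMPLE_RATE[fmt] if sample_rate is None else str(sample_rate)
    if rate not in ALLOWED_RATES:
        raise ValueError(f"unsupported sample_rate: {rate!r}")
    frame = {
        "text": " ",  # a single space marks the initialization frame
        "voice_settings": {
            "format": fmt,
            "precision": precision,
            "sample_rate": rate,
        },
    }
    return json.dumps(frame)
```

Calling `resemble_init_frame("wav", "PCM_16")` produces a frame with `sample_rate` set to `"16000"`, the documented default for `wav`. Validating locally is a design choice for this sketch; the service may accept or reject values differently.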
--- ### Inworld > Source: https://developers.telnyx.com/docs/voice/tts/providers/inworld.md **Voice format:** `inworld..` **Models:** `inworld-tts-1.5-mini` (alias `Mini` — faster) and `inworld-tts-1.5-max` (alias `Max` — higher quality). Defaults to `mini`. ## Voice Samples | Voice | Model | Gender | Sample | |-------|-------|--------|--------| | `Inworld.Max.Hank` | Max | Male | | | `Inworld.Mini.Loretta` | Mini | Female | | --- ## WebSocket ### Query Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `audio_format` | string | `mp3` | `mp3`, `linear16`. | | `sample_rate` | integer | `24000` | 8000, 16000, 22050, 24000, 44100, 48000. | | `language` | string | — | BCP-47 language code. | ### Voice Settings | Field | Type | Default | Description | |-------|------|---------|-------------| | `encoding` | string | `MP3` | `MP3` or `LINEAR16`. | | `sample_rate` | integer | `24000` | Output sample rate in Hz. | | `language_code` | string | — | BCP-47. Overrides `language` query param. | ```json { "text": " ", "voice_settings": { "encoding": "LINEAR16", "sample_rate": 16000 } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `encoding` | string | `MP3` | `MP3` or `LINEAR16`. | | `sample_rate` | integer | `24000` | Output sample rate in Hz. | | `language_code` | string | — | BCP-47 language code. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ### Response Default (`binary_output`): chunked audio bytes. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. --- ### xAI > Source: https://developers.telnyx.com/docs/voice/tts/providers/xai.md **Voice format:** `xAI.` xAI Grok voices are expressive, multilingual text-to-speech voices. 
They support inline speech tags for pauses, vocal sounds, emphasis, pitch, pace, and intensity. xAI Grok voices are higher-latency than [Telnyx Ultra](/docs/voice/tts/providers/telnyx/ultra). For latency-sensitive applications that need sub-100ms time to first byte, use Ultra. ## Voices | Voice | Voice ID | Use for | |-------|----------|---------| | Ara | `xAI.ara` | Warm, conversational assistant experiences | | Eve | `xAI.eve` | General-purpose voice assistant experiences | | Leo | `xAI.leo` | Confident, direct interactions | | Rex | `xAI.rex` | Characterful or energetic interactions | | Sal | `xAI.sal` | Distinctive conversational tone | --- ## WebSocket xAI Grok voices are not available on the public TTS WebSocket API. Use the [REST API](/docs/voice/tts/rest-api) for direct text-to-speech generation, or use xAI Grok voices with [AI Assistants](/docs/inference/ai-assistants/no-code-voice-assistant). ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `language` | string | `auto` | Language code, or `auto` to detect automatically. | | `output_format` | string | `mp3` | `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. | | `sample_rate` | integer | `24000` | 8000, 16000, 22050, 24000, 44100, or 48000. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ```json { "text": "Let me check that for you. [pause] I found your appointment.", "voice": "xAI.eve", "voice_settings": { "language": "auto", "output_format": "mp3", "sample_rate": 24000 } } ``` ### Response Default (`binary_output`): chunked audio bytes. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. ## Expressive speech tags Use speech tags inline in `text` when you want more expressive delivery. 
| Tag | Use for |
|-----|---------|
| `[pause]` | A short natural pause |
| `[long-pause]` | A longer pause for topic transitions or important moments |
| `[laugh]`, `[chuckle]`, `[giggle]` | Natural laughter or amused reactions |
| `[sigh]`, `[breath]`, `[inhale]`, `[exhale]` | Breath and sigh sounds |
| `<whisper>` | Whispered delivery |
| `<soft>` | Softer delivery |
| `<loud>` | Louder delivery |
| `<emphasis>` | Emphasized delivery |
| `<slow>`, `<fast>` | Slower or faster pace |
| `<higher-pitch>`, `<lower-pitch>` | Higher or lower pitch |

```text
So I walked in and [pause] there it was. [laugh] I honestly could not believe it!
```

```text
Your appointment is confirmed for <emphasis>tomorrow at 3 PM</emphasis>.
```

Use expressive tags sparingly. The goal is natural delivery, not tagging every sentence.

## AI Assistants

For AI Assistants, choose an xAI Grok voice such as `xAI.eve` and enable **Expressive Mode** to let the assistant decide when speech tags improve the caller experience.

Build voice AI assistants using xAI Grok voices with Expressive Mode. Generate speech directly with REST TTS requests.

---

### AWS Polly

> Source: https://developers.telnyx.com/docs/voice/tts/providers/aws.md

**Voice format:** `aws.Polly..`

Example: `aws.Polly.Generative.Lucia`

The engine can also be parsed from a hyphenated suffix on the voice ID: for example, `Lucia-longform` resolves to engine `long-form`.

## Voice Samples

| Voice | Language | Gender | Sample |
|-------|----------|--------|--------|
| `aws.Polly.Danielle-Neural` | en-US | Female | |
| `aws.Polly.Gregory-Neural` | en-US | Male | |
| `aws.Polly.Lucia-Generative` | es-ES | Female | |

---

## WebSocket

### Query Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_format` | string | `mp3` | `mp3`, `linear16`, `ogg_vorbis`. |
| `sample_rate` | integer | — | 8000, 16000, 22050, 24000. |
| `language` | string | — | BCP-47 language code. Passed as `language_code` to Polly.
| | `text_type` | string | `text` | `text` or `ssml`. Polly supports SSML for fine-grained prosody control. | ### Voice Settings | Field | Type | Default | Description | |-------|------|---------|-------------| | `engine` | string | `standard` | `standard`, `neural`, `generative`, `long-form`. | | `output_format` | string | — | Any [Polly output format](https://docs.aws.amazon.com/polly/latest/dg/API_SynthesizeSpeech.html#polly-SynthesizeSpeech-request-OutputFormat). | | `sample_rate` | string | — | e.g. `"8000"`, `"16000"`, `"22050"`, `"24000"`. Valid values depend on engine and format. | | `lexicon_names` | array | — | Pronunciation lexicon names to apply. | | `language_code` | string | — | BCP-47. Overrides `language` query param. | | `text_type` | string | `text` | `text` or `ssml`. Overrides query param. | ```json { "text": " ", "voice_settings": { "engine": "generative", "output_format": "mp3", "sample_rate": "24000" } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `engine` | string | `standard` | `standard`, `neural`, `generative`, `long-form`. | | `output_format` | string | — | Polly output format. | | `sample_rate` | string | — | Sample rate in Hz. | | `lexicon_names` | array | — | Pronunciation lexicon names. | | `language_code` | string | — | BCP-47 language code. | | `text_type` | string | `text` | `text` or `ssml`. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ### Response Default (`binary_output`): chunked audio bytes. Format depends on `output_format`. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. --- ### Azure > Source: https://developers.telnyx.com/docs/voice/tts/providers/azure.md **Voice format:** `azure.` Example: `azure.en-US-AvaMultilingualNeural` No model ID segment — Azure voices are flat identifiers. 
Default voice: `en-US-AvaMultilingualNeural`. ## Voice Samples | Voice | Language | Gender | Sample | |-------|----------|--------|--------| | `azure.en-US-AvaMultilingualNeural` | en-US | Female | | | `azure.en-US-AndrewMultilingualNeural` | en-US | Male | | --- ## WebSocket ### Query Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `audio_format` | string | `mp3` | `mp3`, `wav`, `linear16`, `mulaw`, `alaw`. | | `sample_rate` | integer | `24000` | 8000, 16000, 24000, 48000. | | `language` | string | `en-US` | BCP-47 language code. | | `text_type` | string | `text` | `text` or `ssml`. Azure supports SSML for pronunciation and prosody control. | ### Voice Settings | Field | Type | Default | Description | |-------|------|---------|-------------| | `output_format` | string | `audio-24khz-160kbitrate-mono-mp3` | See [Azure audio formats](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs). | | `language_code` | string | `en-US` | BCP-47. Overrides `language` query param. | | `text_type` | string | `text` | `text` or `ssml`. Overrides query param. | | `effect` | string | — | `eq_car`, `eq_telecomhp8k`. Audio equalization. | | `gender` | string | — | `Male`, `Female`. Voice gender filter. | ```json { "text": " ", "voice_settings": { "output_format": "audio-48khz-192kbitrate-mono-mp3", "effect": "eq_car" } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `output_format` | string | `audio-24khz-160kbitrate-mono-mp3` | Azure audio format string. | | `language_code` | string | `en-US` | BCP-47 language code. | | `text_type` | string | `text` | `text` or `ssml`. | | `effect` | string | — | `eq_car`, `eq_telecomhp8k`. | | `gender` | string | — | `Male`, `Female`. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. 
| ### Response Default (`binary_output`): chunked audio bytes. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. --- ### ElevenLabs > Source: https://developers.telnyx.com/docs/voice/tts/providers/elevenlabs.md ElevenLabs requires your own API key configured in your Telnyx account. Telnyx relays requests to the ElevenLabs API — voice settings are passed through directly. **Voice format:** `elevenlabs..` Example: `elevenlabs.v3.Adam` `voice_id` is a voice from **your own ElevenLabs account** — pre-built, cloned, or designed. Preview voices at [elevenlabs.io/voice-library](https://elevenlabs.io/voice-library). --- ## WebSocket ### Query Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `audio_format` | string | `mp3` | `mp3`, `linear16`, `mulaw`. | | `sample_rate` | integer | — | 8000, 16000, 22050, 24000, 44100. | | `language` | string | — | BCP-47 language code. | ### Voice Settings Relayed directly to the [ElevenLabs API](https://elevenlabs.io/docs/api-reference/text-to-speech). Any field ElevenLabs accepts can be passed here. | Field | Type | Default | Description | |-------|------|---------|-------------| | `model_id` | string | — | ElevenLabs model override (e.g. `eleven_multilingual_v2`). | | `language_code` | string | — | BCP-47. Overrides `language` query param. | | `stability` | float | — | 0.0–1.0. Voice consistency. | | `similarity_boost` | float | — | 0.0–1.0. Clarity and similarity to original voice. | | `style` | float | — | 0.0–1.0. Style exaggeration. | | `use_speaker_boost` | boolean | — | Speaker boost toggle for clarity. | ```json { "text": " ", "voice_settings": { "stability": 0.5, "similarity_boost": 0.75, "style": 0.3 } } ``` ## REST API ### Fields | Field | Type | Default | Description | |-------|------|---------|-------------| | `model_id` | string | — | ElevenLabs model override. 
| | `stability` | float | — | 0.0–1.0. Voice consistency. | | `similarity_boost` | float | — | 0.0–1.0. Clarity and similarity. | | `style` | float | — | 0.0–1.0. Style exaggeration. | | `use_speaker_boost` | boolean | — | Speaker boost toggle. | | `output_type` | string | `binary_output` | `binary_output`, `base64_output`, or `audio_id`. | ### Response Default (`binary_output`): chunked audio bytes. Format determined by ElevenLabs. With `output_type: "base64_output"`: JSON with base64-encoded audio. With `output_type: "audio_id"`: JSON with an `audio_url` for deferred retrieval. --- ## Other ### Pronunciation Dictionaries > Source: https://developers.telnyx.com/docs/voice/tts/pronunciation-dictionaries.md Pronunciation dictionaries let you control how specific words and phrases are spoken during text-to-speech synthesis. Dictionaries are applied automatically before speech generation — no changes to your text input required. ## Item Types Each dictionary contains up to 100 items. Two types are supported: ### Alias (text replacement) Replaces matched text with alternative text before synthesis: ```json { "text": "ASAP", "type": "alias", "alias": "as soon as possible" } ``` ### Phoneme (IPA notation) Specifies exact pronunciation using the International Phonetic Alphabet: ```json { "text": "GIF", "type": "phoneme", "phoneme": "ɡɪf", "alphabet": "ipa" } ``` ## Using a Dictionary Pass the dictionary ID when synthesizing speech: **REST API:** ```bash curl --request POST \ --url https://api.telnyx.com/v2/text-to-speech \ --header 'Authorization: Bearer ' \ --header 'Content-Type: application/json' \ --data '{ "text": "Welcome to Telnyx.", "voice": "Telnyx.Ultra.002622d8-19d0-4567-a16a-f99c7397c062", "pronunciation_dict_id": "c215a3e1-be41-4701-97e8-1d3c22f9a5b7" }' ``` **WebSocket:** pass `pronunciation_dict_id` as a query parameter on the connection URL: ``` 
wss://api.telnyx.com/v2/text-to-speech/speech?voice=Telnyx.Ultra.002622d8-19d0-4567-a16a-f99c7397c062&pronunciation_dict_id=c215a3e1-be41-4701-97e8-1d3c22f9a5b7 ``` ## Managing Dictionaries ### Create a dictionary ```bash curl --request POST \ --url https://api.telnyx.com/v2/pronunciation_dicts \ --header 'Authorization: Bearer ' \ --header 'Content-Type: application/json' \ --data '{ "name": "My Dictionary", "items": [ { "text": "Telnyx", "type": "phoneme", "phoneme": "ˈtɛl.nɪks", "alphabet": "ipa" }, { "text": "GIF", "type": "phoneme", "phoneme": "ɡɪf", "alphabet": "ipa" }, { "text": "ASAP", "type": "alias", "alias": "as soon as possible" }, { "text": "BTW", "type": "alias", "alias": "by the way" }, { "text": "SQL", "type": "alias", "alias": "sequel" }, { "text": "meeting", "type": "alias", "alias": "3:00 PM" } ] }' ``` You can also upload a PLS/XML or plain text file via `multipart/form-data` instead of providing items as JSON. Plain text format: ``` Telnyx:/ˈtɛl.nɪks/ GIF:/ɡɪf/ ASAP=as soon as possible BTW=by the way SQL=sequel meeting=3:00 PM ``` ### List dictionaries ```bash curl --url 'https://api.telnyx.com/v2/pronunciation_dicts?page[number]=1&page[size]=20' \ --header 'Authorization: Bearer ' ``` ### Get a dictionary ```bash curl --url https://api.telnyx.com/v2/pronunciation_dicts/{id} \ --header 'Authorization: Bearer ' ``` ### Update a dictionary ```bash curl --request PATCH \ --url https://api.telnyx.com/v2/pronunciation_dicts/{id} \ --header 'Authorization: Bearer ' \ --header 'Content-Type: application/json' \ --data '{ "name": "Brand Names v2", "items": [ { "text": "Telnyx", "type": "alias", "alias": "tel-nicks" } ] }' ``` Updates use optimistic locking — if the dictionary was modified concurrently, the request returns `409 Conflict`. Re-fetch and retry. 
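The "re-fetch and retry" advice for `409 Conflict` can be sketched as a small loop. This is an illustration under stated assumptions, not Telnyx client code: `fetch`, `patch`, and `mutate` are hypothetical caller-supplied callables (for example, thin wrappers around the GET and PATCH requests shown above), and the retry count and backoff are arbitrary defaults.

```python
import time


def update_with_retry(fetch, patch, mutate, max_attempts=3, backoff_s=0.0):
    """Retry a dictionary update while the server answers 409 Conflict.

    fetch()   -> current dictionary as a dict (hypothetical, e.g. wraps the GET)
    patch(d)  -> HTTP status code of the PATCH request (hypothetical)
    mutate(d) -> a new dict with the desired changes applied
    Returns (status, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        current = fetch()                 # always start from the latest server copy
        status = patch(mutate(current))   # re-apply the change on top of it
        if status != 409:
            return status, attempt
        time.sleep(backoff_s * attempt)   # simple linear backoff between attempts
    raise RuntimeError(f"still conflicting after {max_attempts} attempts")
```

Re-applying `mutate` to a freshly fetched copy on every attempt is the point of the pattern: the change is rebuilt on top of whatever the concurrent writer saved, rather than resending a stale body.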
### Delete a dictionary

```bash
curl --request DELETE \
  --url https://api.telnyx.com/v2/pronunciation_dicts/{id} \
  --header 'Authorization: Bearer YOUR_API_KEY'
```

## Limits

| Limit | Value |
|-------|-------|
| Dictionaries per organization | 50 |
| Items per dictionary | 100 |
| Text field (match) | 200 characters |
| Alias / phoneme value | 500 characters |
| File upload | 1 MB |

## File Upload Formats

When creating a dictionary via file upload, two formats are supported:

**PLS/XML** (standard [W3C Pronunciation Lexicon Specification](https://www.w3.org/TR/pronunciation-lexicon/) format):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme><grapheme>Telnyx</grapheme><alias>tel-nicks</alias></lexeme>
  <lexeme><grapheme>SQL</grapheme><alias>sequel</alias></lexeme>
  <lexeme><grapheme>IEEE</grapheme><alias>I triple E</alias></lexeme>
  <lexeme><grapheme>nginx</grapheme><phoneme>ɛndʒɪnɛks</phoneme></lexeme>
  <lexeme><grapheme>kubectl</grapheme><phoneme>kuːbkʌtəl</phoneme></lexeme>
  <lexeme><grapheme>Kubernetes</grapheme><phoneme>kuːbɚnɛtɪz</phoneme></lexeme>
</lexicon>
```

**Plain text** (line-based format):

- `word=alias` for alias items
- `word:/phoneme/` for IPA phonemes

---

### In-Call Playback

> Source: https://developers.telnyx.com/docs/voice/tts/in-call-playback.md

In-call TTS plays synthesized speech during live voice calls. There are two integration paths, plus AI Assistants.

## Voice API

Use the [`speak`](/api-reference/call-commands/speak-text) command to play TTS on an active call:

```bash
curl --location 'https://api.telnyx.com/v2/calls/{call_control_id}/actions/speak' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --data '{
    "payload": "Your appointment is confirmed for tomorrow at 3 PM.",
    "voice": "Telnyx.Ultra.3e1ed423-17e5-4773-b87c-25b031106e41"
  }'
```

See [Voice API docs](/docs/voice/programmable-voice) for the full command reference.

## TeXML

Use the `<Say>` element:

```xml
<Response>
  <Say>Your appointment is confirmed for tomorrow at 3 PM.</Say>
</Response>
```

See [TeXML docs](/docs/voice/texml) for the full `<Say>` reference.

## AI Assistants

AI Assistants use TTS for voice output. Configure the voice model in assistant settings. See [AI Assistants](/docs/inference/ai-assistants) for voice configuration.

## Voice Selection

In-call TTS uses the same voice format as WebSocket and REST:

```
Provider.Model.VoiceId
```

All models (including Ultra) are available for in-call playback.
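When routing or logging by provider, it can help to split a voice string into its parts. The sketch below is a heuristic, not an official parser: it assumes the first dot-separated segment is the provider, the last is the voice ID, and anything in between is the model, which accommodates flat identifiers such as `azure.en-US-AvaMultilingualNeural` and model names that themselves contain a dot, such as `speech-2.8-turbo`. It would misparse a voice ID that contains a dot. The names `VoiceRef` and `parse_voice` are hypothetical.

```python
from typing import NamedTuple, Optional


class VoiceRef(NamedTuple):
    provider: str
    model: Optional[str]  # None for flat identifiers like azure.<voice>
    voice_id: str


def parse_voice(voice: str) -> VoiceRef:
    """Split 'Provider.Model.VoiceId' (or 'Provider.VoiceId') into parts."""
    provider, sep, rest = voice.partition(".")
    if not sep or not provider or not rest:
        raise ValueError(f"expected Provider.Model.VoiceId, got {voice!r}")
    # Split on the LAST dot so dots inside the model name are preserved.
    model, _, voice_id = rest.rpartition(".")
    if not voice_id:
        raise ValueError(f"missing voice ID in {voice!r}")
    return VoiceRef(provider, model or None, voice_id)
```

For example, `parse_voice("Minimax.speech-2.8-turbo.English_radiant_girl")` keeps `speech-2.8-turbo` intact as the model, while `parse_voice("azure.en-US-AvaMultilingualNeural")` returns a `model` of `None`.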
--- ### Pricing > Source: https://developers.telnyx.com/docs/voice/tts/rest-api/pricing.md Pricing for WebSocket TTS varies by engine and model. Contact [sales](https://telnyx.com/contact-us) or check the [pricing page](https://telnyx.com/pricing/text-to-speech) for current rates. --- ## API Reference (TTS) ### Text To Speech Commands - [Stream text to speech over WebSocket](https://developers.telnyx.com/api-reference/text-to-speech-commands/stream-text-to-speech-over-websocket.md): Open a WebSocket connection to stream text and receive synthesized audio in real time. Authentication is provided via the standard `Authorization: Bearer