Text-to-speech (TTS) turns written text into a natural spoken voice. This plain-language guide covers the pipeline, the synthesis types, and the main use cases.

Text-to-speech (TTS) is software that converts written text into spoken audio using AI speech synthesis. A text-to-speech system first runs linguistic analysis on the text, then a neural model generates a natural-sounding voice, usually delivered through an API. Text-to-speech powers screen readers, voice assistants, IVR, audiobooks, and voice AI agents. It is the reverse of speech-to-text.
TTS stands for text-to-speech, sometimes written "text to speech" and also called speech synthesis.
Text-to-speech is the conversion of written text into an audible, spoken voice by a computer. It is the technology behind a screen reader speaking a web page aloud and a voice assistant answering a question.
This page is about text-to-speech. Other senses of "TTS," such as "true to size" in apparel, are distinct.
In computing, networking, and AI, TTS almost always means text-to-speech.
A text-to-speech system converts text into audio in two stages: linguistic analysis, then speech synthesis. The first stage decides what to say and how it should sound. The second stage generates the actual sound.
Linguistic analysis turns raw text into a phonetic and prosodic plan. The system tokenizes the text, normalizes it (expanding "Dr." to "doctor" and "2026" to "twenty twenty-six"), converts words to phonemes through grapheme-to-phoneme mapping, and predicts prosody: the rhythm, stress, and intonation of each phrase.
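The normalization and grapheme-to-phoneme steps above can be sketched in a few lines of Python. This is a toy illustration, not how any particular engine works: the lookup tables, the function names, and the ARPAbet-style phoneme strings are all assumptions made for the example, and prosody prediction is left out entirely.

```python
import re

# Hypothetical, deliberately tiny tables. Production systems rely on large
# pronunciation lexicons and trained models instead.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}
LEXICON = {  # ARPAbet-style phonemes, for illustration only
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "lee": ["L", "IY1"],
}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def year_to_words(year: str) -> str:
    """Read a four-digit year as two pairs ("2026" -> "twenty twenty-six").

    Sketch only: years such as 2000 or 1905 need special handling.
    """
    return f"{two_digits(int(year[:2]))} {two_digits(int(year[2:]))}"


def normalize(text: str) -> str:
    """Tokenize on whitespace, expand abbreviations and years, lowercase."""
    words = []
    for token in text.split():
        key = token.lower()
        if key in ABBREVIATIONS:
            words.append(ABBREVIATIONS[key])
            continue
        key = key.strip(".,!?;:")
        if re.fullmatch(r"\d{4}", key):
            words.append(year_to_words(key))
        elif key:
            words.append(key)
    return " ".join(words)


def to_phonemes(word: str) -> list[str]:
    """Look a word up in the lexicon; mark unknown words with a placeholder."""
    return LEXICON.get(word, [f"<{word}>"])


normalized = normalize("Dr. Lee arrives in 2026.")
phonemes = [to_phonemes(word) for word in normalized.split()]
```

Real systems also have to resolve ambiguity that a lookup table cannot, such as whether "Dr." means "doctor" or "drive", which is one reason this stage is usually model-driven.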
Developers can override parts of this stage with SSML, a markup language that controls pronunciation, pauses, emphasis, and speaking rate. SSML is how an application forces a brand name or an acronym to be pronounced correctly.
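As a rough illustration of those controls, the snippet below holds a small SSML document in a Python string and parses it with the standard library to confirm it is well-formed XML. The element names (`sub`, `break`, `say-as`, `prosody`, `emphasis`) come from the W3C SSML specification; the sentence content is invented for the example, and which elements and attribute values a given voice engine honors is an assumption to check against that engine's documentation.

```python
import xml.etree.ElementTree as ET

# Example SSML. Engines differ in which elements they support, and some
# require xmlns and version attributes on <speak>.
SSML = """<speak>
  The <sub alias="World Wide Web Consortium">W3C</sub> publishes the SSML standard.
  <break time="500ms"/>
  The acronym is spelled <say-as interpret-as="characters">TTS</say-as>.
  <prosody rate="slow">This sentence is spoken slowly,</prosody>
  and <emphasis level="strong">this phrase</emphasis> is stressed.
</speak>"""

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
root = ET.fromstring(SSML)

# Direct children of <speak>, in document order.
tags_used = [element.tag for element in root]
```

Validating the markup before sending it to a synthesis service catches unclosed tags early, which is cheaper than debugging a rejected request.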
Speech synthesis converts the phonetic and prosodic plan into audio in two steps. An acoustic model maps the plan to time-aligned acoustic features, typically a mel-spectrogram that represents how frequencies vary over time. A vocoder then converts those features into a raw audio waveform a device can play.
Modern neural vocoders generate the waveform directly rather than assembling it from stored fragments, which is why current synthetic voices sound close to human.
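To give a sense of what the acoustic model hands to the vocoder, the sketch below works out the shape of a mel-spectrogram and the spacing of its frequency bands. The sample rate, hop length, and number of mel bands are assumed values chosen for illustration, not settings taken from any specific system, and the mel conversion uses one widely used formula among several variants.

```python
import math

# Assumed settings for illustration only.
SAMPLE_RATE = 22050  # audio samples per second
HOP_LENGTH = 256     # audio samples between consecutive spectrogram frames
N_MELS = 80          # number of mel frequency bands per frame


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (one common formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies in Hz, spaced evenly on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


def spectrogram_shape(duration_s: float) -> tuple[int, int]:
    """(frames, mel bands) for a clip, using one common framing convention."""
    n_samples = int(duration_s * SAMPLE_RATE)
    n_frames = 1 + n_samples // HOP_LENGTH
    return n_frames, N_MELS


shape = spectrogram_shape(2.0)
centers = mel_band_centers(N_MELS, 0.0, SAMPLE_RATE / 2)
```

Under these assumptions, two seconds of speech is a grid of 173 frames by 80 bands, and the vocoder's job is to expand each frame back into roughly 256 audio samples. Because the bands are evenly spaced on the mel scale, they sit closer together at low frequencies, where human hearing is more sensitive to pitch differences.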
Speech synthesis predates the digital computer. The Voder, built by Homer Dudley at Bell Telephone Laboratories and demonstrated at the 1939 New York World's Fair, is widely credited as the first electronic speech synthesizer. It did not read text: a trained human operator played it live.
Computer-driven synthesis followed. Noriko Umeda and colleagues at Japan's Electrotechnical Laboratory built what is widely credited as the first full English text-to-speech system in 1968. Through the 1980s, formant synthesizers such as DECtalk produced the recognizably robotic voice of early assistive technology, including the synthetic voice associated with Stephen Hawking.
Quality improved as the underlying method changed. Concatenative synthesis, which stitches together recorded fragments of human speech, became the dominant approach for natural output in the 1990s. Statistical parametric synthesis, led by HMM-based systems, made voices smaller and more flexible through the 2000s.
The neural era began in 2016, when DeepMind's WaveNet generated raw audio waveforms directly and set the template for the natural voices in use today.
Text-to-speech systems use one of three synthesis approaches: concatenative, parametric, or neural. They differ in how the audio is generated and in how natural the result sounds.
| Approach | How it works | Trade-off |
|---|---|---|
| Concatenative | Stitches together pre-recorded fragments of human speech | Clear but choppy; large voice databases; hard to change style |
| Parametric (statistical) | Generates speech from a statistical model of acoustic parameters | Flexible and compact; sounds buzzy and synthetic |
| Neural | Deep neural networks model the waveform directly from text | Most natural and expressive; higher compute cost |
Neural synthesis is the current standard. It learns the relationship between text and sound directly from large speech datasets, which is why its output is more natural and expressive than the older approaches.
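For contrast, the concatenative row of the table can be made concrete with a toy example. The function below joins waveform fragments, represented as plain lists of samples, and applies a linear crossfade at each seam. It is a simplified sketch under stated assumptions: the function name and the crossfade scheme are chosen for illustration, and real concatenative systems also select which recorded units to use and adjust pitch and duration at the joins.

```python
def crossfade_concat(units: list[list[float]], overlap: int) -> list[float]:
    """Join waveform fragments, linearly crossfading `overlap` samples per seam.

    Assumes every fragment has at least `overlap` samples.
    """
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            # Weight moves from the outgoing fragment to the incoming one.
            w = (i + 1) / (overlap + 1)
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        # Append the part of the incoming fragment that was not blended.
        out.extend(unit[overlap:])
    return out


# Two stand-in "recordings": a constant 1.0 signal and a constant 0.0 signal.
joined = crossfade_concat([[1.0] * 4, [0.0] * 4], overlap=2)
```

Even with smoothing at the seams, the fragments keep the pitch and style of the original recordings, which is why the table lists concatenative voices as hard to restyle.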
For how modern neural voices work, see neural text-to-speech. For the older method it replaced, see concatenative synthesis. The component that maps text to acoustic features is the acoustic model.
Text-to-speech and speech-to-text are inverse technologies. Text-to-speech turns written text into spoken audio. Speech-to-text, also called speech recognition, turns spoken audio into written text.
The two are complementary, not competing. A voice AI agent uses speech-to-text to understand a caller, a language model to decide what to say, and text-to-speech to speak the reply. Conversational AI exists because speech recognition and speech synthesis improved in parallel.
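One turn of the voice agent described above can be sketched as three function calls in sequence. Everything named here is hypothetical: the type aliases, `handle_turn`, and the stand-in functions exist only to show the data flow, and a real agent would call actual speech recognition, language model, and synthesis services, usually with streaming.

```python
from typing import Callable

# Hypothetical interfaces: each stage is just a function for this sketch.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def handle_turn(
    caller_audio: bytes,
    stt: SpeechToText,
    llm: LanguageModel,
    tts: TextToSpeech,
) -> bytes:
    """Run one conversational turn: audio in, audio out."""
    transcript = stt(caller_audio)   # speech-to-text: understand the caller
    reply_text = llm(transcript)     # language model: decide what to say
    return tts(reply_text)           # text-to-speech: speak the reply


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = handle_turn(b"what are your hours", fake_stt, fake_llm, fake_tts)
```

Keeping each stage behind a plain function signature makes it straightforward to swap one engine for another without touching the rest of the loop.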
A simple test: if the input is text and the output is sound, it is text-to-speech. If the input is sound and the output is text, it is speech-to-text.
Text-to-speech began as accessibility technology and now powers any product that needs a spoken interface. Its original purpose was reading digital text aloud for people with visual impairments or reading disabilities.
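The main applications today are screen readers and other accessibility tools, voice AI agents and IVR phone systems, audiobook and podcast narration, e-learning, turn-by-turn navigation, media and game voices, and healthcare reminders.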
What does TTS stand for? TTS stands for text-to-speech: the conversion of written text into spoken audio by a computer using speech synthesis. In apparel, "TTS" can mean "true to size," but in computing and AI it almost always means text-to-speech.
Is text-to-speech the same as speech-to-text? No. Text-to-speech converts written text into spoken audio. Speech-to-text, also called speech recognition, does the reverse and converts spoken audio into written text. They are complementary technologies that combine in conversational AI: speech-to-text understands the user, text-to-speech speaks the response.
How does text-to-speech work? A text-to-speech system works in two stages. Linguistic analysis normalizes the text, converts words to phonemes, and predicts prosody. Speech synthesis then uses an acoustic model to produce acoustic features and a vocoder to generate the final audio waveform from those features.
What is text-to-speech used for? Text-to-speech is used for screen readers and accessibility, voice AI agents and IVR phone systems, audiobook and podcast narration, e-learning, turn-by-turn navigation, media and game voices, and healthcare reminders. It started as assistive technology and expanded as synthetic voices became natural enough for production use.
Is text-to-speech artificial intelligence? Modern text-to-speech is a form of artificial intelligence. Current systems use deep neural networks trained on large speech datasets to generate natural voices. Older concatenative and parametric methods were not AI in the modern sense, but neural text-to-speech, the current standard, is a machine learning application.
What is neural TTS? Neural TTS is text-to-speech that uses deep neural networks to model speech directly from text, instead of stitching recorded fragments or using a statistical model. It produces the most natural and expressive synthetic voices and is the basis of nearly every modern text-to-speech system.
What does TTS mean in chat or on Twitch? On Twitch, Discord, and other chat platforms, "TTS" still means text-to-speech. Streamers use a text-to-speech feature to read viewer chat messages or donation messages aloud on stream, so the underlying technology is the same as everywhere else in computing.
Text-to-speech turns written text into a natural spoken voice through a two-stage pipeline of linguistic analysis and neural synthesis, and it underpins accessibility, voice AI agents, and audio media. Understanding the pipeline is the first step; the next is choosing how to generate the voice in production.
The Telnyx TTS API provides access to leading voice engines through a single API, hosted on Telnyx edge infrastructure for real-time voice AI.