
What Is TTS (Text-to-Speech)? Definition, How It Works, and Uses

Text-to-speech (TTS) turns written text into a natural spoken voice. A plain-language guide to the pipeline, the synthesis types, and the main use cases.

By Osman Husain

What is TTS (text-to-speech)?


Text-to-speech (TTS) is software that converts written text into spoken audio using AI speech synthesis. A text-to-speech system first runs linguistic analysis on the text, then a neural model generates a natural-sounding voice, usually delivered through an API. Text-to-speech powers screen readers, voice assistants, IVR, audiobooks, and voice AI agents. It is the reverse of speech-to-text.

What does TTS stand for?

TTS stands for text-to-speech, sometimes written "text to speech." The technology is also called speech synthesis.

Text-to-speech is the conversion of written text into an audible, spoken voice by a computer. It is the technology behind a screen reader speaking a web page aloud and a voice assistant answering a question.

This page is about text-to-speech. The abbreviation "TTS" also appears in other contexts, and only one of them means something different:

  • Apparel and footwear: "TTS" means "true to size," a fit description, not a technology.
  • Chat and streaming: on Twitch and Discord, "TTS" is still text-to-speech, used to read chat messages aloud on stream.

In computing and AI, TTS almost always means text-to-speech.

How does text-to-speech work?

A text-to-speech system converts text into audio in two stages: linguistic analysis, then speech synthesis. The first stage decides what to say and how it should sound. The second stage generates the actual sound.

Linguistic analysis

Linguistic analysis turns raw text into a phonetic and prosodic plan. The system tokenizes the text, normalizes it (expanding "Dr." to "doctor" and "2026" to "twenty twenty-six"), converts words to phonemes through grapheme-to-phoneme mapping, and predicts prosody: the rhythm, stress, and intonation of each phrase.
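The normalization and grapheme-to-phoneme steps can be sketched in a few lines of Python. This is a toy illustration, not how a production front end is built: the abbreviation table, the year rule, and the two-word lexicon with ARPAbet-style symbols are all assumptions made for the example, and real systems rely on large pronunciation dictionaries and trained models.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
ABBREVIATIONS = {"Dr.": "doctor", "Mr.": "mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Illustrative ARPAbet-style entries; a real lexicon has many thousands of words.
LEXICON = {"doctor": ["D", "AA1", "K", "T", "ER0"], "lee": ["L", "IY1"]}


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def year_to_words(year: str) -> str:
    """Read a four-digit year as two pairs, e.g. 2026 -> twenty twenty-six."""
    first, second = int(year[:2]), int(year[2:])
    if second < 10:
        return year  # toy rule: leave years like 2000 or 1905 untouched
    return f"{two_digits(first)} {two_digits(second)}"


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out four-digit numbers as years."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\b\d{4}\b", lambda m: year_to_words(m.group()), text)


def to_phonemes(text: str) -> list[list[str]]:
    """Look each word up in the lexicon; fall back to its letters as a placeholder."""
    words = re.findall(r"[a-z'-]+", text.lower())
    return [LEXICON.get(word, list(word.upper())) for word in words]


normalized = normalize("Dr. Lee arrives in 2026")
print(normalized)
print(to_phonemes("doctor Lee"))
```

The sketch stops before prosody prediction, which in practice is handled by a model rather than by rules like these.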

Developers can override parts of this stage with SSML, a markup language that controls pronunciation, pauses, emphasis, and speaking rate. SSML is how an application forces a brand name or an acronym to be pronounced correctly.
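As a rough illustration, the snippet below assembles a small SSML document in Python using elements from the W3C SSML specification (sub, say-as, break, prosody). The brand name "Xylo" and its alias are hypothetical, the helper functions are written for this example, and engines differ in which elements and attributes they accept, so treat this as a sketch rather than input guaranteed to work with any particular API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def substitute(text: str, spoken_as: str) -> str:
    """Force a pronunciation by replacing the written form with an alias."""
    return f"<sub alias={quoteattr(spoken_as)}>{escape(text)}</sub>"


def spell_out(acronym: str) -> str:
    """Ask the engine to read an acronym letter by letter."""
    return f'<say-as interpret-as="characters">{escape(acronym)}</say-as>'


def pause(milliseconds: int) -> str:
    """Insert a silent pause of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def slow(text: str) -> str:
    """Lower the speaking rate for a phrase."""
    return f'<prosody rate="slow">{escape(text)}</prosody>'


def build_ssml() -> str:
    parts = [
        "Welcome to ",
        substitute("Xylo", "zy-low"),  # hypothetical brand name and alias
        ".",
        pause(300),
        "Our ",
        spell_out("IVR"),
        " menu has changed. ",
        slow("Please listen carefully."),
    ]
    return "<speak>" + "".join(parts) + "</speak>"


ssml = build_ssml()
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
print(ssml)
```

Escaping the text and quoting attribute values keeps user-supplied strings from breaking the markup, which matters when the spoken text comes from a database or a language model.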

Speech synthesis

Speech synthesis converts the phonetic and prosodic plan into audio in two steps. An acoustic model maps the plan to time-aligned acoustic features, typically a mel-spectrogram that represents how frequencies vary over time. A vocoder then converts those features into a raw audio waveform a device can play.

Modern neural vocoders generate the waveform directly rather than assembling it from stored fragments, which is why current synthetic voices sound close to human.
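To give a feel for what a mel-spectrogram's axes mean, the sketch below uses one widely used formula for the mel scale and computes how many spectrogram frames an acoustic model would need for a clip. The specific settings (80 mel bands, a 22,050 Hz sample rate, a hop of 256 samples) are assumptions chosen because they are common in open-source neural TTS, not values taken from this article.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mels using a common log-based formula."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies (Hz) of bands spaced evenly on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


def frame_count(n_samples: int, hop_length: int) -> int:
    """Number of spectrogram frames for centered framing with the given hop."""
    return n_samples // hop_length + 1


# Assumed settings for illustration only.
SAMPLE_RATE = 22_050
N_MELS = 80
HOP_LENGTH = 256

centers = mel_band_centers(N_MELS, 0.0, SAMPLE_RATE / 2)
frames = frame_count(2 * SAMPLE_RATE, HOP_LENGTH)  # a two-second clip
print(f"lowest band center:  {centers[0]:.1f} Hz")
print(f"highest band center: {centers[-1]:.1f} Hz")
print(f"acoustic model output shape: {N_MELS} bands x {frames} frames")
```

Bands sit close together at low frequencies and spread out at high ones, which roughly mirrors how human hearing resolves pitch. The vocoder's job is to turn that compact bands-by-frames grid back into tens of thousands of audio samples per second.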

[Figure: text-to-speech pipeline, from text to audio]

What is the history of text-to-speech?

Speech synthesis is older than the digital computer. The Voder, built by Homer Dudley at Bell Telephone Laboratories and demonstrated at the 1939 New York World's Fair, is widely credited as the first electronic speech synthesizer. It did not read text; a human operator played it live.

Computer-driven synthesis followed. Noriko Umeda and colleagues at Japan's Electrotechnical Laboratory built what is widely credited as the first full English text-to-speech system in 1968. Through the 1980s, formant synthesizers such as DECtalk produced the recognizably robotic voice of early assistive technology, including the synthetic voice associated with Stephen Hawking.

Quality improved as the underlying method changed. Concatenative synthesis, which stitches together recorded fragments of human speech, became the dominant approach for natural output in the 1990s. Statistical parametric synthesis, led by HMM-based systems, made voices smaller and more flexible through the 2000s.

The neural era began in 2016, when DeepMind's WaveNet generated raw audio waveforms directly and set the template for the natural voices in use today.

What are the types of text-to-speech?

Text-to-speech systems today use one of three main synthesis approaches: concatenative, parametric, or neural. They differ in how the audio is generated and in how natural the result sounds.

Text-to-speech synthesis approaches

| Approach | How it works | Trade-off |
| --- | --- | --- |
| Concatenative | Stitches together pre-recorded fragments of human speech | Clear but choppy; large voice databases; hard to change style |
| Parametric (statistical) | Generates speech from a statistical model of acoustic parameters | Flexible and compact; sounds buzzy and synthetic |
| Neural | Deep neural networks model the waveform directly from text | Most natural and expressive; higher compute cost |
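To make the concatenative approach concrete, here is a toy sketch that stitches stored units together with a short linear crossfade at each join. Short sine tones stand in for recorded speech fragments, and the unit names, sample counts, and crossfade length are all assumptions for illustration; a real system selects units from hours of recorded speech.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate for the toy units


def fake_unit(freq_hz: float, n_samples: int = 800) -> list[float]:
    """Stand-in for a recorded speech fragment: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


# Hypothetical unit database keyed by made-up unit names.
UNIT_DB = {
    "h-e": fake_unit(220.0),
    "e-l": fake_unit(330.0),
    "l-o": fake_unit(440.0),
}


def concatenate(unit_names: list[str], overlap: int = 80) -> list[float]:
    """Join units in order, crossfading `overlap` samples at each boundary."""
    output = list(UNIT_DB[unit_names[0]])
    for name in unit_names[1:]:
        unit = UNIT_DB[name]
        tail_start = len(output) - overlap
        for i in range(overlap):
            weight = i / overlap  # ramps from 0 toward 1 across the join
            output[tail_start + i] = (
                (1 - weight) * output[tail_start + i] + weight * unit[i]
            )
        output.extend(unit[overlap:])
    return output


audio = concatenate(["h-e", "e-l", "l-o"])
print(f"{len(audio)} samples, {len(audio) / SAMPLE_RATE:.3f} seconds")
```

The crossfade hides clicks at the joins, but it cannot change the pitch or style of the stored units, which is the limitation listed in the trade-off column.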

Neural synthesis is the current standard. It learns the relationship between text and sound directly from large speech datasets, which is why its output is more natural and expressive than the older approaches.

For how modern neural voices work, see neural text-to-speech. For the older method it replaced, see concatenative synthesis. The component that maps text to acoustic features is the acoustic model.

What is the difference between text-to-speech and speech-to-text?

Text-to-speech and speech-to-text are inverse technologies. Text-to-speech turns written text into spoken audio. Speech-to-text, also called speech recognition, turns spoken audio into written text.

The two are complementary, not competing. A voice AI agent uses speech-to-text to understand a caller, a language model to decide what to say, and text-to-speech to speak the reply. Conversational AI exists because speech recognition and speech synthesis improved in parallel.
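One conversational turn can be sketched as three functions chained together. The class and the stub functions below are hypothetical placeholders written for this example; in a real agent each slot would be filled by a call to a speech recognition service, a language model, and a text-to-speech engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """Wires the three stages of a voice AI turn together."""
    transcribe: Callable[[bytes], str]    # speech-to-text
    decide_reply: Callable[[str], str]    # language model
    synthesize: Callable[[str], bytes]    # text-to-speech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """Audio in, audio out: understand, decide, then speak."""
        caller_text = self.transcribe(caller_audio)
        reply_text = self.decide_reply(caller_text)
        return self.synthesize(reply_text)


# Stub stages so the sketch runs without any external service.
def fake_speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the words


def fake_language_model(text: str) -> str:
    return f"You said: {text}"


def fake_text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the bytes are a waveform


agent = VoiceAgent(fake_speech_to_text, fake_language_model, fake_text_to_speech)
reply_audio = agent.handle_turn(b"hello")
print(reply_audio)
```

Keeping each stage behind a plain function signature makes it straightforward to swap one engine for another without touching the rest of the turn logic.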

A simple test: if the input is text and the output is sound, it is text-to-speech. If the input is sound and the output is text, it is speech-to-text.

What is text-to-speech used for?

Text-to-speech began as accessibility technology and now powers any product that needs a spoken interface. Its original purpose was reading digital text aloud for people with visual impairments or reading disabilities.

The main applications today:

  • Accessibility: screen readers and assistive devices. The physicist Stephen Hawking, who used a speech synthesizer for decades, is the best-known user of the technology.
  • Voice AI agents and IVR: spoken responses for voice AI agents, call deflection, and self-service phone systems.
  • Audiobooks and podcasts: machine narration of articles, books, and AI-generated audio.
  • E-learning and training: narrated lessons, courses, and instructions.
  • Navigation and announcements: turn-by-turn directions, transit and airport announcements.
  • Media and entertainment: character voices, dubbing, and game dialogue.
  • Healthcare: medication reminders and spoken patient instructions.

Frequently asked questions

What does TTS stand for? TTS stands for text-to-speech, the conversion of written text into spoken audio by a computer using speech synthesis. In apparel, "TTS" can mean "true to size," but in computing and AI it almost always means text-to-speech.

Is text-to-speech the same as speech-to-text? No. Text-to-speech converts written text into spoken audio. Speech-to-text, also called speech recognition, does the reverse and converts spoken audio into written text. They are complementary technologies that combine in conversational AI: speech-to-text understands the user, text-to-speech speaks the response.

How does text-to-speech work? A text-to-speech system works in two stages. Linguistic analysis normalizes the text, converts words to phonemes, and predicts prosody. Speech synthesis then uses an acoustic model to produce acoustic features and a vocoder to generate the final audio waveform from those features.

What is text-to-speech used for? Text-to-speech is used for screen readers and accessibility, voice AI agents and IVR phone systems, audiobook and podcast narration, e-learning, turn-by-turn navigation, media and game voices, and healthcare reminders. It started as assistive technology and expanded as synthetic voices became natural enough for production use.

Is text-to-speech artificial intelligence? Modern text-to-speech is a form of artificial intelligence. Current systems use deep neural networks trained on large speech datasets to generate natural voices. Older concatenative and parametric methods are not usually described as AI in the modern sense, but neural text-to-speech, the current standard, is a machine learning application.

What is neural TTS? Neural TTS is text-to-speech that uses deep neural networks to model speech directly from text, instead of stitching recorded fragments or using a statistical model. It produces the most natural and expressive synthetic voices and is the basis of nearly every modern text-to-speech system.

What does TTS mean in chat or on Twitch? On Twitch, Discord, and other chat platforms, "TTS" still means text-to-speech. Streamers use a text-to-speech feature to read viewer chat messages or donation messages aloud on stream, so the underlying technology is the same as everywhere else in computing.

Build voice features with text-to-speech

Text-to-speech turns written text into a natural spoken voice through a two-stage pipeline of linguistic analysis and neural synthesis, and it underpins accessibility, voice AI agents, and audio media. Understanding the pipeline is the first step; the next is choosing how to generate the voice in production.

The Telnyx TTS API provides text-to-speech from leading voice engines through a single API, hosted on Telnyx edge infrastructure for real-time voice AI.

