Text-to-speech (TTS) — Pith glossary

Text-to-speech (TTS) is the synthesis of spoken audio from written text. Modern systems use a neural model to convert a string of words into a natural-sounding voice.

Why it matters

TTS crossed a quality threshold around 2023. The robotic cadence of older systems gave way to neural voices (ElevenLabs, OpenAI, Google, Microsoft) with believable prosody, pacing, and emphasis — good enough that a listener stops noticing it is synthetic. That unlocked a use case that had never quite worked before: **listening to your reading**.

The appeal is structural. Reading demands eyes and a screen; listening frees both. A commute, a walk, a gym session — dead time for reading becomes live time for consumption. The format that has emerged is podcast-shaped: a few minutes of spoken narration you can take with you, distinct from a wall of text you have to sit down for. NotebookLM's audio overviews made the pattern legible to a mass audience; the underlying move is simply to render synthesis as sound.

The trade-off is real and worth naming. Audio is linear — you cannot skim it, scan back to a half-remembered line, or check a citation mid-sentence. It is excellent for absorbing a briefing once and poor for reference. The mature view treats TTS as a *delivery channel* for the right content, not a replacement for the readable, linkable source beneath it.

How Pith relates

Pith uses TTS to generate podcast-style audio briefings from the sources you have saved — per client or per period — so you can listen to the week's reading on the move instead of sitting down to it. The text briefing and its citations remain the canonical, checkable artefact; the audio is the channel that meets you where reading cannot. See the [briefing](/glossary/briefing) entry for the format.
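To make the split between the canonical text and the audio channel concrete, here is a minimal sketch of how a text briefing might be prepared for narration. It is an illustration under stated assumptions, not Pith's actual pipeline: the function names, the 400-character request cap, and the 150 words-per-minute pace are all hypothetical, and no real TTS service is called.

```python
import re

# Hypothetical values, chosen for illustration only.
MAX_CHARS = 400          # assumed per-request input cap of a TTS service
WORDS_PER_MINUTE = 150   # assumed narration pace


def to_narration(markdown_text: str) -> str:
    """Strip markup that reads badly aloud; the text briefing keeps the citations."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", markdown_text)  # [label](url) -> label
    text = re.sub(r"\s*\[\d+\]", "", text)                          # drop [1]-style markers
    text = re.sub(r"[*`]", "", text)                                # drop emphasis characters
    return re.sub(r"\s+", " ", text).strip()


def chunk_sentences(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def estimated_minutes(text: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Rough listening time, assuming a constant narration pace."""
    return len(text.split()) / wpm


briefing = (
    "Neural voices improved a lot [1]. "
    "See the [briefing](/glossary/briefing) entry. Audio is **linear**."
)
script = to_narration(briefing)
chunks = chunk_sentences(script, max_chars=60)
```

Each chunk would then be sent to whichever TTS service is in use and the resulting audio segments joined in order. Splitting on sentence boundaries keeps the prosody of each request intact, and stripping citation markers only from the narration leaves the text briefing as the checkable source.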
