High-Quality Text-to-Speech (TTS) Technology
This page introduces background technologies that help explain BrainCheck's TTS features. It focuses on recent speech synthesis work such as CALM and Pocket TTS. This does not mean BrainCheck uses these models directly.
Limits of Discrete-Token Speech Synthesis
One common approach in modern speech synthesis converts audio into a sequence of discrete tokens, then uses a language model to generate those tokens.
This creates a fundamental tradeoff. Audio tokens come from a lossy codec, so higher audio quality usually requires a higher token rate. More tokens mean more autoregressive generation steps, which raises compute cost and latency. The system must therefore trade quality against speed.
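To make the tradeoff concrete, here is a back-of-the-envelope calculation. The token rates below are illustrative assumptions, not figures from any specific codec:

```python
# Illustrative arithmetic: how the codec's token rate couples audio
# quality to decoding compute. All rates here are assumed numbers.

def tokens_for_clip(seconds: float, tokens_per_second: int) -> int:
    """Total discrete tokens an autoregressive model must generate."""
    return int(seconds * tokens_per_second)

clip_seconds = 10.0
low_quality_rate = 50    # assumed: low-bitrate codec, 50 tokens/s
high_quality_rate = 200  # assumed: higher-fidelity codec, 200 tokens/s

low = tokens_for_clip(clip_seconds, low_quality_rate)    # 500 tokens
high = tokens_for_clip(clip_seconds, high_quality_rate)  # 2000 tokens

# Each token costs one autoregressive step, so 4x the tokens means
# roughly 4x the decoding compute and latency for the same clip.
print(low, high, high / low)
```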
Continuous Audio Language Models (CALM)
Kyutai Labs proposed CALM (Continuous Audio Language Models) as one way to address this problem.
Core Idea
Instead of turning audio into discrete tokens, CALM directly generates continuous audio representations.
- A large Transformer backbone reads the text and audio context and produces a hidden summary at each step.
- A smaller MLP head uses that summary to predict the next continuous audio frame.
- The model is trained so that consecutive frames connect smoothly and consistently.
Because this avoids the bitrate bottleneck of discrete tokenization, the paper reports that CALM can generate speech with less computation at comparable quality.
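The backbone-plus-head loop above can be sketched in a few lines. This is a toy stand-in, not the paper's architecture: the "networks" are single random linear layers and the context update rule is invented for illustration; the point is only that each step emits a continuous vector rather than a discrete token.

```python
# Toy sketch of a CALM-style generation loop: a large backbone summarizes
# context, a small head predicts the next continuous frame. All weights
# and the context-update rule are illustrative stand-ins.
import random

DIM = 8  # toy hidden/frame dimension
random.seed(0)

def linear(x, w):
    """Toy dense layer: y_j = sum_i x_i * w[i][j]."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

backbone_w = rand_matrix(DIM, DIM)  # stands in for the Transformer backbone
head_w = rand_matrix(DIM, DIM)      # stands in for the small MLP head

def generate(num_frames):
    frames = []
    context = [0.0] * DIM           # running summary of text + audio so far
    for _ in range(num_frames):
        hidden = linear(context, backbone_w)  # backbone: context -> summary
        frame = linear(hidden, head_w)        # head: summary -> continuous frame
        frames.append(frame)
        # fold the new frame back into the context (toy update rule)
        context = [0.5 * c + 0.5 * f for c, f in zip(context, frame)]
    return frames

audio_frames = generate(4)  # 4 continuous vectors, no discrete tokens involved
```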
Discrete vs Continuous Methods
| Aspect | Discrete-token method | CALM (continuous method) |
|---|---|---|
| Audio representation | Lossy compressed discrete tokens | Continuous latent vectors (VAE-based) |
| How quality improves | Generate more tokens, increasing cost | Avoid the bitrate limit of discrete tokenization |
| Compute efficiency | Cost grows with token count | Paper reports lower compute at similar quality |
| Generation style | Autoregressive token generation | Transformer plus MLP continuous-frame generation |
Note: This comparison summarizes results reported in the CALM paper. Outcomes can vary depending on model, data, and implementation.
Pocket TTS: A Practical CALM Implementation
Pocket TTS is an open-source TTS model that implements research ideas from the CALM paper in a practical form.
Main Characteristics (Based on the Kyutai Labs README)
- Model size: 100M parameters
- CPU-only execution: Runs without a GPU and uses only two CPU cores
- Faster-than-real-time generation: Reported at roughly 6x real time on a MacBook Air M4 CPU, depending on measurement environment
- Very low latency: Around 200ms to the first audio chunk
- Audio streaming: Playback can begin before the full generation is complete
- Voice cloning: Can reproduce a speaker's voice from a short sample. Voice cloning should only be used with consent, and Kyutai Labs explicitly forbids impersonation without consent.
- Long-text handling: Designed without a fixed text-length limit, although generation time increases with input length
- English only for now: The model focuses on English speech synthesis
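The streaming and latency points above can be illustrated with a short sketch. The `synthesize_chunks` generator is a hypothetical stand-in for a streaming TTS API, not the real Pocket TTS interface, and the chunk contents and timings are fake:

```python
# Sketch of streamed playback: start playing as soon as the first chunk
# arrives instead of waiting for the whole utterance to finish.
import time

def synthesize_chunks(text, chunk_ms=80):
    """Hypothetical streaming synthesizer: yields audio chunks as they are ready."""
    for word in text.split():
        time.sleep(0.001)  # stand-in for per-chunk generation time
        yield f"<{chunk_ms}ms audio for {word!r}>"

def play(chunk):
    """Stand-in for handing a chunk to an audio output device."""
    pass

def speak(text):
    first_chunk_latency = None
    start = time.perf_counter()
    count = 0
    for chunk in synthesize_chunks(text):
        if first_chunk_latency is None:
            # Playback can begin here, long before generation finishes:
            # perceived latency is time-to-first-chunk, not total time.
            first_chunk_latency = time.perf_counter() - start
        play(chunk)
        count += 1
    return first_chunk_latency, count

latency, chunks = speak("streaming keeps perceived latency low")
```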
Why a GPU Is Not Required
Pocket TTS has 100M parameters, which is small compared with large language models. Kyutai Labs reports that, because the model is small and uses batch size 1, they did not observe meaningful speed gains from GPU execution.
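A rough calculation shows why the model stays CPU-friendly. The frame rate and CPU throughput below are assumed round numbers for illustration, not measured values:

```python
# Back-of-the-envelope: compute needed per second of generated audio
# for a 100M-parameter model. Frame rate and CPU throughput are assumed.

params = 100e6                  # Pocket TTS: ~100M parameters
flops_per_frame = 2 * params    # ~2 FLOPs per parameter per forward pass
frame_rate = 12.5               # assumed: audio frames per second of speech

flops_per_audio_second = flops_per_frame * frame_rate  # 2.5 GFLOPs
cpu_throughput = 50e9           # assumed: ~50 GFLOP/s on two laptop cores

real_time_factor = cpu_throughput / flops_per_audio_second
print(f"{flops_per_audio_second / 1e9:.1f} GFLOPs per audio second, "
      f"~{real_time_factor:.0f}x real time on CPU")
# At batch size 1 the work per step is tiny, so kernel-launch overhead
# and host-device transfers eat most of what a GPU could otherwise gain.
```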
Browser Execution
Pocket TTS does not officially support browser execution, but the model is small enough that the community has released experimental WebAssembly-based implementations using Rust/Candle, JAX-JS, ONNX Runtime Web, and related tools.
How BrainCheck Applies TTS
BrainCheck uses high-quality TTS technology when generating audio for learning cards.
- Natural pronunciation: Speech closer to real conversation improves the listening experience compared with mechanical voices.
- Low-latency playback: Shorter waiting time keeps the learning flow smooth.
Related paper: Continuous Audio Language Models (CALM) - Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Defossez (2025)
Sources: Pocket TTS GitHub · Kyutai Labs Publications