High-Quality Text-to-Speech (TTS) Technology
This page introduces background technologies that help explain BrainCheck's TTS features. It focuses on recent speech synthesis work such as CALM and Pocket TTS. This does not mean BrainCheck uses these models directly.
Limits of Discrete-Token Speech Synthesis
One common approach in modern speech synthesis converts audio into a sequence of discrete tokens, then uses a language model to generate those tokens.
This creates a fundamental tradeoff. Audio tokens come from a lossy codec, so higher audio quality usually requires a higher token rate. More tokens mean more autoregressive generation steps, which raises compute cost and latency. The system must therefore trade quality against speed.
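To make the tradeoff concrete, here is a back-of-the-envelope calculation. The token rates below are illustrative assumptions, not figures from any specific codec:

```python
# Illustrative arithmetic: how the codec's token rate couples audio
# quality to decoding compute. All rates here are assumed numbers.

def tokens_for_clip(seconds: float, tokens_per_second: int) -> int:
    """Total discrete tokens an autoregressive model must generate."""
    return int(seconds * tokens_per_second)

clip_seconds = 10.0
low_quality_rate = 50    # assumed: low-bitrate codec, 50 tokens/s
high_quality_rate = 200  # assumed: higher-fidelity codec, 200 tokens/s

low = tokens_for_clip(clip_seconds, low_quality_rate)    # 500 tokens
high = tokens_for_clip(clip_seconds, high_quality_rate)  # 2000 tokens

# Each token costs one autoregressive step, so 4x the tokens means
# roughly 4x the decoding compute and latency for the same clip.
print(low, high, high / low)
```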
Continuous Audio Language Models (CALM)
Kyutai Labs proposed CALM (Continuous Audio Language Models) as one way to address this problem.
Core Idea
Instead of turning audio into discrete tokens, CALM directly generates continuous audio representations.
- A large Transformer backbone reads the text and audio context and produces a hidden summary at each step.
- A smaller MLP head uses that summary to predict the next continuous audio frame.
- The model is trained so that consecutive frames connect smoothly and consistently.
Because this avoids the bitrate bottleneck of discrete tokenization, the paper reports that CALM can generate speech with less computation at comparable quality.
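The backbone-plus-head loop above can be sketched in a few lines. This is a toy stand-in, not the paper's architecture: the "networks" are single random linear layers and the context update rule is invented for illustration; the point is only that each step emits a continuous vector rather than a discrete token.

```python
# Toy sketch of a CALM-style generation loop: a large backbone summarizes
# context, a small head predicts the next continuous frame. All weights
# and the context-update rule are illustrative stand-ins.
import random

DIM = 8  # toy hidden/frame dimension
random.seed(0)

def linear(x, w):
    """Toy dense layer: y_j = sum_i x_i * w[i][j]."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

backbone_w = rand_matrix(DIM, DIM)  # stands in for the Transformer backbone
head_w = rand_matrix(DIM, DIM)      # stands in for the small MLP head

def generate(num_frames):
    frames = []
    context = [0.0] * DIM           # running summary of text + audio so far
    for _ in range(num_frames):
        hidden = linear(context, backbone_w)  # backbone: context -> summary
        frame = linear(hidden, head_w)        # head: summary -> continuous frame
        frames.append(frame)
        # fold the new frame back into the context (toy update rule)
        context = [0.5 * c + 0.5 * f for c, f in zip(context, frame)]
    return frames

audio_frames = generate(4)  # 4 continuous vectors, no discrete tokens involved
```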
Discrete vs Continuous Methods
| Aspect | Discrete-token method | CALM (continuous method) |
|---|---|---|
| Audio representation | Lossy compressed discrete tokens | Continuous latent vectors (VAE-based) |
| How quality improves | Generate more tokens, increasing cost | Avoid the bitrate limit of discrete tokenization |
| Compute efficiency | Cost grows with token count | Paper reports lower compute at similar quality |
| Generation style | Autoregressive token generation | Transformer plus MLP continuous-frame generation |
Note: This comparison summarizes results reported in the CALM paper. Outcomes can vary depending on model, data, and implementation.
Pocket TTS: A Practical CALM Implementation
Pocket TTS is an open-source TTS model that implements research ideas from the CALM paper in a practical form.
Main Characteristics (Based on the Kyutai Labs README)
- Model size: 100M parameters
- CPU-only execution: Runs without a GPU and uses only two CPU cores
- Faster-than-real-time generation: Reported at roughly 6x real time on a MacBook Air M4 CPU, depending on measurement environment
- Very low latency: Around 200ms to the first audio chunk
- Audio streaming: Playback can begin before the full generation is complete
- Voice cloning: Can reproduce a speaker's voice from a short sample. Voice cloning should only be used with consent, and Kyutai Labs explicitly forbids impersonation without consent.
- Long-text handling: Designed without a fixed text-length limit, although generation time increases with input length
- English only for now: The model focuses on English speech synthesis
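The streaming and latency points above can be illustrated with a short sketch. The `synthesize_chunks` generator is a hypothetical stand-in for a streaming TTS API, not the real Pocket TTS interface, and the chunk contents and timings are fake:

```python
# Sketch of streamed playback: start playing as soon as the first chunk
# arrives instead of waiting for the whole utterance to finish.
import time

def synthesize_chunks(text, chunk_ms=80):
    """Hypothetical streaming synthesizer: yields audio chunks as they are ready."""
    for word in text.split():
        time.sleep(0.001)  # stand-in for per-chunk generation time
        yield f"<{chunk_ms}ms audio for {word!r}>"

def play(chunk):
    """Stand-in for handing a chunk to an audio output device."""
    pass

def speak(text):
    first_chunk_latency = None
    start = time.perf_counter()
    count = 0
    for chunk in synthesize_chunks(text):
        if first_chunk_latency is None:
            # Playback can begin here, long before generation finishes:
            # perceived latency is time-to-first-chunk, not total time.
            first_chunk_latency = time.perf_counter() - start
        play(chunk)
        count += 1
    return first_chunk_latency, count

latency, chunks = speak("streaming keeps perceived latency low")
```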
Why a GPU Is Not Required
Pocket TTS has 100M parameters, which is small compared with large language models. Kyutai Labs reports that, because the model is small and uses batch size 1, they did not observe meaningful speed gains from GPU execution.
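A rough calculation shows why the model stays CPU-friendly. The frame rate and CPU throughput below are assumed round numbers for illustration, not measured values:

```python
# Back-of-the-envelope: compute needed per second of generated audio
# for a 100M-parameter model. Frame rate and CPU throughput are assumed.

params = 100e6                  # Pocket TTS: ~100M parameters
flops_per_frame = 2 * params    # ~2 FLOPs per parameter per forward pass
frame_rate = 12.5               # assumed: audio frames per second of speech

flops_per_audio_second = flops_per_frame * frame_rate  # 2.5 GFLOPs
cpu_throughput = 50e9           # assumed: ~50 GFLOP/s on two laptop cores

real_time_factor = cpu_throughput / flops_per_audio_second
print(f"{flops_per_audio_second / 1e9:.1f} GFLOPs per audio second, "
      f"~{real_time_factor:.0f}x real time on CPU")
# At batch size 1 the work per step is tiny, so kernel-launch overhead
# and host-device transfers eat most of what a GPU could otherwise gain.
```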
Browser Execution
Pocket TTS does not officially support browser execution, but the model is small enough that the community has released experimental WebAssembly-based implementations using Rust/Candle, JAX-JS, ONNX Runtime Web, and related tools.
How BrainCheck Applies TTS
BrainCheck uses high-quality TTS technology when generating audio for learning cards.
- Natural pronunciation: Speech closer to real conversation improves the listening experience compared with mechanical voices.
- Low-latency playback: Shorter waiting time keeps the learning flow smooth.
Related paper: Continuous Audio Language Models (CALM) - Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Defossez (2025)
Sources: Pocket TTS GitHub · Kyutai Labs Publications