Speech Recognition (STT) Technology
This page introduces background technologies that help explain BrainCheck's speech recognition features. It explains the structure and characteristics of modern speech recognition using OpenAI's Whisper model as a representative example; this does not mean BrainCheck uses Whisper directly.
What Is Speech Recognition?
Speech recognition, or Speech-to-Text (STT), is technology that converts human speech into text.
In language learning, speech recognition plays a central role. When a learner's spoken answer is converted into text, the system can compare it with a target sentence and check how closely they match. Meaningful learning feedback depends on accurate speech recognition.
STT and Speech Evaluation Are Different
Speech recognition answers the question "What did the learner say?" by converting audio into text. Speech or pronunciation evaluation asks "How well was it said?" and deals with how the utterance sounds, including intonation, stress, connected speech, and speaking speed.
Comparing STT output with the correct answer can measure content match, but it cannot fully evaluate pronunciation quality on its own. Pronunciation evaluation may require additional signals such as phoneme-level alignment, speaking-rate analysis, and stress analysis.
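As a minimal illustration of this difference, a content-match score can be computed with Python's standard difflib module. This is a sketch, not BrainCheck's actual implementation; the helper name and example sentences are hypothetical:

```python
import difflib
import re

def content_match(recognized: str, target: str) -> float:
    """Word-level similarity between STT output and the target sentence.

    Returns a ratio in [0, 1]. It measures *what* was said,
    not *how well* it was pronounced.
    """
    def words(text: str) -> list[str]:
        # Lowercase and strip punctuation so "Hello, world!" matches "hello world".
        return re.findall(r"[a-z']+", text.lower())

    return difflib.SequenceMatcher(None, words(recognized), words(target)).ratio()

# A perfect score here says nothing about intonation, stress, or speed:
print(content_match("I have been to London.", "I have been to London"))  # 1.0
print(content_match("I been to London", "I have been to London"))        # ~0.89
```

A learner with flawless pronunciation and one with a heavy accent would receive the same score here whenever the recognizer returns the same words.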
Evaluation results are also affected by microphone quality, background noise, and speaking speed. For that reason, results should be treated as learning feedback rather than an absolute pronunciation score.
Whisper: Speech Recognition with Large-Scale Weak Supervision
Whisper, developed by OpenAI, is a general-purpose speech recognition model trained on 680,000 hours of multilingual audio-text pairs collected from the web. Its code and model weights are released as an open-source project under the MIT license.
Large-Scale Weak Supervision
Traditional speech recognition models are often trained on carefully hand-labeled datasets. Whisper takes a different approach. It uses a very large amount of audio and corresponding text, such as captions and transcripts, collected from the internet. Individual examples may be imperfect, but scale and diversity help compensate for that noise. Because the training data includes many recording environments, microphones, accents, and noise conditions, the model tends to be more robust in real-world settings.
Multitask Model
Whisper performs several speech-processing tasks with a single model:
- Speech recognition: Converts speech into text in the same language
- Speech translation: Translates speech into English text
- Language identification: Detects which language is being spoken
- Voice activity detection support: Helps separate speech from silence or noise for better transcription quality
These tasks are handled by one Transformer-based sequence-to-sequence model. Special tokens tell the model which task to perform, replacing several steps of a traditional speech-processing pipeline with one model.
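A brief sketch of what this looks like with the open-source whisper Python package (the file name is hypothetical; the `task` option selects the special tokens internally):

```python
import whisper

# One model handles several tasks; the `task` option controls the
# special tokens (<|transcribe|> vs <|translate|>) fed to the decoder.
model = whisper.load_model("small")

# Speech recognition: text in the same language as the audio.
recognized = model.transcribe("clip.wav", task="transcribe")

# Speech translation: English text, whatever language was spoken.
translated = model.transcribe("clip.wav", task="translate")

# Language identification runs internally when no language is given;
# the detected language comes back with the result.
print(recognized["language"], recognized["text"], translated["text"])
```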
Zero-Shot Transfer
One of Whisper's notable characteristics is that it performs well across many benchmarks without fine-tuning for each dataset. The paper calls this zero-shot transfer. It reports competitive word error rate and character error rate results against fully supervised methods on several public benchmarks without extra fine-tuning. Performance can still vary significantly by language, noise environment, and speaking style.
Model Sizes and Performance
Whisper provides several model sizes so systems can choose based on use case and runtime environment.
| Model | Parameters | Multilingual | Required VRAM | Relative speed |
|---|---|---|---|---|
| tiny | 39M | Yes | ~1 GB | ~10x |
| base | 74M | Yes | ~1 GB | ~7x |
| small | 244M | Yes | ~2 GB | ~4x |
| medium | 769M | Yes | ~5 GB | ~2x |
| large | 1.55B | Yes | ~10 GB | 1x |
| turbo | 809M | Yes | ~6 GB | ~8x |
These figures summarize the model table in the OpenAI Whisper GitHub README, including approximate VRAM and relative speed for English transcription on an A100 GPU. Real speed can vary by language, speaking pace, and hardware.
The turbo model is an optimized version of large-v3. It greatly improves speed while preserving most transcription accuracy. However, turbo is focused on transcription and is not suitable for translation from non-English speech into English text. For translation, use multilingual models such as tiny, base, small, medium, or large.
English-Only Models
The tiny, base, small, and medium sizes also provide English-only (.en) variants. When the input is only English, these models can outperform multilingual versions, especially at smaller sizes such as tiny.en and base.en. The performance gap narrows as model size increases.
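Put together, the table and the notes above suggest a simple selection rule. The sketch below uses the openai-whisper package; the helper function and its inputs are hypothetical:

```python
import whisper

def pick_model_name(english_only: bool, need_translation: bool) -> str:
    # Hypothetical selection rule based on the table and notes above.
    if need_translation:
        return "medium"   # turbo is transcription-only; translation
                          # needs a general multilingual model
    if english_only:
        return "base.en"  # .en variants can beat same-size multilingual
                          # models when the input is English only
    return "turbo"        # fast multilingual transcription

model = whisper.load_model(pick_model_name(english_only=True, need_translation=False))
```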
Technical Structure
Whisper uses a Transformer-based encoder-decoder architecture.
- Audio preprocessing: Input audio is converted into a log-Mel spectrogram, a representation of sound over frequency and time.
- Encoder: The spectrogram is read and converted into a compact representation of speech features.
- Decoder: Based on the encoder output, the model generates text token by token. Special tokens specify the task, such as recognition or translation, and the language.
Because the model has a limit on the input length it can process at once, long audio is split and processed in 30-second segments.
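These steps map closely onto the low-level API of the open-source whisper package. A sketch, with a hypothetical file name:

```python
import whisper

model = whisper.load_model("base")

# Audio preprocessing: load, then pad or trim to the model's
# fixed 30-second input window.
audio = whisper.load_audio("answer.wav")
audio = whisper.pad_or_trim(audio)

# Convert the waveform into a log-Mel spectrogram (frequency x time).
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Encoder reads the spectrogram; decoder generates text tokens for
# this 30-second segment. model.transcribe() wraps this step and
# slides the window over longer audio.
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```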
How BrainCheck Applies STT
BrainCheck's speaking features are built around speech recognition technology.
- Speech recognition: Converts the learner's spoken English into text.
- Content match checking: Compares recognized text with the target sentence and analyzes omissions, additions, and substitutions (a sketch follows this list).
- Learning feedback: Uses recognition results to provide feedback. Accuracy and response time can vary depending on the environment.
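As a purely illustrative sketch of the content match step, word-level omissions, additions, and substitutions can be derived from difflib opcodes; the function name and sentences are hypothetical, and this is not BrainCheck's actual pipeline:

```python
import difflib

def diff_ops(recognized: list[str], target: list[str]) -> list[str]:
    """Classify word-level differences between STT output and the target."""
    ops = []
    for tag, t1, t2, r1, r2 in difflib.SequenceMatcher(
            None, target, recognized).get_opcodes():
        if tag == "delete":      # in the target, missing from the answer
            ops.append(f"omitted: {' '.join(target[t1:t2])}")
        elif tag == "insert":    # in the answer, not in the target
            ops.append(f"added: {' '.join(recognized[r1:r2])}")
        elif tag == "replace":   # one span spoken in place of another
            ops.append(f"substituted: {' '.join(target[t1:t2])} -> "
                       f"{' '.join(recognized[r1:r2])}")
    return ops

print(diff_ops("I going to school".split(), "I am going to the school".split()))
# ['omitted: am', 'omitted: the']
```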
Related paper: Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever, "Robust Speech Recognition via Large-Scale Weak Supervision" (2022)
Sources: OpenAI Whisper GitHub · OpenAI Whisper Blog