ML-Based Recommendation System
This page introduces the spaced repetition algorithm adopted by BrainCheck and summarizes its performance characteristics.
What Is a Spaced Repetition Algorithm?
A spaced repetition algorithm automatically manages the review schedule for flashcards.
The core idea is simple. Instead of memorizing everything in one intense session, reviews are spread out over time. To do this efficiently, the algorithm models how a learner's memory behaves. It predicts when a learner is likely to forget a specific item and schedules review around that moment.
The more accurately the algorithm predicts memory, the less review time a learner needs to retain material for longer.
Benchmark Dataset
The benchmark data was collected from 10,000 real users.
- Total reviews: about 727 million
- Reviews used for evaluation: about 349.9 million, excluding same-day reviews
- Data source: Hugging Face Datasets (open-spaced-repetition)
Same-day reviews are excluded from evaluation. Repeating a card several times on the same day is not directly related to long-term memory prediction, and including those records can distort an algorithm's real predictive ability.
Records created after manual date changes or under disabled settings were filtered out, and outlier filters were also applied.
Evaluation Method
Time-Series Split
The benchmark uses TimeSeriesSplit from the scikit-learn library. It trains on past learning records and evaluates on future learning results. This prevents the algorithm from seeing future information and measures performance under conditions closer to real use.
Note: TimeSeriesSplit is applied independently per user. Data is not mixed across users, so one user's future records do not leak into another user's training data.
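A minimal sketch of this per-user split, using scikit-learn's actual TimeSeriesSplit; the user ids and review counts here are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical review log: indices into each user's chronologically
# sorted reviews (real data would carry timestamps, grades, etc.).
reviews = {
    "user_a": np.arange(10),
    "user_b": np.arange(8),
}

tscv = TimeSeriesSplit(n_splits=3)

# The split runs independently per user, so one user's future reviews
# never land in another user's training folds.
for user, idx in reviews.items():
    for train_idx, test_idx in tscv.split(idx):
        # Every training index precedes every test index in time.
        assert train_idx.max() < test_idx.min()
```

Because each fold's training indices strictly precede its test indices, the model is always evaluated on reviews that happened after everything it trained on.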
Metrics
Short version: The benchmark measures how accurately an algorithm predicts the probability that a learner will remember an item.
Log Loss
Measures the difference between predicted recall probability and the actual review result, remembered or forgotten. It shows how close the algorithm's probability estimates are to real outcomes. The range is 0 to infinity, and lower is better.
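Log Loss is the mean negative log-likelihood of the observed outcomes. A hand-rolled version (the benchmark's own implementation may differ in details such as clipping):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary review outcomes
    (1 = remembered, 0 = forgotten)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions score low; confident misses score high.
log_loss([1, 0], [0.9, 0.1])  # ≈ 0.105
log_loss([1, 0], [0.1, 0.9])  # ≈ 2.303
```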
RMSE (bins)
Groups predictions and actual outcomes by review interval, review count, lapse count, and similar bins, then computes a weighted root mean squared error across those bins. It evaluates accuracy across many learning situations. The range is 0 to 1, and lower is better.
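A simplified sketch of the binned-RMSE idea, binning only on review interval (the real benchmark also bins on review count, lapse count, and similar features, and its weighting scheme may differ):

```python
import math
from collections import defaultdict

def rmse_bins(intervals, y_true, p_pred):
    """Count-weighted RMSE between mean predicted recall and mean
    actual recall within each bin (here: the rounded review interval)."""
    bins = defaultdict(lambda: [0.0, 0.0, 0])  # sum_pred, sum_true, count
    for t, y, p in zip(intervals, y_true, p_pred):
        b = bins[round(t)]
        b[0] += p; b[1] += y; b[2] += 1
    n = len(y_true)
    return math.sqrt(sum(c * (sp/c - st/c) ** 2
                         for sp, st, c in bins.values()) / n)
```

Averaging within bins before comparing means that an algorithm is rewarded for being well calibrated in each learning situation, not just on average overall.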
AUC (Area Under the ROC Curve)
Measures how well the algorithm separates successful recall from failed recall. The range is 0 to 1, and higher is better. A value of 0.5 corresponds to random guessing, so any useful algorithm should score above 0.5.
Log Loss and RMSE (bins) measure calibration: whether predicted probabilities match actual data.
AUC measures discrimination: whether the model can distinguish between two outcomes.
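Discrimination has a concrete pairwise reading: AUC is the probability that a randomly chosen remembered item received a higher predicted recall probability than a randomly chosen forgotten one. A direct (O(n²), illustration-only) computation:

```python
def auc_pairwise(y_true, p_pred):
    """AUC as the fraction of (remembered, forgotten) pairs where the
    remembered item got the higher predicted probability (ties count half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives 1.0; constant predictions give 0.5.
auc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])  # 1.0
```

Note that a model can rank outcomes well (high AUC) while still being miscalibrated, which is why the benchmark reports both kinds of metric.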
Algorithm Categories
Two-Component and Three-Component Memory Models
The two-component model of long-term memory describes memory state with two independent variables:
- Retrievability (R): The probability of recalling an item
- Stability (S): The half-life of the memory
The three-component model adds Difficulty (D).
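As a sketch of the model's state, taking the document's half-life reading of stability at face value and assuming an exponential forgetting curve (each benchmarked algorithm uses its own curve and update rules; `MemoryState` is a hypothetical name):

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryState:
    stability: float   # S: half-life of the memory, in days
    difficulty: float  # D: added by the three-component model

def retrievability(t_days: float, state: MemoryState) -> float:
    """R: predicted recall probability t_days after the last review,
    assuming exponential decay with half-life `stability`."""
    return 2.0 ** (-t_days / state.stability)
```

In this reading, R equals 1.0 immediately after a review and falls to 0.5 once `stability` days have passed; successful reviews grow S (moderated by D in the three-component model), so intervals lengthen over time.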
Representative algorithms:
- FSRS v1-v4: Versions that evolved from early experiments to official releases based on community feedback
- FSRS-4.5: A version with modest improvements from changing the forgetting-curve shape
- FSRS-5: A version that began using same-day review data for training
- FSRS-6: The latest version, introducing a parameter that optimizes forgetting-curve flatness per user
- HLR (Half-Life Regression): An algorithm proposed by Duolingo
Alternative Memory Models
- DASH: Short for Difficulty, Ability, Study History. It predicts memory from learner ability and study history.
- ACT-R: An activation-based system for declarative memory that explains spacing effects through memory-trace activation.
Neural Network-Based Models
- GRU: A recurrent neural network widely used for time-series prediction
- LSTM: A recurrent neural network with more gating units than GRU
- RWKV: An architecture combining properties of RNNs and Transformers, using rich input features such as full review history, sibling-card information, deck structure, and day of week
Benchmark Results
The table below excerpts representative algorithms and shows evaluation results excluding same-day reviews. Values are averages; the original benchmark also provides 99% confidence intervals.
| Algorithm | Trainable parameters | Log Loss ↓ | RMSE (bins) ↓ | AUC ↑ |
|---|---|---|---|---|
| RWKV-P | 2,762,884 | 0.2773 | 0.0250 | 0.8329 |
| RWKV | 2,762,884 | 0.3193 | 0.0540 | 0.7683 |
| LSTM | 8,869 | 0.3332 | 0.0538 | 0.7329 |
| FSRS-rs | 21 | 0.3443 | 0.0635 | 0.7074 |
| FSRS-6 | 21 | 0.3460 | 0.0653 | 0.7034 |
| FSRS-5 | 19 | 0.3560 | 0.0741 | 0.7011 |
| FSRS-4.5 | 17 | 0.3624 | 0.0764 | 0.6893 |
| DASH | 9 | 0.3682 | 0.0836 | 0.6312 |
| GRU | 39 | 0.3753 | 0.0864 | 0.6683 |
| HLR (Duolingo) | 3 | 0.4694 | 0.1275 | 0.6369 |
| Ebisu v2 | 0 | 0.4989 | 0.1627 | 0.6051 |
| AVG (baseline) | 0 | 0.3945 | 0.1034 | 0.4997 |
Key Takeaways
FSRS-6: High Accuracy with a Small Model
With only 21 trainable parameters, FSRS-6 achieves strong predictive accuracy compared with much larger neural models such as RWKV-P, which has millions of parameters.
- RWKV-P (about 2.76 million parameters) Log Loss: 0.2773
- FSRS-6 (21 parameters) Log Loss: 0.3460
Its accuracy per parameter is remarkably high. Because the model is small, it runs quickly even on mobile devices and can be optimized locally, without a server call.
If RWKV-P Ranks First, Why Use FSRS?
RWKV-P is a large neural model with about 2.76 million parameters. It has the highest accuracy in the benchmark, but training and inference require GPU-level compute, and user-specific fine-tuning is difficult in practice. FSRS-6, by contrast, has only 21 parameters, can personalize quickly from a user's own review history, and runs in real time on ordinary devices.
Comparison with HLR (Duolingo)
Compared with Duolingo's HLR algorithm (Log Loss 0.4694), FSRS-6 (0.3460) achieves about 26% lower Log Loss.
User-Specific Forgetting-Curve Optimization
One of the biggest improvements in FSRS-6 is that it applies a different forgetting-curve shape for each user. Because people do not forget at the same speed, the review schedule can be adjusted to each learner's memory pattern.
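One way to picture a trainable curve shape is a power-law forgetting curve with a per-user decay exponent. This is only an illustrative parameterization, not FSRS-6's exact formula; the scale factor below is chosen so that R is 0.9 when the elapsed time equals the stability:

```python
def retrievability_power(t_days: float, stability: float, decay: float) -> float:
    """Power-law forgetting curve with a per-user shape parameter.
    The factor is derived so R(stability) == 0.9 for any decay, which
    lets decay control flatness without moving that anchor point."""
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * t_days / stability) ** (-decay)
```

Two users can share the same stability yet get different schedules: a smaller decay gives a flatter tail (slower forgetting at long intervals), a larger decay a steeper one.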
How BrainCheck Applies Recommendation
BrainCheck tracks each learner's memory state based on the FSRS algorithm.
- Retrievability prediction: Calculates the probability that each card will be remembered.
- Optimal review timing: Schedules review just before the recall probability falls below the target level.
- Personalized parameter optimization: Continuously adjusts FSRS parameters based on the learner's review history.
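The first two steps can be sketched as follows. This is not BrainCheck's actual implementation; it assumes stability is modeled as a half-life, and the target retention value and function names are hypothetical:

```python
import math

TARGET_RETENTION = 0.9  # hypothetical target recall probability

def days_until_review(stability: float, target: float = TARGET_RETENTION) -> float:
    """Days until predicted recall (half-life model) drops to the target."""
    return -stability * math.log2(target)

def due_cards(cards, today: float, target: float = TARGET_RETENTION):
    """Cards whose predicted recall has fallen to the target or below.
    `cards` maps card id -> (last_review_day, stability)."""
    due = []
    for card_id, (last_review, stability) in cards.items():
        recall = 2.0 ** (-(today - last_review) / stability)
        if recall <= target:
            due.append(card_id)
    return due
```

Scheduling just before recall crosses the target is the design choice behind spaced repetition: reviewing earlier wastes repetitions on material that is still strong, while reviewing later risks a lapse.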
This helps learners review while memory is still in a strong state, reducing unnecessary repetition and improving study efficiency.
Source: open-spaced-repetition/srs-benchmark
Dataset: open-spaced-repetition (Hugging Face Datasets)