ML-Based Recommendation System
This page introduces the spaced repetition algorithm adopted by BrainCheck and summarizes its performance characteristics.
What Is a Spaced Repetition Algorithm?
A spaced repetition algorithm automatically manages the review schedule for flashcards.
The core idea is simple. Instead of memorizing everything in one intense session, reviews are spread out over time. To do this efficiently, the algorithm models how a learner's memory behaves. It predicts when a learner is likely to forget a specific item and schedules review around that moment.
The more accurately the algorithm predicts memory, the less review time a learner needs to retain material for longer.
Benchmark Dataset
The benchmark data was collected from 10,000 real users.
- Total reviews: about 727 million
- Reviews used for evaluation: about 349.9 million, excluding same-day reviews
- Data source: Hugging Face Datasets (open-spaced-repetition)
Same-day reviews are excluded from evaluation. Repeating a card several times on the same day is not directly related to long-term memory prediction, and including those records can distort an algorithm's real predictive ability.
Records created after manual date changes or under disabled settings were filtered out, and outlier filters were also applied.
Evaluation Method
Time-Series Split
The benchmark uses TimeSeriesSplit from the scikit-learn library. It trains on past learning records and evaluates on future learning results. This prevents the algorithm from seeing future information and measures performance under conditions closer to real use.
Note: TimeSeriesSplit is applied independently per user. Data is not mixed across users, so one user's future records do not leak into another user's training data.
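A minimal sketch of this per-user split, using scikit-learn's actual TimeSeriesSplit; the user ids and review counts here are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical review log: indices into each user's chronologically
# sorted reviews (real data would carry timestamps, grades, etc.).
reviews = {
    "user_a": np.arange(10),
    "user_b": np.arange(8),
}

tscv = TimeSeriesSplit(n_splits=3)

# The split runs independently per user, so one user's future reviews
# never land in another user's training folds.
for user, idx in reviews.items():
    for train_idx, test_idx in tscv.split(idx):
        # Every training index precedes every test index in time.
        assert train_idx.max() < test_idx.min()
```

Because each fold's training indices strictly precede its test indices, the model is always evaluated on reviews that happened after everything it trained on.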
Metrics
Short version: The benchmark measures how accurately an algorithm predicts the probability that a learner will remember an item.
Log Loss
Measures the difference between predicted recall probability and the actual review result, remembered or forgotten. It shows how close the algorithm's probability estimates are to real outcomes. The range is 0 to infinity, and lower is better.
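Log Loss is the mean negative log-likelihood of the observed outcomes. A hand-rolled version (the benchmark's own implementation may differ in details such as clipping):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary review outcomes
    (1 = remembered, 0 = forgotten)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions score low; confident misses score high.
log_loss([1, 0], [0.9, 0.1])  # ≈ 0.105
log_loss([1, 0], [0.1, 0.9])  # ≈ 2.303
```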
RMSE (bins)
Groups predictions and actual outcomes by review interval, review count, lapse count, and similar bins, then computes a weighted root mean squared error across those bins. It evaluates accuracy across many learning situations. The range is 0 to 1, and lower is better.
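A simplified sketch of the binned-RMSE idea, binning only on review interval (the real benchmark also bins on review count, lapse count, and similar features, and its weighting scheme may differ):

```python
import math
from collections import defaultdict

def rmse_bins(intervals, y_true, p_pred):
    """Count-weighted RMSE between mean predicted recall and mean
    actual recall within each bin (here: the rounded review interval)."""
    bins = defaultdict(lambda: [0.0, 0.0, 0])  # sum_pred, sum_true, count
    for t, y, p in zip(intervals, y_true, p_pred):
        b = bins[round(t)]
        b[0] += p; b[1] += y; b[2] += 1
    n = len(y_true)
    return math.sqrt(sum(c * (sp/c - st/c) ** 2
                         for sp, st, c in bins.values()) / n)
```

Averaging within bins before comparing means that an algorithm is rewarded for being well calibrated in each learning situation, not just on average overall.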
AUC (Area Under the ROC Curve)
Measures how well the algorithm separates successful recall from failed recall. The range is 0 to 1, and higher is better. A value of 0.5 corresponds to random guessing, so any useful algorithm should score above 0.5.
Log Loss and RMSE (bins) measure calibration: whether predicted probabilities match actual data.
AUC measures discrimination: whether the model can distinguish between two outcomes.
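Discrimination has a concrete pairwise reading: AUC is the probability that a randomly chosen remembered item received a higher predicted recall probability than a randomly chosen forgotten one. A direct (O(n²), illustration-only) computation:

```python
def auc_pairwise(y_true, p_pred):
    """AUC as the fraction of (remembered, forgotten) pairs where the
    remembered item got the higher predicted probability (ties count half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives 1.0; constant predictions give 0.5.
auc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])  # 1.0
```

Note that a model can rank outcomes well (high AUC) while still being miscalibrated, which is why the benchmark reports both kinds of metric.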
Algorithm Categories
Two-Component and Three-Component Memory Models
The two-component model of long-term memory describes memory state with two independent variables:
- Retrievability (R): The probability of recalling an item
- Stability (S): The half-life of the memory
The three-component model adds Difficulty (D).
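As a sketch of the model's state, taking the document's half-life reading of stability at face value and assuming an exponential forgetting curve (each benchmarked algorithm uses its own curve and update rules; `MemoryState` is a hypothetical name):

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryState:
    stability: float   # S: half-life of the memory, in days
    difficulty: float  # D: added by the three-component model

def retrievability(t_days: float, state: MemoryState) -> float:
    """R: predicted recall probability t_days after the last review,
    assuming exponential decay with half-life `stability`."""
    return 2.0 ** (-t_days / state.stability)
```

In this reading, R equals 1.0 immediately after a review and falls to 0.5 once `stability` days have passed; successful reviews grow S (moderated by D in the three-component model), so intervals lengthen over time.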
Representative algorithms:
- FSRS v1-v4: Versions that evolved from early experiments to official releases based on community feedback
- FSRS-4.5: A version with modest improvements from changing the forgetting-curve shape
- FSRS-5: A version that began using same-day review data for training
- FSRS-6: The latest version, introducing a parameter that optimizes forgetting-curve flatness per user
- HLR (Half-Life Regression): An algorithm proposed by Duolingo
Alternative Memory Models
- DASH: Short for Difficulty, Ability, Study History. It predicts memory from learner ability and study history.
- ACT-R: An activation-based system for declarative memory that explains spacing effects through memory-trace activation.
Neural Network-Based Models
- GRU: A recurrent neural network widely used for time-series prediction
- LSTM: A recurrent neural network with more gating units than GRU
- RWKV: An architecture combining properties of RNNs and Transformers, using rich input features such as full review history, sibling-card information, deck structure, and day of week
Benchmark Results
The table below excerpts representative algorithms and shows evaluation results excluding same-day reviews. Values are averages; the original benchmark also provides 99% confidence intervals.
| Algorithm | Trainable parameters | Log Loss ↓ | RMSE (bins) ↓ | AUC ↑ |
|---|---|---|---|---|
| RWKV-P | 2,762,884 | 0.2773 | 0.0250 | 0.8329 |
| RWKV | 2,762,884 | 0.3193 | 0.0540 | 0.7683 |
| LSTM | 8,869 | 0.3332 | 0.0538 | 0.7329 |
| FSRS-rs | 21 | 0.3443 | 0.0635 | 0.7074 |
| FSRS-6 | 21 | 0.3460 | 0.0653 | 0.7034 |
| FSRS-5 | 19 | 0.3560 | 0.0741 | 0.7011 |
| FSRS-4.5 | 17 | 0.3624 | 0.0764 | 0.6893 |
| DASH | 9 | 0.3682 | 0.0836 | 0.6312 |
| GRU | 39 | 0.3753 | 0.0864 | 0.6683 |
| HLR (Duolingo) | 3 | 0.4694 | 0.1275 | 0.6369 |
| Ebisu v2 | 0 | 0.4989 | 0.1627 | 0.6051 |
| AVG (baseline) | 0 | 0.3945 | 0.1034 | 0.4997 |
Key Takeaways
FSRS-6: High Accuracy with a Small Model
With only 21 trainable parameters, FSRS-6 achieves strong predictive accuracy compared with much larger neural models such as RWKV-P, which has millions of parameters.
- RWKV-P (about 2.76 million parameters) Log Loss: 0.2773
- FSRS-6 (21 parameters) Log Loss: 0.3460
Its accuracy per parameter is remarkably high. Because the model is small, it runs quickly even on mobile devices and can be optimized locally, without a server call.
If RWKV-P Ranks First, Why Use FSRS?
RWKV-P is a large neural model with about 2.76 million parameters. It has the highest accuracy in the benchmark, but training and inference require GPU-level compute, and user-specific fine-tuning is difficult in practice. FSRS-6, by contrast, has only 21 parameters, can personalize quickly from a user's own review history, and runs in real time on ordinary devices.
Comparison with HLR (Duolingo)
Compared with Duolingo's HLR algorithm (Log Loss 0.4694), FSRS-6 (0.3460) achieves about 26% lower Log Loss.
User-Specific Forgetting-Curve Optimization
One of the biggest improvements in FSRS-6 is that it applies a different forgetting-curve shape for each user. Because people do not forget at the same speed, the review schedule can be adjusted to each learner's memory pattern.
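One way to picture a trainable curve shape is a power-law forgetting curve with a per-user decay exponent. This is only an illustrative parameterization, not FSRS-6's exact formula; the scale factor below is chosen so that R is 0.9 when the elapsed time equals the stability:

```python
def retrievability_power(t_days: float, stability: float, decay: float) -> float:
    """Power-law forgetting curve with a per-user shape parameter.
    The factor is derived so R(stability) == 0.9 for any decay, which
    lets decay control flatness without moving that anchor point."""
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * t_days / stability) ** (-decay)
```

Two users can share the same stability yet get different schedules: a smaller decay gives a flatter tail (slower forgetting at long intervals), a larger decay a steeper one.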
How BrainCheck Applies Recommendation
BrainCheck tracks each learner's memory state based on the FSRS algorithm.
- Retrievability prediction: Calculates the probability that each card will be remembered.
- Optimal review timing: Schedules review just before the recall probability falls below the target level.
- Personalized parameter optimization: Continuously adjusts FSRS parameters based on the learner's review history.
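The first two steps can be sketched as follows. This is not BrainCheck's actual implementation; it assumes stability is modeled as a half-life, and the target retention value and function names are hypothetical:

```python
import math

TARGET_RETENTION = 0.9  # hypothetical target recall probability

def days_until_review(stability: float, target: float = TARGET_RETENTION) -> float:
    """Days until predicted recall (half-life model) drops to the target."""
    return -stability * math.log2(target)

def due_cards(cards, today: float, target: float = TARGET_RETENTION):
    """Cards whose predicted recall has fallen to the target or below.
    `cards` maps card id -> (last_review_day, stability)."""
    due = []
    for card_id, (last_review, stability) in cards.items():
        recall = 2.0 ** (-(today - last_review) / stability)
        if recall <= target:
            due.append(card_id)
    return due
```

Scheduling just before recall crosses the target is the design choice behind spaced repetition: reviewing earlier wastes repetitions on material that is still strong, while reviewing later risks a lapse.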
This helps learners review while memory is still in a strong state, reducing unnecessary repetition and improving study efficiency.
Source: open-spaced-repetition/srs-benchmark
Dataset: open-spaced-repetition (Hugging Face Datasets)