In ML training, loss drops fast. The model memorizes the training data. But test accuracy stays low — it hasn't learned the actual pattern, just the answers. Training looks perfect. Generalization is zero.
Then, much later, test accuracy suddenly jumps. The train-test gap closes in a few hundred epochs after being stuck for thousands. That phase transition is grokking — the model found the real structure underneath the data. It went from memorizing to understanding.
This matters because it means training should continue long past the point where training loss plateaus. Most practitioners stop too early. The understanding hasn't happened yet.
Measure the train-test gap (training accuracy minus test accuracy) over a sliding window. Compute the convergence rate — how fast the gap is closing. When the convergence rate exceeds 3.3 sigma above the baseline rate, grokking is happening.
The threshold is statistical, not arbitrary. 3.3 sigma corresponds to a 0.05% false positive rate under Gaussian assumptions. In practice, on modular arithmetic tasks, detection fires at 3.6 sigma — well above threshold.
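The 0.05% figure follows from the one-sided Gaussian tail beyond 3.3 sigma, which can be checked with the standard library (a quick sketch, not part of the detector itself):

```python
import math

def gaussian_tail(z: float) -> float:
    """One-sided tail probability of a standard normal beyond z,
    via the complementary error function: Q(z) = 0.5 * erfc(z / sqrt(2))."""
    return 0.5 * math.erfc(z / math.sqrt(2))

fp_rate = gaussian_tail(3.3)
print(f"{fp_rate:.5%}")  # roughly 0.048%, i.e. ~0.05%
```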
Every grokking trajectory passes through four phases. The detector classifies each epoch into one of them based on train accuracy, test accuracy, and gap dynamics.
| Phase | Train | Test | Meaning |
|---|---|---|---|
| LEARNING | Rising | Rising | Both improving together |
| MEMORIZED | ~100% | Low | Train perfect, test stuck |
| GROKKING | ~100% | Jumping | Gap closing >3.3 sigma |
| UNDERSTOOD | ~100% | ~100% | Gap <10%, pattern learned |
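The classification rules in the table can be sketched as a small function. The cutoffs (99% train accuracy, 10% gap, 3.3 sigma) come straight from the table; `rate_sigma` is assumed to be the gap-convergence z-score described earlier, and this is a sketch rather than the detector's exact implementation:

```python
def classify_phase(train_acc: float, test_acc: float, rate_sigma: float) -> str:
    """Map one epoch's metrics to a grokking phase (illustrative sketch)."""
    gap = train_acc - test_acc
    if train_acc < 0.99:     # train still improving
        return "LEARNING"
    if gap < 0.10:           # gap closed: pattern learned
        return "UNDERSTOOD"
    if rate_sigma > 3.3:     # gap closing anomalously fast
        return "GROKKING"
    return "MEMORIZED"       # train perfect, test stuck
```

For example, `classify_phase(1.00, 0.30, 0.5)` returns `"MEMORIZED"`: train is perfect, test is stuck, and the gap isn't moving.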
Here's a simulated run on a modular arithmetic task, classified epoch by epoch:

| Epoch | Train Acc | Test Acc | Gap | Phase |
|---|---|---|---|---|
| 500 | 62% | 58% | 4% | LEARNING |
| 1000 | 99% | 31% | 68% | MEMORIZED |
| 2000 | 100% | 28% | 72% | MEMORIZED |
| 3000 | 100% | 30% | 70% | MEMORIZED |
| 3200 | 100% | 55% | 45% | GROKKING |
| 4000 | 100% | 97% | 3% | UNDERSTOOD |
| 4500 | 100% | 99% | 1% | UNDERSTOOD |
Note the 2,000 epochs of apparent stagnation between epochs 1000 and 3000: the model sat fully memorized, learning nothing visible. Then the phase transition hit, and test accuracy jumped from 30% to 97% in ~1,000 epochs.
The same pattern appears in human learning. Homework measures memorization (training accuracy). Quizzes measure understanding (test accuracy). The gap between them tells you whether a student has grokked the material or just memorized the procedures.
| Student | Homework | Quiz | Gap | Pattern |
|---|---|---|---|---|
| Student A | 92% | 88% | 4% | Grokked at session 9, quiz jumped 30% |
| Student B | 95% | 50% | 45% | Memorizing — homework perfect, quiz failing |
| Student C | 78% | 75% | 3% | Steady learner — train tracks test |
Student B is the dangerous case. High homework scores mask zero understanding. The gap is the diagnostic, not the grade.
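Applying the gap diagnostic mechanically to the table above is a one-liner. The 15-point flag threshold here is an illustrative assumption, not a calibrated value:

```python
# (homework, quiz) scores from the table above
students = {"A": (0.92, 0.88), "B": (0.95, 0.50), "C": (0.78, 0.75)}

GAP_FLAG = 0.15  # assumed cutoff, for illustration only

# Flag anyone whose homework-quiz gap suggests memorization without understanding.
flagged = [name for name, (hw, quiz) in students.items() if hw - quiz > GAP_FLAG]
print(flagged)  # ['B']
```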
The detector slides a window over the train-test gap time series. The convergence rate is the negative slope of the gap within the window, and sigma is that rate's z-score against the pre-grok baseline. The 3.3 sigma threshold keeps the false positive rate near 0.05% under Gaussian assumptions.
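A minimal sketch of that pipeline, assuming a gap series sampled once per epoch; the window size and baseline region are free parameters, and the function names are mine:

```python
import numpy as np

def grok_sigmas(gap, window=200, baseline_end=1000):
    """z-score of the gap convergence rate at each epoch past the baseline.

    gap          : 1-D array, train-test gap per epoch
    window       : trailing window for the slope fit
    baseline_end : epochs before this index form the pre-grok baseline
    """
    def rate(i):
        # convergence rate = negative slope of the gap over the trailing window
        w = gap[i - window:i]
        return -np.polyfit(np.arange(window), w, 1)[0]

    baseline = np.array([rate(i) for i in range(window, baseline_end)])
    mu, sd = baseline.mean(), baseline.std()
    return {i: (rate(i) - mu) / sd for i in range(baseline_end, len(gap))}
```

Grokking is declared at the first epoch whose sigma exceeds 3.3. As noted in the limits, a short window makes that first-crossing rule prone to noise-driven false triggers.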
Honest limits:

- Tested on simulated data, not real training runs. The simulation was designed with a clean transition at epoch 3000, so of course a transition detector finds a designed transition.
- Under high noise (sigma=0.15), the detector false-triggers at epoch 209, long before the real grok at 3000.
- Window size matters: window=20 triggers early false positives, while window=200 correctly detects at epoch 3192.
- The student performance analog uses 12 data points, too few for statistical significance.
- Real grokking (Power et al., 2022) is noisier, with oscillating test accuracy; the mechanistic signal (weight norm growth, Nanda et al.) is more reliable than this accuracy-based proxy.
- This demonstrates the detection principle, not a production system.
Computed on Mac Mini M4, 35W.