
Grokking Detection

Phase transition from memorization to understanding. 3.6 sigma detection. 932K epochs/sec.

What Grokking Is

In ML training, loss drops fast. The model memorizes the training data. But test accuracy stays low — it hasn't learned the actual pattern, just the answers. Training looks perfect. Generalization is zero.

Then, much later, test accuracy suddenly jumps. The train-test gap closes in a few hundred epochs after being stuck for thousands. That phase transition is grokking — the model found the real structure underneath the data. It went from memorizing to understanding.

This matters because it means training should continue long past the point where training loss plateaus. Most practitioners stop too early. The understanding hasn't happened yet.

Detection Method

Measure the train-test gap (training accuracy minus test accuracy) over a sliding window. Compute the convergence rate — how fast the gap is closing. When the convergence rate exceeds 3.3 sigma above the baseline rate, grokking is happening.

The threshold is statistical, not arbitrary. 3.3 sigma corresponds to a 0.05% false positive rate under Gaussian assumptions. In practice, on modular arithmetic tasks, detection fires at 3.6 sigma — well above threshold.
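The threshold rule can be sketched in a few lines. This is a minimal illustration, not the detector's actual API; the function name and the baseline values are made up for the example:

```python
import statistics

def sigma_at(rate, baseline_rates):
    """Standard score of the current convergence rate against the
    pre-grok baseline: how many baseline standard deviations above
    the baseline mean the current rate sits."""
    mu = statistics.mean(baseline_rates)
    sd = statistics.stdev(baseline_rates)
    return (rate - mu) / sd

# During memorization the gap barely moves, so baseline rates hover near zero.
baseline = [0.001, -0.002, 0.0, 0.002, -0.001, 0.001]

# A sudden gap-closing rate stands far above the baseline noise.
print(sigma_at(0.05, baseline) > 3.3)  # True: detection fires
```

Detection fires when this score exceeds 3.3; in the mod-97 runs above it reaches 3.6.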

Modular Arithmetic mod 97

True grok onset: epoch 3000
Detected at: epoch 3073
Detection latency: 73 epochs
Sigma at detection: 3.6
Throughput: 932,150 epochs/sec

Four Phases

Every grokking trajectory passes through four phases. The detector classifies each epoch into one of them based on train accuracy, test accuracy, and gap dynamics.

Phase Definitions

Phase        Train    Test      Meaning
LEARNING     Rising   Rising    Both improving together
MEMORIZED    ~100%    Low       Train perfect, test stuck
GROKKING     ~100%    Jumping   Gap closing >3.3 sigma
UNDERSTOOD   ~100%    ~100%     Gap <10%, pattern learned
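The classification above maps directly to a decision rule. A minimal sketch, with illustrative cutoffs (0.99 as "perfect" train accuracy, 0.10 as the understood-gap threshold) standing in for whatever the detector actually uses:

```python
def classify_phase(train_acc, test_acc, sigma):
    """Classify one epoch into the four grokking phases from
    train accuracy, test accuracy, and the gap's sigma score."""
    gap = train_acc - test_acc
    if train_acc < 0.99:
        return "LEARNING"      # both still improving together
    if gap < 0.10:
        return "UNDERSTOOD"    # train and test both near-perfect
    if sigma > 3.3:
        return "GROKKING"      # gap closing faster than baseline noise
    return "MEMORIZED"         # train perfect, test stuck

print(classify_phase(1.00, 0.28, 0.2))  # MEMORIZED
print(classify_phase(1.00, 0.55, 3.6))  # GROKKING
```

The order of checks matters: a perfect-train epoch is UNDERSTOOD only once the gap has actually closed, GROKKING only while the gap is closing significantly, and MEMORIZED otherwise.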

Phase Classification — mod 97

Epoch   Train Acc   Test Acc   Gap   Phase
500     62%         58%        4%    LEARNING
1000    99%         31%        68%   MEMORIZED
2000    100%        28%        72%   MEMORIZED
3000    100%        30%        70%   MEMORIZED
3200    100%        55%        45%   GROKKING
4000    100%        97%        3%    UNDERSTOOD
4500    100%        99%        1%    UNDERSTOOD

2,000 epochs of apparent stagnation between epoch 1000 and 3000. The model was memorized, not learning. Then the phase transition hit and test accuracy jumped from 30% to 97% in ~1,000 epochs.

Student Performance Analog

The same pattern appears in human learning. Homework measures memorization (training accuracy). Quizzes measure understanding (test accuracy). The gap between them tells you whether a student has grokked the material or just memorized the procedures.

Three Students

Student     Homework   Quiz   Gap   Pattern
Student A   92%        88%    4%    Grokked at session 9, quiz jumped 30%
Student B   95%        50%    45%   Memorizing — homework perfect, quiz failing
Student C   78%        75%    3%    Steady learner — train tracks test

Student B is the dangerous case. High homework scores mask zero understanding. The gap is the diagnostic, not the grade.
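The diagnostic is the same gap rule, applied to scores instead of accuracies. A small sketch, with an illustrative 10% gap threshold:

```python
def diagnose(homework, quiz, gap_threshold=0.10):
    """Flag students whose homework scores mask missing understanding."""
    gap = homework - quiz
    if gap > gap_threshold:
        return "memorizing"  # high homework, low quiz: the dangerous case
    return "tracking"        # quiz tracks homework: learning or grokked

print(diagnose(0.95, 0.50))  # memorizing (Student B)
print(diagnose(0.78, 0.75))  # tracking (Student C)
```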

Method

Sliding window over the train-test gap time series. Convergence rate is the negative slope of the gap within the window. Sigma is computed against the pre-grok baseline. The 3.3 sigma threshold controls the false positive rate (~0.05% under Gaussian assumptions).
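The slope-within-window step can be sketched with a linear fit. This assumes NumPy and a per-epoch gap series; the function name is illustrative, and the default window of 200 matches the setting that detects correctly in the noise experiments below:

```python
import numpy as np

def convergence_rate(gap_series, window=200):
    """Convergence rate at the latest epoch: the negative of the
    fitted slope of the train-test gap over the trailing window."""
    w = np.asarray(gap_series[-window:])
    epochs = np.arange(len(w))
    slope = np.polyfit(epochs, w, 1)[0]  # linear fit: gap ≈ slope·epoch + b
    return -slope                        # positive when the gap is closing

# A gap shrinking from 0.7 toward 0.0 yields a positive convergence rate.
gap = np.linspace(0.7, 0.0, 200)
print(convergence_rate(gap) > 0)  # True
```

A larger window averages out per-epoch noise, which is why window=20 false-triggers while window=200 does not: the fitted slope over 20 noisy points is dominated by the noise term.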

MM10P honest limits: tested on simulated data, not real training runs. The simulation was designed with a clean transition at epoch 3000 — of course a transition detector finds a designed transition. Under high noise (sigma=0.15), the detector false-triggers at epoch 209 (before the real grok at 3000). Window size matters: window=20 triggers early false positives; window=200 correctly detects at epoch 3192. The student performance analog uses 12 data points — too few for statistical significance. Real grokking (Power et al. 2022) is noisier with oscillating test accuracy. The mechanistic signal (weight norm growth, Nanda et al.) is more reliable than our accuracy-based proxy. This demonstrates the detection principle, not a production system.

Computed on Mac Mini M4, 35W.

GUMPResearch · [email protected]