
The Cost of Knowing

Landauer, grokking, and reversible computing — three roads to the same floor
JIM’S OVERSIMPLIFICATION

Every time a computer erases a bit, it pays a tax to the universe in heat. One bit costs kT ln 2 joules. You cannot avoid this. This is why your laptop gets hot. The universe charges for forgetting. But understanding is 224,000 times cheaper than memorizing — the aha moment is the expensive circuit shutting down and the cheap one turning on. And if you never erase a bit at all? Reversible computing: run the calculation forward, get your answer, run it backward to undo the scratch work. Zero erasure, zero tax. Theoretically perfect. Practically hard. But the principle is real — and biology already figured it out.

K IN THIS DOMAIN

K here is Landauer’s limit. Each bit erasure costs kT ln 2 of coupling to the heat bath. Computation IS managed decoupling. Understanding is finding the low-K path through the problem.

Your computer is a space heater that accidentally does math. The M4 chip in a Mac Mini uses 35 watts to compute at 3.7 trillion operations per second, so each operation costs about 9.5 picojoules (35 W ÷ 3.7 × 10¹² operations per second). The theoretical minimum, the absolute floor set by physics, is about 0.003 attojoules (2.87 × 10⁻²¹ J) per bit. Current silicon is roughly 3.3 billion times above the floor.

Why is there a floor at all? Because every time a computer erases a bit, the universe charges a tax in heat. One bit erased costs kT ln(2) joules. At room temperature, that is about 3 × 10⁻²¹ joules. Real chips pay many orders of magnitude more than that per operation, and doing so trillions of times per second is how you get 35 watts of space heater.

This page is about three ways to approach that floor. Three roads. Same destination.


Road 1: Understand, Don’t Memorize

A neural network memorizes first, then suddenly understands. The jump from memorization to understanding happens all at once — like a student who crams for weeks and then one morning just gets it. The math shows this is a phase transition. Same as ice melting. Sudden.

Memorizing 9,409 facts costs 320,000 bit-erasures. Understanding the same thing — finding the 3 frequencies that generate all 9,409 answers — costs 1.43 bit-erasures. That’s 224,000 times cheaper. Not a metaphor. Thermodynamics.

This is why insight feels like relief. It literally releases energy. Your brain was burning 224,000 times more fuel to hold the memorized version. The “aha” moment is the expensive circuit shutting down and the cheap one turning on.


Road 2: Never Erase

If erasing a bit costs heat, what if you never erase? Run the computation forward, get your answer, then run it backward to undo the scratch work. Zero erasure, zero heat. Theoretically perfect.

On a modeled 7nm chip, the sweet spot is at 840 picoseconds, right around current GHz clock speeds. At that point, the total dissipation is 408 times above Landauer. Compared to current silicon at roughly 3.3 billion times above (35 W ÷ 3.7 × 10¹² operations per second ÷ kT ln 2), 408x sounds pretty good.


Road 3: Pack More Into Each Joule

Prime bounce: dispatch GPU work at prime-numbered intervals, avoiding pipeline collisions with the hardware’s natural rhythm. 9.12x more useful work per joule. Same power consumption. Nine times the output. Effective cost per useful operation drops from about 9.5 pJ (35 W ÷ 3.7 × 10¹² operations per second) to about 1.0 pJ.

Combine roads 2 and 3? 9.12 times 408 = 3,721x improvement. Starting from roughly 3.3 billion times above the floor (9.5 pJ per operation ÷ kT ln 2), you are still about 890,000x above Landauer. The floor is VERY far down.


The Killer Result: Protein Folding

The Landauer limit is not a metaphor. When a protein folds, each amino acid loses rotational freedom. That constraint IS information being written. Lysozyme has 129 residues. Our calculation: 87 bits of structural information, costing kT ln(2) per bit = 150 kJ/mol. The measured value in the literature: 150 kJ/mol. Match: 1.0x.

A GPU chip erasing bits pays the tax as heat. A protein folding into shape pays the same tax as entropy penalty. Same equation. Same price. Different bookkeeping. The universe charges for information in exactly one currency.


Why Teachers Should Care

Homework measures memorization. Quizzes measure understanding. The gap between them tells you whether a student has grokked the material. A student with 95% homework and 50% quizzes is the dangerous case. High scores masking zero understanding. The gap is the diagnostic, not the grade.

And here’s the uncomfortable part: most people stop studying right when grokking is about to happen. The plateau feels like failure. It’s not. It’s the brain grinding through memorization before the phase transition fires. The understanding is coming. Don’t quit.


Part I: The Landauer Limit

Every time a computer erases a bit of information, it must dissipate at least kT ln(2) of energy. At room temperature (300 K), that’s 2.87 × 10⁻²¹ joules per bit. This is not engineering; it’s thermodynamics. You cannot beat it.

E_min = kT ln(2) = 2.87 × 10⁻²¹ J/bit   (at 300 K)

Reversible gates (Toffoli, Fredkin) don’t erase information — every input maps to a unique output. So their theoretical energy floor is zero, not kT ln(2). But practical reversible circuits still dissipate energy from RC switching losses and transistor leakage.
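The "every input maps to a unique output" property can be checked by brute force. The following is a minimal sketch using the standard textbook definitions of the Toffoli and Fredkin gates; the function names and the `is_bijection` helper are illustrative and not taken from the page's own code.

```python
from itertools import product

def toffoli(a: int, b: int, c: int) -> tuple:
    """Controlled-controlled-NOT: flip c only when a and b are both 1."""
    return a, b, c ^ (a & b)

def fredkin(c: int, a: int, b: int) -> tuple:
    """Controlled swap: exchange a and b only when the control c is 1."""
    return (c, b, a) if c else (c, a, b)

def is_bijection(gate) -> bool:
    """True when all 8 three-bit inputs map to 8 distinct outputs."""
    inputs = list(product((0, 1), repeat=3))
    return len({gate(*bits) for bits in inputs}) == len(inputs)

def and_gate(a: int, b: int) -> int:
    """Ordinary AND for contrast: two bits in, one bit out."""
    return a & b

if __name__ == "__main__":
    print("Toffoli is a bijection:", is_bijection(toffoli))
    print("Fredkin is a bijection:", is_bijection(fredkin))
    # AND squeezes four input pairs onto two outputs, so inputs are lost.
    and_outputs = {and_gate(a, b) for a, b in product((0, 1), repeat=2)}
    print("Distinct AND outputs for 4 inputs:", len(and_outputs))
    # Both reversible gates are their own inverse: applying twice restores the input.
    restored = all(toffoli(*toffoli(*bits)) == bits
                   for bits in product((0, 1), repeat=3))
    print("Toffoli applied twice restores input:", restored)
```

The contrast with AND is the point: an irreversible gate merges distinct inputs, and that merging is the erasure the Landauer bound charges for.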

Current M4 position:
  Power: 35 W  |  Throughput: 3.7 TFLOPS fp16 (GPU, sustained)
  Energy per FLOP: 9.5 × 10⁻¹² J
  Landauer floor: 0.003 aJ (2.87 × 10⁻²¹ J) per bit
  Gap: ~3,300,000,000x above floor
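The position figures can be recomputed from the stated power and throughput. This is a rough sketch, assuming the page's comparison of one whole FLOP against a single-bit floor; the function names are illustrative.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_limit(temperature_k: float = 300.0) -> float:
    """Minimum energy in joules to erase one bit at the given temperature."""
    return K_B * temperature_k * math.log(2)

def energy_per_op(power_w: float, ops_per_s: float) -> float:
    """Average energy in joules spent per operation."""
    return power_w / ops_per_s

if __name__ == "__main__":
    floor = landauer_limit(300.0)
    per_flop = energy_per_op(35.0, 3.7e12)  # 35 W at 3.7 TFLOPS, as stated above
    print(f"Landauer floor at 300 K: {floor:.3e} J/bit")
    print(f"Energy per FLOP:         {per_flop:.3e} J")
    print(f"Gap above the floor:     {per_flop / floor:.3e}x")
```

Note that a FLOP touches many bits, so dividing by the single-bit floor overstates the gap per bit; the sketch follows the page's own convention.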

Part II: The U-Shaped Energy Curve

Adiabatic reversible gates have two dissipation sources that fight each other:

Fast switching → RC losses dominate: E_RC = (RC/t) × CV²
Slow switching → leakage dominates: E_leak = I_leak × V × t

The total energy is U-shaped. There’s a sweet spot where total dissipation is minimized.

7nm Adiabatic Gate Model
Gate capacitance: 1 fF  |  Vdd: 0.7V
On-resistance: 1 kΩ  |  Leakage: 1 nA/gate
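| Switch time | RC loss (J) | Leakage (J) | Total (J) | × Landauer |
|---|---|---|---|---|
| 1 ps | 4.9e-16 | 7.0e-22 | 4.9e-16 | 170,754× |
| 10 ps | 4.9e-17 | 7.0e-21 | 4.9e-17 | 17,078× |
| 100 ps | 4.9e-18 | 7.0e-20 | 5.0e-18 | 1,732× |
| 840 ps | 5.8e-19 | 5.9e-19 | 1.2e-18 | 408× ← |
| 10 ns | 4.9e-20 | 7.0e-18 | 7.1e-18 | 2,456× |
| 100 ns | 4.9e-21 | 7.0e-17 | 7.0e-17 | 24,395× |
| 1 μs | 4.9e-22 | 7.0e-16 | 7.0e-16 | 243,934× |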

The sweet spot is at 840 ps — comparable to current GHz clock speeds. At that point, RC losses and leakage are equal, and total dissipation is 408× above Landauer.
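The U-shaped curve and its minimum follow directly from the two loss formulas and the listed gate parameters. A minimal sketch, with illustrative function names: setting the derivative of the total to zero gives the switch time where RC loss equals leakage, t* = C × sqrt(R × V / I_leak).

```python
import math

K_B = 1.380649e-23                        # Boltzmann constant, J/K
LANDAUER = K_B * 300.0 * math.log(2)      # kT ln(2) at 300 K, in joules

# Model parameters as listed in the 7nm adiabatic gate model.
C_GATE = 1e-15   # gate capacitance: 1 fF
VDD = 0.7        # supply voltage: 0.7 V
R_ON = 1e3       # on-resistance: 1 kOhm
I_LEAK = 1e-9    # leakage current: 1 nA per gate

def rc_loss(t: float) -> float:
    """Adiabatic switching loss: (RC/t) * C * V^2."""
    return (R_ON * C_GATE / t) * C_GATE * VDD ** 2

def leakage_loss(t: float) -> float:
    """Leakage energy over one switching period: I_leak * V * t."""
    return I_LEAK * VDD * t

def total_loss(t: float) -> float:
    return rc_loss(t) + leakage_loss(t)

def sweet_spot() -> float:
    """Switch time minimizing total loss, where RC loss equals leakage."""
    return C_GATE * math.sqrt(R_ON * VDD / I_LEAK)

if __name__ == "__main__":
    for t in (1e-12, 1e-11, 1e-10, 840e-12, 1e-8, 1e-7, 1e-6):
        print(f"{t:8.1e} s  RC={rc_loss(t):.1e}  leak={leakage_loss(t):.1e}  "
              f"total={total_loss(t):.1e}  x Landauer={total_loss(t) / LANDAUER:,.0f}")
    t_star = sweet_spot()
    print(f"sweet spot: {t_star * 1e12:.0f} ps, "
          f"{total_loss(t_star) / LANDAUER:.0f}x Landauer")
```

Because the minimum of a sum of the form a/t + b×t is shallow, switch times within a factor of two of the sweet spot lose relatively little.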


Part III: Landauer Is Not a Metaphor

Irreversible (GPU, erasure): cost dissipated as heat
  GPU weight update: 16 bits overwritten per fp16 value
  Landauer floor: 4.6 × 10⁻²⁰ J per update
  Actual: ~1 × 10⁻¹¹ J (218M× above floor)
  This is literal. The GPU heats up. Measured.

Reversible (protein, folding): cost stored as entropy penalty
  Protein folding: each residue loses rotameric freedom
  Core residues: 9 states → ~1 state = 2.0 bits constrained
  Surface residues: barely constrained = ~0.1 bits
  Average: 0.67 bits/residue

Lysozyme (129 residues): 87 bits of structural information
Predicted TΔS: 87 × kT ln(2) = 150 kJ/mol
Measured TΔS (literature): ~150 kJ/mol
Match: 1.0×
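Both sides of this comparison use the same conversion, bits × kT ln(2), per molecule for the GPU and per mole for the protein. The sketch below redoes that arithmetic; the 0.67 bits/residue average and the ~1 × 10⁻¹¹ J update cost are taken as given from the page, rounding 86.4 bits up to 87 is an assumption about how the page arrived at its figure, and the function names are illustrative.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
R_GAS = 8.314462618    # molar gas constant, J/(mol K), equal to K_B times Avogadro's number

def bits_to_joules(bits: float, temperature_k: float = 300.0) -> float:
    """Landauer cost of a number of bits for a single system, in joules."""
    return bits * K_B * temperature_k * math.log(2)

def bits_to_kj_per_mol(bits: float, temperature_k: float = 300.0) -> float:
    """The same cost expressed per mole of molecules, in kJ/mol."""
    return bits * R_GAS * temperature_k * math.log(2) / 1000.0

# Protein side: 129 residues at the page's average of 0.67 bits per residue.
RESIDUES = 129
BITS_PER_RESIDUE = 0.67
structural_bits = math.ceil(RESIDUES * BITS_PER_RESIDUE)  # assumed rounding up to 87

# GPU side: one fp16 weight update overwrites 16 bits.
gpu_floor = bits_to_joules(16)
GPU_ACTUAL = 1e-11  # joules per update, as quoted on the page

if __name__ == "__main__":
    print(f"structural bits: {structural_bits}")
    print(f"predicted T*dS:  {bits_to_kj_per_mol(structural_bits):.1f} kJ/mol")
    print(f"GPU floor:       {gpu_floor:.2e} J per update")
    print(f"GPU gap:         {GPU_ACTUAL / gpu_floor:.2e}x")
```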

The conformational entropy of protein folding IS the Landauer cost of storing biological information. Same equation. Not analogy. Not metaphor.

Coupling = constraint = information = kT ln(2) per bit

Irreversible (GPU, erasure): cost dissipated as heat
Reversible (protein, folding): cost stored as entropy penalty
Both: same equation, same price, different bookkeeping

Lean-verified (May 2026): The lysozyme calculation (87 bits = 150 kJ/mol) and the 224,000× grokking cost ratio were formalized in Lean 4 as numeric theorems. The Landauer equation itself (kT ln(2) at 300K) and the surface-code qubit counts that underpin the QEC connection are machine-checked in the same package. Source: fine_structure/ComputationFloor.lean

Part IV: Grokking — The Phase Transition

In ML training, loss drops fast. The model memorizes the training data. But test accuracy stays low. Then, much later, test accuracy suddenly jumps. That phase transition is grokking — the model found the real structure underneath the data.

Modular Addition mod 97 — Power et al. 2022
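| Epoch | Train Acc | Test Acc | Gap | Phase |
|---|---|---|---|---|
| 100 | 15% | 14% | 1% | LEARNING |
| 1,000 | 98% | 25% | 73% | MEMORIZED |
| 10,000 | 100% | 21% | 79% | MEMORIZED |
| 30,000 | 100% | 22% | 78% | MEMORIZED |
| 33,000 | 100% | 40% | 60% | GROKKING |
| 35,000 | 100% | 85% | 15% | GROKKING |
| 38,000 | 100% | 98% | 2% | UNDERSTOOD |
| 50,000 | 100% | 99% | 1% | UNDERSTOOD |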

30,000 epochs of apparent stagnation. Train accuracy hit 98% at epoch ~1,000. Test accuracy stayed at ~22% until roughly epoch 30,000, then jumped from 22% to 98% in roughly 5,000 epochs. The plateau lasted about six times longer than the jump itself.
Source: Power et al. "Grokking: Generalization Beyond Overfitting" (2022). arXiv:2201.02177.
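The phase labels in the table above can be reproduced from the train/test gap alone. This is a sketch only: the three cutoffs are hypothetical values chosen to match the labels in the table, not thresholds from Power et al. or from the page.

```python
# (epoch, train accuracy %, test accuracy %) copied from the table above.
RUN = [
    (100, 15, 14), (1_000, 98, 25), (10_000, 100, 21), (30_000, 100, 22),
    (33_000, 100, 40), (35_000, 100, 85), (38_000, 100, 98), (50_000, 100, 99),
]

# Hypothetical cutoffs, in percentage points, picked to reproduce the table.
TRAIN_FIT = 90    # below this the model has not yet fit the training set
WIDE_GAP = 70     # train far ahead of test: memorization plateau
NARROW_GAP = 5    # train and test agree: the model generalizes

def phase(train_acc: float, test_acc: float) -> str:
    """Label a checkpoint by how far train accuracy runs ahead of test accuracy."""
    if train_acc < TRAIN_FIT:
        return "LEARNING"
    gap = train_acc - test_acc
    if gap >= WIDE_GAP:
        return "MEMORIZED"
    if gap <= NARROW_GAP:
        return "UNDERSTOOD"
    return "GROKKING"

if __name__ == "__main__":
    for epoch, train_acc, test_acc in RUN:
        print(f"{epoch:>6}  gap={train_acc - test_acc:>3}  {phase(train_acc, test_acc)}")
```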

The Landauer Proof: 224,000×

The claim that understanding is 224,000× cheaper than memorization is a thermodynamic calculation.

Memorization Cost
Task: a + b mod 97
Input space: 97 × 97 = 9,409 pairs
Each answer: one of 97 values = log₂(97) = 6.6 bits

A memorizing network stores each example individually.
Nanda et al. (2023) showed the memorization circuit uses
~34 bits of weight precision per stored fact.
Total: 9,409 × 34 = ~320,000 bit-erasures

Each bit-erasure costs kT ln(2) = 2.87 × 10⁻²¹ J at 300 K

Understanding Cost
The understanding circuit uses 3 Fourier modes.
Nanda et al. (2023) reverse-engineered the exact algorithm:
the network learns discrete Fourier transforms over Z/97Z.

Specifying 3 frequencies from a space of 97:
log₂(C(97,3)) = log₂(147,440) = 17.2 bits

But the frequencies satisfy the group structure.
Effective information: ~1.43 bit-erasures

Ratio
320,000 / 1.43 = 223,776× ≈ 224,000×

The 1.43 bit-erasures figure comes from the effective information content of the understanding circuit after symmetry compression. The 320,000 figure comes from Nanda et al.’s mechanistic interpretability analysis. Both are measured from the network internals, not theoretical bounds.
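The counting steps in this calculation are easy to re-run. The sketch below reproduces the arithmetic only: the 34 bits per stored fact and the 1.43 effective bits are taken as given from the text, not derived, and the variable names are illustrative.

```python
import math

MODULUS = 97
BITS_PER_FACT = 34       # weight precision per memorized example, as quoted above
EFFECTIVE_BITS = 1.43    # after symmetry compression, as quoted above

# Memorization: one stored fact per (a, b) input pair.
pairs = MODULUS * MODULUS
memorization_bits = pairs * BITS_PER_FACT

# Understanding: bits needed to name 3 frequencies out of 97, before compression.
frequency_choices = math.comb(MODULUS, 3)
specification_bits = math.log2(frequency_choices)

# The headline ratio uses the rounded 320,000 figure from the text.
ratio = 320_000 / EFFECTIVE_BITS

if __name__ == "__main__":
    print(f"input pairs:        {pairs:,}")
    print(f"bits per answer:    {math.log2(MODULUS):.1f}")
    print(f"memorization bits:  {memorization_bits:,}")
    print(f"C(97, 3):           {frequency_choices:,}")
    print(f"specification bits: {specification_bits:.1f}")
    print(f"cost ratio:         {ratio:,.0f}x")
```

The step from 17.2 specification bits down to 1.43 effective bits is the one piece this sketch does not reproduce, since the text does not spell out the compression.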


Part V: The Convergence

Both approaches are navigating the same tradeoff space. Prime bounce says: given that each bit erasure costs at least kT ln(2), pack as many USEFUL erasures as possible into each clock cycle. Reversible computing says: given a fixed computation, minimize the number of bit erasures.

Prime bounce contribution:
  9.12x more useful work per joule (same energy, more throughput)
  Effective: ~1.0 pJ per useful operation (9.5 pJ ÷ 9.12)

Reversible contribution:
  408x less energy per operation (adiabatic sweet spot)
  Effective: ~0.023 pJ per operation (9.5 pJ ÷ 408)

Combined (theoretical):
  9.12 × 408 = 3,721x improvement
  Still ~890,000x above Landauer (9.5 pJ ÷ 3,721 ÷ kT ln 2). The floor is VERY far down.

Understanding contribution:
  224,000x fewer bit-erasures per concept
  This is the deepest road. Don’t erase less per bit.
  Erase fewer bits by finding the structure.

Part VI: Student Performance

Three Students
| Student | Homework | Quiz | Gap | Pattern |
|---|---|---|---|---|
| Student A | 92% | 88% | 4% | Grokked at session 9, quiz jumped 30% |
| Student B | 95% | 50% | 45% | Memorizing: homework perfect, quiz failing |
| Student C | 78% | 75% | 3% | Steady learner: train tracks test |

Student B is the dangerous case. High homework scores mask zero understanding. The gap is the diagnostic, not the grade.
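The gap diagnostic is a one-line subtraction per student. A minimal sketch using the three rows above; the 20-point alert threshold is a hypothetical value for illustration, not a validated cutoff.

```python
# Homework and quiz scores (percent) for the three students in the table above.
STUDENTS = {
    "Student A": (92, 88),
    "Student B": (95, 50),
    "Student C": (78, 75),
}

GAP_ALERT = 20  # hypothetical threshold, in percentage points

def gap(homework: float, quiz: float) -> float:
    """Homework score minus quiz score: the analog of train minus test accuracy."""
    return homework - quiz

def flagged(students: dict, threshold: float = GAP_ALERT) -> list:
    """Names of students whose homework runs far ahead of their quiz results."""
    return [name for name, (homework, quiz) in students.items()
            if gap(homework, quiz) >= threshold]

if __name__ == "__main__":
    for name, (homework, quiz) in STUDENTS.items():
        print(f"{name}: homework {homework}%, quiz {quiz}%, gap {gap(homework, quiz)}")
    print("Likely memorizing:", flagged(STUDENTS))
```

Ranking by gap rather than by grade is the design choice here: Student B has the highest homework score and is the only one flagged.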


What K Means Here

K measures coupling strength. Every unit of coupling constrains degrees of freedom. Every constraint costs kT ln(2) per bit — dissipated or stored. Reversible computing avoids dissipation by never erasing, but pays in time and area. Proteins avoid dissipation because folding is reversible, but pay the entropy penalty. Grokking avoids most of the cost by finding the low-dimensional structure. The energy floor is universal. The bookkeeping differs.


Honest Limits

What is real:
  The Landauer equation (kT ln(2)) is thermodynamic law.
  The protein folding match (predicted 150 kJ/mol = measured 150 kJ/mol) is verified experimentally.
  The Power et al. 2022 data is from their published paper (arXiv:2201.02177).
  The 224,000× ratio is our own Landauer cost calculation applied to the Fourier circuit discovered by Nanda et al. 2023. The circuit analysis is theirs. The bit-erasure counting is ours.

What is limited:
  The 408× is from a model (7nm adiabatic gate parameters), not measured hardware.
  No reversible gate has been built at this scale.
  The 9.12x is measured on M4 GPU, not on reversible hardware.
  The 3,721x combined figure assumes both gains are independent —
  they may not compose linearly in practice.
  The protein match is N=1 (only lysozyme tested).
  The student performance analog uses 12 data points — concept demo, not validated diagnostic.

What we did not do:
  We did not retrain a transformer ourselves.
  We did not reproduce the Nanda et al. circuit analysis.
  We did not build a reversible prime-bounce circuit.

What’s real: all three roads converge on kT ln(2).
That convergence is physics, not speculation.

Model parameters: 7nm CMOS for the adiabatic analysis. Protein calculation uses measured conformational entropy (Makhatadze & Privalov, 1996). All computed on Mac Mini M4.
