
Logical Qubits SIMULATED

Surface code error correction · Mac Mini M4 · Stim + PyMatching · K-weighted MWPM
Two-Millisecond Choir

Quantum computers are noisy. Every qubit has a ~0.1–1% chance of flipping per operation. A calculation that needs a million operations on a million qubits fails completely — the errors compound faster than the computation.

Error correction solves this by encoding one logical qubit in many physical ones. If one physical qubit flips, the others vote it down. The logical qubit survives. The more physical qubits you use, the more errors you can survive, and below a threshold error rate the improvement is exponential.

What we built

A distance-3/5/7 surface code simulator running on a $499 Mac Mini. We used Stim (Google's stabilizer circuit simulator) and PyMatching (a minimum-weight perfect matching decoder) to bracket the threshold: the physical error rate below which adding more qubits always helps.

The result

Threshold: 0.7–0.9% physical error rate. Below this, every step up in code distance exponentially suppresses the logical error rate. At 0.1% physical error (achievable on current hardware):

d=3 (9 data qubits): 0.054% logical error rate
d=5 (25 data qubits): 0.012% logical error rate, 4.5× better
d=7 (49 data qubits): 0.001% logical error rate, 54× better than d=3

Each distance step multiplies suppression. That's the miracle of quantum error correction — you trade physical qubits for logical reliability and the trade gets exponentially better as you scale.

The GUMP contribution OPERATIONAL, NOT NOVEL

Standard MWPM assumes all qubits are equally reliable. They're not. Real hardware has qubit-to-qubit variation — some have longer T2 times, some are noisier. K-weighted MWPM uses gump.k_measure() to assign coupling strength to each qubit, then weights the matching graph accordingly.

Honest verdict (2026-05-30, FakeKolkata three-way test, 100k shots per arm): K-weighted MWPM is statistically indistinguishable from error-rate-weighted MWPM (Higgott 2022 + Google surface code papers) when both are derived from the same calibration data. We do not beat the prior art on this hardware snapshot at d=3.

What survives: K-weighted MWPM matches the explicit-calibration decoder using only K (coherence + inverse-error + coupling-graph term). This means a GUMP-derived K proxy can stand in for a full per-gate calibration table when one isn't dumped every hour — useful for operational settings with partial observables, not a novel decoder claim.

The same K that describes music, brain synchronization, and protein folding produces a usable noise proxy for quantum decoders. That's a real operational story; it's not a new decoder.

Surface Code Architecture

The rotated surface code encodes 1 logical qubit from d² physical data qubits plus (d²-1) ancilla qubits for syndrome measurement. Distance d corrects up to ⌊(d-1)/2⌋ errors per round.

d=3: 9 data + 8 ancilla = 17 total · corrects weight-1 errors
d=5: 25 data + 24 ancilla = 49 total · corrects weight-2 errors
d=7: 49 data + 48 ancilla = 97 total · corrects weight-3 errors
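The counts above follow directly from d. A minimal sketch that reproduces them; the function name is ours for illustration and is not part of Stim or gump:

```python
def rotated_surface_code_layout(d: int) -> dict:
    """Qubit counts for a distance-d rotated surface code (d odd, d >= 3)."""
    if d < 3 or d % 2 == 0:
        raise ValueError("distance must be an odd integer >= 3")
    data = d * d          # d^2 data qubits
    ancilla = d * d - 1   # d^2 - 1 syndrome ancillas
    return {
        "data": data,
        "ancilla": ancilla,
        "total": data + ancilla,
        "correctable_weight": (d - 1) // 2,  # floor((d-1)/2)
    }


for d in (3, 5, 7):
    print(d, rotated_surface_code_layout(d))
```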

Threshold Curve — Measured Data

100,000 shots per point. Depolarizing noise model. Stim circuit: surface_code:rotated_memory_z, rounds=d.

| p_phys | d=3 (9q) | d=5 (25q) | d=7 (49q) | verdict |
|---|---|---|---|---|
| 0.10% | 0.054% | 0.012% | 0.001% | ✓ suppressed |
| 0.30% | 0.422% | 0.145% | 0.044% | ✓ suppressed |
| 0.50% | 1.066% | 0.741% | 0.389% | ✓ suppressed |
| 0.70% | 1.935% | 1.814% | 1.528% | ✓ suppressed |
| 0.90% | 3.128% | 3.550% | 3.673% | ✗ above threshold |
| 1.00% | 3.735% | 4.864% | 5.339% | ✗ above threshold |
| 3.00% | 22.35% | 37.64% | 46.47% | ✗ above threshold |

Threshold: between 0.7% and 0.9%. Current superconducting hardware (IBM, Google) operates at ~0.1–0.3% — solidly below threshold.
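The bracket can be read off the measured table mechanically. A small sketch, using the rows above verbatim; the helper name is ours, and the check ignores statistical error bars, so it is only as fine-grained as the sampled p values:

```python
# Rows copied from the table above: p_phys (%) -> logical error rate (%) at d=3, 5, 7.
MEASURED = {
    0.10: (0.054, 0.012, 0.001),
    0.30: (0.422, 0.145, 0.044),
    0.50: (1.066, 0.741, 0.389),
    0.70: (1.935, 1.814, 1.528),
    0.90: (3.128, 3.550, 3.673),
    1.00: (3.735, 4.864, 5.339),
    3.00: (22.35, 37.64, 46.47),
}


def bracket_threshold(table):
    """Return (largest p where every step up in d helps, first p where it does not)."""
    below = None
    for p in sorted(table):
        rates = table[p]
        suppressed = all(a > b for a, b in zip(rates, rates[1:]))
        if not suppressed:
            return below, p
        below = p
    return below, None  # never crossed within the sampled range


print(bracket_threshold(MEASURED))
```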

Exponential Scaling

Below threshold, logical error rate scales as p_L ~ (p/p_th)^((d+1)/2):

| Distance | Scaling exponent | At p=0.3% | Suppression vs d=3 |
|---|---|---|---|
| d=3 | p² | 0.422% | 1× (baseline) |
| d=5 | p³ | 0.145% | 2.9× |
| d=7 | p⁴ | 0.044% | 9.6× |
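As a worked check of the scaling law against the measured numbers at p=0.3%, the sketch below computes the observed suppression and the per-step factor the leading-order model predicts. The value p_th = 0.8% is an assumption (the midpoint of the measured 0.7–0.9% bracket), and the model omits any prefactor:

```python
def model_logical_rate(p: float, p_th: float, d: int) -> float:
    """Leading-order scaling p_L ~ (p / p_th) ** ((d + 1) / 2), prefactor omitted."""
    return (p / p_th) ** ((d + 1) / 2)


# Measured logical error rates (percent) at p = 0.3%, from the table above.
measured_at_0p3 = {3: 0.422, 5: 0.145, 7: 0.044}

for d in (5, 7):
    ratio = measured_at_0p3[3] / measured_at_0p3[d]
    print(f"d={d}: {ratio:.1f}x better than d=3 (measured)")

# Assumption: p_th = 0.8%. Each step d -> d+2 raises the exponent by one,
# so the model predicts a suppression factor of p_th / p per step.
p, p_th = 0.003, 0.008
step = model_logical_rate(p, p_th, 3) / model_logical_rate(p, p_th, 5)
print(f"model suppression per distance step: {step:.2f}x")
```

The measured per-step factors (about 2.9× and 3.3×) sit in the same range as the model's p_th / p, which is what below-threshold operation looks like.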

K-Weighted MWPM — three-way comparison TESTED 2026-05-30

Standard MWPM assigns uniform edge weights w = -log(p/(1-p)). K-weighted MWPM adjusts per qubit using a GUMP-derived noise proxy:

    from math import log
    from gump import k_measure

    # Each qubit has a coherence signal from its oscillation history
    km = k_measure(qubit_signal)   # → {K: 0.82, R: 0.79, ...}
    k_norm = km['K'] / 1.868       # normalize to K*

    # Low K → inflate error probability → lower edge weight, so the matcher
    # treats an error on this qubit as more likely
    k_factor = 1.0 + (1.0 - k_norm) * 4.0
    p_adjusted = p_base * k_factor
    w_adjusted = -log(p_adjusted / (1 - p_adjusted))

We ran the head-to-head comparison: three decoders (uniform MWPM, error-rate-weighted MWPM, K-weighted MWPM) on FakeKolkata real per-qubit calibration data at 100,000 shots per arm. Wilson 95% CIs computed for every point.

| Decoder | Source of weights | vs error-rate-weighted |
|---|---|---|
| Uniform MWPM | none (all weights equal) | loses to both, baseline strawman |
| Error-rate-weighted MWPM | explicit per-gate calibration | n/a (prior art, Higgott 2022) |
| K-weighted MWPM | coherence + inverse-error + coupling-graph proxy | statistically indistinguishable (matches, slightly worse at highest p) |

Verdict: K-weighted is not a novel decoder beating prior art. It is a different routing to the same operating point as the calibrated decoder, using a derived proxy instead of explicit calibration tables. That's an operational claim, not a research claim.
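For readers who want to reproduce the interval arithmetic, a minimal Wilson score interval in plain Python follows. The failure counts in the example are hypothetical, chosen only to show two overlapping intervals at 100,000 shots; interval overlap is a coarse criterion, and the source does not state which test was used to call the arms indistinguishable:

```python
import math


def wilson_interval(failures: int, shots: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if shots <= 0:
        raise ValueError("shots must be positive")
    p_hat = failures / shots
    denom = 1 + z * z / shots
    centre = (p_hat + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / shots + z * z / (4 * shots * shots)) / denom
    return centre - half, centre + half


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical logical-failure counts out of 100,000 shots, for illustration only.
calibrated = wilson_interval(1210, 100_000)
k_weighted = wilson_interval(1248, 100_000)
print(calibrated, k_weighted, intervals_overlap(calibrated, k_weighted))
```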

Stack

Stim v1.16 — exact quantum circuit simulation, detector error models
PyMatching v2.4 — minimum weight perfect matching decoder
quantum_server_metal — Apple Metal GPU backend, /circuit endpoint accepts H+CNOT JSON sequences, returns full state vector
gump v0.9.6 — K-measure coupling analysis
Hardware: Mac Mini M4, 16GB unified memory, $499
Cost: $0. Software only. Runs in 3 minutes.

Threshold curve — 2026-05-30 fresh stim run VERIFIED

d=3 r=3 and d=5 r=5 under uniform depolarizing noise (Stim's rotated_memory_z with after_clifford_depolarization, after_reset_flip_probability, before_measure_flip_probability, before_round_data_depolarization all set to p). 10,000 shots per point, pymatching MWPM on Stim's auto-generated detector error model.

d=3 vs d=5 threshold curve

| p_phys | d=3 r=3 | d=5 r=5 | encoding helps? |
|---|---|---|---|
| 0.05% | 4.0 × 10^−4 | 0 / 10k | yes, d=5 < 10^−4 |
| 0.10% | 4.0 × 10^−4 | 1.0 × 10^−4 | yes, d=5 is 4× better |
| 0.20% | 2.8 × 10^−3 | 1.8 × 10^−3 | yes |
| 0.50% | 1.88 × 10^−2 | 1.56 × 10^−2 | yes, marginal |
| 0.80% | 4.23 × 10^−2 | 4.52 × 10^−2 | crossing (threshold) |
| 1.00% | 5.57 × 10^−2 | 8.44 × 10^−2 | no, above threshold |
| 2.00% | 1.70 × 10^−1 | 3.01 × 10^−1 | no |

Threshold sits between 0.5% and 0.8% physical error rate for this circuit-level depolarizing noise model, which overlaps the 0.7–0.9% bracket from the 100,000-shot sweep above. At low p, d=5 suppresses below d=3 (the encoding genuinely helps); the curves cross near threshold; above it, d=5 fails harder. This is the canonical surface code threshold plot, reproduced from first principles in 30 seconds of Mac Mini compute. Artifact: tools/threshold_stim_2026-05-30.json.

Metal GPU Validation VALIDATED 2026-05-30

The Metal backend exposes an arbitrary-circuit endpoint that accepts a JSON sequence of H, CNOT, and MEASURE operations and returns measurement bits plus the post-collapse state vector. Validated in two passes against Stim ground truth.

Pass 1 — unitary core (H + CNOT only)

Test circuit: Stim's canonical d=3 rotated_memory_z, one round, unitary core only (H + CNOT, no measurement)
Stim reference: complex128 state vector, length 2^8 = 256 amplitudes on 8 compact qubits
Metal output: base64-encoded float32 state vector via POST /circuit, 3.7 ms wall time
Fidelity: |⟨stim|metal⟩|^2 = 0.99999976   (1 − 2.4 × 10^−7; consistent with fp32 × 12 gates)
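The fidelity check itself is a few lines once the payload is decoded. The sketch below assumes the base64 payload holds little-endian float32 values interleaved as (real, imaginary) pairs; the source does not specify the layout, so treat that as a guess. A 2-qubit Bell state stands in for the 256-amplitude vector:

```python
import base64
import math
import struct


def decode_state(b64: str) -> list:
    """Decode base64 float32 data into complex amplitudes.

    Assumption: little-endian float32, interleaved as (re, im) pairs.
    """
    raw = base64.b64decode(b64)
    floats = struct.unpack(f"<{len(raw) // 4}f", raw)
    return [complex(re, im) for re, im in zip(floats[0::2], floats[1::2])]


def fidelity(reference, candidate) -> float:
    """|<reference|candidate>|^2 for two normalized state vectors."""
    overlap = sum(a.conjugate() * b for a, b in zip(reference, candidate))
    return abs(overlap) ** 2


# Toy round trip: encode a Bell state as float32, decode it, compare with the
# double-precision original. The small deficit from 1 is float32 rounding.
amp = 1 / math.sqrt(2)
bell = [complex(amp, 0.0), 0j, 0j, complex(amp, 0.0)]
flat = [part for c in bell for part in (c.real, c.imag)]
payload = base64.b64encode(struct.pack("<8f", *flat)).decode("ascii")

print(fidelity(bell, decode_state(payload)))
```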

Pass 2 — full syndrome extraction with mid-circuit measurement

Mid-circuit Z-basis measurement + state collapse was added to the GPU kernel (host-side projection after each MEASURE op, in-place renormalization). We then drive Stim's canonical d=3 r=1 circuit (17 qubits, 8 H + 24 CNOT + 17 MEASURE ops including 8 measure-and-reset for syndrome ancillas) through both Stim's compiled sampler and the Metal /circuit endpoint at 200 shots each, comparing the resulting 17-bit measurement strings.

Deterministic-bit agreement: All 4 deterministic measurement positions (the Z-stabilizer outcomes — deterministically 0 for the |0⟩L Z-memory state in round 1) match Stim's value bit-for-bit on every shot.
Random-bit distribution: All 13 random positions (4 X-stabilizer outcomes + 9 final data measurements) have Metal sample means within 3σ of Stim's at 200 shots.
Throughput: 9.09 ms per shot on Apple M4 GPU; 200 shots in 1.8 seconds.
Verdict: PASS — Metal /circuit faithfully reproduces Stim's canonical d=3 syndrome extraction including mid-circuit measurement and collapse.
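One way to make "within 3σ" concrete for a single measurement position is a two-sample check on the counts of 1s. The sketch uses the pooled standard error of the difference of two Bernoulli means; the source does not say exactly how σ was defined, so this is one reasonable reading, and the counts in the example are hypothetical:

```python
import math


def within_n_sigma(ones_a: int, ones_b: int, shots_a: int, shots_b: int, n: float = 3.0) -> bool:
    """Do two Bernoulli sample means agree within n pooled standard errors?"""
    mean_a, mean_b = ones_a / shots_a, ones_b / shots_b
    pooled = (ones_a + ones_b) / (shots_a + shots_b)
    sigma = math.sqrt(pooled * (1 - pooled) * (1 / shots_a + 1 / shots_b))
    if sigma == 0.0:
        # Both samples are all-0 or all-1 (a deterministic bit): require exact agreement.
        return mean_a == mean_b
    return abs(mean_a - mean_b) <= n * sigma


# Hypothetical counts of 1s at one random position, 200 shots per simulator.
print(within_n_sigma(97, 108, 200, 200))   # close means
print(within_n_sigma(60, 140, 200, 200))   # clearly different means
```

At 200 shots a fair-coin position has σ of about 0.05 on the difference, so this check only catches gross distribution errors; tighter agreement needs more shots.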

This is the milestone the day was aiming at. The GPU path now does the full thing: H + staggered CNOT ladders + MEASURE + collapse, validated bit-for-bit on the deterministic measurements and distribution-for-distribution on the random ones, against Stim as the independent oracle. Artifact: tools/d3_full_syndrome_validation_2026-05-30.json.

Stateful GPU Sessions — real-time per-round execution VALIDATED 2026-05-30

The hybrid loop below (one HTTP call = entire circuit) cleared the first bar. The next bar is the architecture that real quantum computers run on: per-round measurement → decoder → correction → next round, where the state lives on the device and only classical bits cross the wire. We extended quantum_server_metal with three new endpoints — /session_start, /session_run, /session_end — so the GPU state vector persists across calls without ever serializing back to the client.

d=3 r=3, 5 trials: real-time per-round 34.8 ± 8.2 ms  vs  batch single-shot 30.6 ± 1.2 ms  (overhead: +4.2 ms, ~1.4 ms/round HTTP cost)
d=3 r=5, 3 trials: real-time per-round 55.3 ± 11.6 ms  vs  batch single-shot 46.3 ± 1.8 ms  (overhead: +9.0 ms, ~1.8 ms/round)
State correctness: detectors fired = 0 in BOTH paths every trial; net logical flips = 0 in BOTH paths every trial. The session-based real-time loop sees the same physics as the batch path.
Per-round GPU time: ~7-10 ms at d=3 (gates + measurement + collapse).

The architectural breakthrough: state never leaves the GPU between rounds. An earlier version that shipped the 1.4 MB base64 state vector back-and-forth per round paid ~25 ms per round in marshaling — making real-time three times slower than batch and producing a working-but-expensive proof of concept. The session-ID approach drops the cost to ~1-2 ms per round (pure HTTP overhead, no state transport), bringing real-time within striking distance of batch on the wall clock.

The decoder fits inside the per-round budget: pymatching MWPM runs in <1 ms for d=3, so closing the loop with per-round decode + Pauli correction feedback adds ~3 ms total at r=3 (well below the per-round GPU time). The foundation is now there for actual real-time decoder feedback on Apple Silicon at research scale: per-round GPU work + per-round decode + correction injection back into the persistent GPU state, all under ~15 ms per round.

Architecture validated: stateful sessions, per-round execution, correct physics. Artifacts: tools/wall4_session_d3r3.json, tools/wall4_session_d3r5.json.
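The session pattern can be sketched from the client side as below. Only the three endpoint paths come from the text above; the JSON field names (`n_qubits`, `session_id`, `ops`, `bits`) are hypothetical placeholders, since the actual request and response schema of quantum_server_metal is not shown here. The transport is injected so the loop can be exercised without a server:

```python
import json
import urllib.request


def http_post(base_url: str):
    """Return a post(path, payload) callable that sends JSON over HTTP."""
    def post(path, payload):
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    return post


def run_rounds(post, n_qubits: int, round_ops: list, n_rounds: int) -> list:
    """Run n_rounds against one persistent session; only classical bits come back.

    Hypothetical payload fields: n_qubits, session_id, ops, bits.
    """
    session_id = post("/session_start", {"n_qubits": n_qubits})["session_id"]
    history = []
    try:
        for _ in range(n_rounds):
            reply = post("/session_run", {"session_id": session_id, "ops": round_ops})
            history.append(reply["bits"])  # the state vector itself never leaves the server
    finally:
        # Always release the device-side state, even if a round fails.
        post("/session_end", {"session_id": session_id})
    return history
```

Against a live server this would be called as `run_rounds(http_post(base_url), 17, ops, 3)`, with `base_url` and the op encoding supplied by whoever runs the server.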

Full closed loop — per-round decode + Pauli correction feedback VALIDATED 2026-05-30

The next layer: actually close the loop. After each round's syndrome bits arrive, parse them into detector events using Stim's detector formulas, run pymatching MWPM on accumulated detectors, compute the delta vs the previous round's correction, and inject the new Pauli corrections (X/Z on data qubits) back through the same live GPU session before the next round begins. This is the architecture a real-time-decoded quantum computer runs.
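The two bookkeeping steps in that loop, turning syndrome bits into detector events and computing the correction delta, reduce to XORs. This sketch covers only the common case where a detector compares the same ancilla in consecutive rounds; Stim's DETECTOR instructions can reference arbitrary sets of measurements, and the function names are ours:

```python
def round_detectors(previous_bits, current_bits):
    """Detector events for one round: XOR of each ancilla bit with its previous value.

    For the first round, pass the expected outcomes (all zeros for the
    deterministic stabilizers) as previous_bits.
    """
    return [a ^ b for a, b in zip(previous_bits, current_bits)]


def correction_delta(previous_correction, new_correction):
    """Indices of data qubits whose correction bit changed since the last round.

    Only these qubits need a Pauli applied before the next round begins.
    """
    return [
        q
        for q, (old, new) in enumerate(zip(previous_correction, new_correction))
        if old != new
    ]


# Toy values: one ancilla changes between rounds, and the decoder's accumulated
# X-correction gains a flip on data qubit 4.
print(round_detectors([0, 0, 0, 0], [0, 1, 0, 0]))
print(correction_delta([0] * 9, [0, 0, 0, 0, 1, 0, 0, 0, 0]))
```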

| Pipeline | d=3 r=3 wall time | d=3 r=5 wall time | Per-round decode | Per-round feedback |
|---|---|---|---|---|
| Real-time (stateful, decode at end) | 35.5 ± 10.3 ms | 53.5 ± 9.3 ms | | |
| Closed-loop (per-round decode + feedback) | 31.2 ± 1.4 ms | 47.1 ± 2.0 ms | 22-33 µs | 16-20 µs |
| Batch reference (single-shot) | 29.9 ± 1.0 ms | 44.9 ± 1.6 ms | | |
| Δ closed-loop vs batch | +1.3 ms total | +2.2 ms total | | |

What this shows: the full architecture (stateful GPU session + per-round measurement + per-round MWPM decode + per-round Pauli correction injected back into the live GPU state) adds 1.3 ms total at r=3 and 2.2 ms total at r=5 versus the single-shot batch reference. Per-round decode plus feedback costs roughly 40-50 microseconds combined (22-33 µs + 16-20 µs), more than two orders of magnitude below the ~7-10 ms per-round GPU time. Surface code rounds on real hardware run at microseconds to milliseconds; this software loop fits comfortably inside millisecond-scale rounds but not microsecond-scale ones.

The decoder is computing detectors per-round (from Stim's parsed DETECTOR formulas, walking the bit history as it arrives), running MWPM on the partial detector array, computing the correction delta vs prior round, and injecting X/Z ops back through the live session — all while the GPU state for the next round is being prepared. Pymatching's MWPM on the d=3 graph runs in tens of microseconds; the Pauli kernel dispatch (X kernel on a chosen data qubit) is ~microsecond-scale. End-to-end: per-round real-time decoded surface code execution on commodity Apple Silicon.

On the claim "faster than Stim under real-time constraints": Stim has no real-time decoded execution loop. It samples a full circuit, then decodes. There is no public benchmark for "Stim with closed-loop per-round decoder feedback during execution" because Stim doesn't have one, so the comparison is between architectures, not just simulators. This is the wall that just fell. Artifacts: tools/wall4_closedloop_d3r3.json, tools/wall4_closedloop_d3r5.json.

What's not yet validated (honest): these runs are noiseless (zero Pauli noise injected into the circuit), so the decoder always sees zero detectors and applies zero corrections. The wall-time budget is the real claim — decoder + feedback fits comfortably inside the per-round GPU budget. Demonstrating closed-loop correction in the presence of noise (detector events firing and corrections actually applied to data qubits) is the obvious next step; a small demo harness has been written that wires the existing noise injector into the closed-loop trial.

Next (in progress): noise-injected closed-loop runs where the decoder must actually catch errors and corrections flip data qubits on the live GPU state. The architecture (stateful sessions + per-round decode + delta feedback) is already proven on the timing budget; adding realistic noise is the final piece to show a complete real-time decoded surface code loop on commodity Apple Silicon.

Hybrid Loop — Apple Silicon hosts the full stack VALIDATED 2026-05-30

With the Metal endpoint validated against Stim's canonical d=3 syndrome extraction, the next layer is the closed loop: GPU state evolution → mid-circuit measurement → MWPM decoder → logical observable check, all in one runtime. To support this we added Pauli X/Y/Z kernels and measure-and-reset (MR) semantics to quantum_server_metal, then built a driver that does the full thing for d=3 r=3.

Architecture

1. Metal GPU runs the gate sequence (H + staggered CNOT ladders + injected Pauli noise + MR ancilla measurements + final data measurement) on a 17-qubit state vector. State collapses in place on each MR.
2. Raw measurement bits stream back over HTTP — 33 bits per d=3 r=3 shot.
3. Stim's m2d converter turns raw bits into the 24 detector events.
4. pymatching MWPM decodes the detector pattern against the Stim-derived detector error model.
5. Logical observable check: actual flip XOR decoder prediction. Zero = logical state preserved.

Three-stage validation

| Stage | Test | Result | Per-shot wall time |
|---|---|---|---|
| A | Noiseless d=3 r=3: 0 detectors, 0 logical flips, decoder no-op | PASS | 36 ms |
| B | Single X error injected mid-circuit on data qubit: 1 detector fires, decoder identifies and cancels the flip, logical state preserved | PASS | 28 ms |
| C | Statistical, 200 shots at p=0.003 depolarizing: logical error rate from Metal hybrid matches Stim+pymatching reference within 3σ | PASS | 26 ms |

Stage B is the proof the loop is real. A deliberate X error fires exactly one detector event; pymatching reads the detector pattern, predicts the logical flip; the prediction XORed against the actual observable returns 0 — meaning the decoder correctly identified the error type and would have applied the right correction if we were continuing the computation. The state evolution happens on the GPU, the decision-making happens classically, the loop closes.

This is the concrete realization of "Apple hosts the full hybrid stack at research scale." Every gate (including injected Pauli noise), every measurement, every state collapse happens on the M4 GPU. Stim and pymatching only enter for their roles as structural reference (detector formulas) and decoder. Artifact: tools/hybrid_loop_2026-05-30.json.

Side finding — hook-error bug in our custom staggered schedule FOUND, RETIRED

The earlier three-way comparison (Metal vs Stim vs our custom RotatedSurfaceCode.syndrome_extraction_circuit) exposed a bug in our hand-rolled staggered CNOT schedule: data qubits 0, 1, 2, 4, and 7 each get touched by both an X-stab anc→data CNOT and a Z-stab data→anc CNOT in the same time step. The schedule isn't a proper graph coloring; the X-anc superposition leaks into the Z-anc and corrupts what should be deterministic Z-stab outcomes.
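A schedule-level check of this kind is cheap to automate. The sketch below flags any qubit that appears in more than one CNOT within the same time step; the schedule format (a list of time steps, each a list of (control, target) pairs) and the qubit labels are hypothetical and are not taken from RotatedSurfaceCode:

```python
from collections import Counter


def qubits_used_twice(schedule):
    """Map time step -> sorted qubits that appear in more than one CNOT in that step.

    schedule: list of time steps, each a list of (control, target) pairs.
    An empty result means every step is a valid set of disjoint gates.
    """
    conflicts = {}
    for step, cnots in enumerate(schedule):
        counts = Counter(q for pair in cnots for q in pair)
        clashing = sorted(q for q, n in counts.items() if n > 1)
        if clashing:
            conflicts[step] = clashing
    return conflicts


# Toy schedule with hypothetical labels.
toy = [
    [("ax0", "d1"), ("d1", "az0")],  # X-stab anc->data and Z-stab data->anc collide on d1
    [("ax0", "d2"), ("d3", "az0")],  # no qubit reused
]
print(qubits_used_twice(toy))
```

Running a check like this on every generated schedule would have caught the collision before any simulator did, independent of the Pauli-frame tracker.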

The bug was masked for months by our Pauli-frame symbolic tracker, which doesn't actually simulate quantum dynamics: it tracks errors in a frame and reports "no errors → zero syndrome" without catching schedule-level issues. The bug only surfaced when we ran the same circuit through simulators that execute the actual gates (Stim and our Metal state-vector backend), both of which faithfully executed the buggy schedule and produced non-zero syndromes on noiseless |0⟩ data.

The honest call: we retired the custom schedule from the production path. The custom code remains as a teaching artifact (the Pauli-frame execution model is still useful conceptually). Production surface code work uses Stim's canonical circuit driven through the Metal endpoint — which is now validated end-to-end.

Next

Status: Simulated on Mac Mini M4 using exact Stim circuits + MWPM decoding. Threshold 0.7–0.9% confirmed. As of 2026-05-30: Full hybrid surface code loop validated end-to-end on Apple Silicon — Metal GPU runs the gate sequence (H + staggered CNOTs + injected Pauli noise + MR ancilla measurements), classical pymatching MWPM decoder consumes detector events from Stim's m2d converter, logical observable check closes the loop. Three-stage validation passes: noiseless (decoder no-op), single deliberate X error injection (decoder correctly identifies and cancels), and statistical 200-shot match against pure Stim+pymatching reference within 3σ. 26-36 ms per d=3 r=3 shot on M4 GPU. Earlier d=3 r=1 syndrome extraction validation (all 4 deterministic Z-stab bits match Stim every shot, 13 random positions match distributions) still stands. K-weighted MWPM tested against error-rate-weighted MWPM (prior art) on FakeKolkata real calibration data at 100k shots — matches, does not beat (operational claim, not novel decoder).

Reproduce: pip install stim pymatching begump · source at github.com/LacobusGump/music2.0