Quantum computers are noisy. Every qubit has a ~0.1–1% chance of flipping per operation. A calculation that needs a million operations on a million qubits fails completely — the errors compound faster than the computation.
Error correction solves this by encoding one logical qubit from many physical ones. If one physical qubit flips, the others vote it down. The logical qubit survives. The more physical qubits you use, the more errors you can survive, and below a threshold error rate the improvement is exponential.
We built a distance-3/5/7 surface code simulator that runs on a $499 Mac Mini, using Stim (Google's stabilizer circuit simulator) and PyMatching (a minimum-weight perfect matching decoder) to locate the threshold: the error rate below which adding more qubits always helps.
Threshold: 0.7–0.9% physical error rate. Below this, every step up in code distance exponentially suppresses the logical error rate. At 0.1% physical error (achievable on current hardware), the measured logical error rate falls from 0.054% at d=3 to 0.012% at d=5 and 0.001% at d=7.
Each distance step multiplies suppression. That's the miracle of quantum error correction — you trade physical qubits for logical reliability and the trade gets exponentially better as you scale.
Standard MWPM assumes all qubits are equally reliable. They're not. Real hardware has qubit-to-qubit variation — some have longer T2 times, some are noisier. K-weighted MWPM uses gump.k_measure() to assign coupling strength to each qubit, then weights the matching graph accordingly.
Honest verdict (2026-05-30, FakeKolkata three-way test, 100k shots per arm): K-weighted MWPM is statistically indistinguishable from error-rate-weighted MWPM (Higgott 2022 + Google surface code papers) when both are derived from the same calibration data. We do not beat the prior art on this hardware snapshot at d=3.
What survives: K-weighted MWPM matches the explicit-calibration decoder using only K (coherence + inverse-error + coupling-graph term). This means a GUMP-derived K proxy can stand in for a full per-gate calibration table when one isn't dumped every hour — useful for operational settings with partial observables, not a novel decoder claim.
The same K that describes music, brain synchronization, and protein folding produces a usable noise proxy for quantum decoders. That's a real operational story; it's not a new decoder.
The rotated surface code encodes 1 logical qubit from d² physical data qubits plus (d²-1) ancilla qubits for syndrome measurement. Distance d corrects up to ⌊(d-1)/2⌋ errors per round.
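The counts above can be tabulated directly. This is a minimal sketch; the function name `rotated_surface_code_resources` is illustrative and not part of any library.

```python
def rotated_surface_code_resources(d: int) -> dict:
    """Qubit counts for a distance-d rotated surface code (d odd, d >= 3)."""
    if d < 3 or d % 2 == 0:
        raise ValueError("distance must be an odd integer >= 3")
    data = d * d          # d^2 data qubits
    ancilla = d * d - 1   # d^2 - 1 syndrome-measurement ancillas
    return {
        "distance": d,
        "data_qubits": data,
        "ancilla_qubits": ancilla,
        "total_qubits": data + ancilla,
        "correctable_errors": (d - 1) // 2,  # floor((d - 1) / 2)
    }


for d in (3, 5, 7):
    print(rotated_surface_code_resources(d))
```

For d=3 this gives 9 data qubits plus 8 ancillas, the same 17-qubit circuit used in the Metal validation further down.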
100,000 shots per point. Depolarizing noise model. Stim circuit: surface_code:rotated_memory_z, rounds=d. Table cells are logical error rates.
| p_phys | d=3 (9q) | d=5 (25q) | d=7 (49q) | verdict |
|---|---|---|---|---|
| 0.10% | 0.054% | 0.012% | 0.001% | ✓ suppressed |
| 0.30% | 0.422% | 0.145% | 0.044% | ✓ suppressed |
| 0.50% | 1.066% | 0.741% | 0.389% | ✓ suppressed |
| 0.70% | 1.935% | 1.814% | 1.528% | ✓ suppressed |
| 0.90% | 3.128% | 3.550% | 3.673% | ✗ above threshold |
| 1.00% | 3.735% | 4.864% | 5.339% | ✗ above threshold |
| 3.00% | 22.35% | 37.64% | 46.47% | ✗ above threshold |
Threshold: between 0.7% and 0.9%. Current superconducting hardware (IBM, Google) operates at ~0.1–0.3% — solidly below threshold.
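One way to read the bracket off the table is to find where the ordering d=3 > d=5 > d=7 stops holding. The sketch below is illustrative only: it hard-codes the table values, and `bracket_threshold` is a hypothetical helper, not part of the project code.

```python
# Logical error rates (%) copied from the table above: p_phys (%) -> (d=3, d=5, d=7)
TABLE = {
    0.10: (0.054, 0.012, 0.001),
    0.30: (0.422, 0.145, 0.044),
    0.50: (1.066, 0.741, 0.389),
    0.70: (1.935, 1.814, 1.528),
    0.90: (3.128, 3.550, 3.673),
    1.00: (3.735, 4.864, 5.339),
    3.00: (22.35, 37.64, 46.47),
}


def bracket_threshold(table):
    """Return (last p where each larger distance is strictly better, first p where it is not)."""
    below, above = None, None
    for p in sorted(table):
        rates = table[p]
        suppressed = all(a > b for a, b in zip(rates, rates[1:]))
        if above is None:
            if suppressed:
                below = p
            else:
                above = p
    return below, above


print(bracket_threshold(TABLE))
```

This only brackets the crossing between two sampled points. Narrowing it further would need a finer sweep of p between them.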
Below threshold, logical error rate scales as p_L ~ (p/p_th)^((d+1)/2):
| Distance | Scaling exponent | At p=0.3% | Suppression vs d=3 |
|---|---|---|---|
| d=3 | p² | 0.422% | — |
| d=5 | p³ | 0.145% | 2.9× |
| d=7 | p⁴ | 0.044% | 9.6× |
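The suppression column follows directly from the measured rates. Under the scaling law above, the ratio between consecutive distances, p_L(d) / p_L(d+2), reduces to p_th / p, so the per-step factor should be roughly constant. A small sketch that recomputes both from the table values (helper names are illustrative):

```python
# Logical error rate (%) at p = 0.3%, from the table above
rates_at_0p3 = {3: 0.422, 5: 0.145, 7: 0.044}


def suppression_vs_d3(rates):
    """Factor by which each distance improves on d=3."""
    base = rates[3]
    return {d: base / r for d, r in rates.items()}


def per_step_factor(rates):
    """Ratio between consecutive distances; the scaling law predicts p_th / p for each."""
    ds = sorted(rates)
    return {(a, b): rates[a] / rates[b] for a, b in zip(ds, ds[1:])}


for d, factor in suppression_vs_d3(rates_at_0p3).items():
    print(f"d={d}: {factor:.1f}x vs d=3")
for step, factor in per_step_factor(rates_at_0p3).items():
    print(f"{step}: {factor:.2f}x per step")
```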
Standard MWPM assigns uniform edge weights w = -log(p/(1-p)). K-weighted MWPM adjusts the weight per qubit using a GUMP-derived noise proxy.
We ran the head-to-head comparison: three decoders (uniform MWPM, error-rate-weighted MWPM, K-weighted MWPM) on FakeKolkata's real per-qubit calibration snapshot at 100,000 shots per arm. Wilson 95% CIs were computed for every point.
| Decoder | Source of weights | vs error-rate-weighted |
|---|---|---|
| Uniform MWPM | none (all weights equal) | loses to both, baseline strawman |
| Error-rate-weighted MWPM | explicit per-gate calibration | — (prior art, Higgott 2022) |
| K-weighted MWPM | coherence + inverse-error + coupling-graph proxy | statistically indistinguishable (matches, slightly worse at highest p) |
Verdict: K-weighted is not a novel decoder beating prior art. It is a different routing to the same operating point as the calibrated decoder, using a derived proxy instead of explicit calibration tables. That's an operational claim, not a research claim.
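For reference, the Wilson 95% interval used in comparisons like this can be computed in a few lines. This is a generic sketch of the standard formula, not the project's own analysis script. The example count of 54 failures is simply 0.054% of 100,000 shots (the d=3, p=0.10% entry in the threshold table), and the overlap check is a rough screen rather than a formal significance test.

```python
import math


def wilson_interval(failures: int, shots: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if shots <= 0:
        raise ValueError("shots must be positive")
    phat = failures / shots
    denom = 1.0 + z * z / shots
    centre = (phat + z * z / (2 * shots)) / denom
    half = z * math.sqrt(phat * (1 - phat) / shots + z * z / (4 * shots * shots)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


low, high = wilson_interval(54, 100_000)
print(f"54 / 100000 failures: [{low:.5%}, {high:.5%}]")
```

Unlike the naive normal approximation, the Wilson interval stays well defined when the failure count is zero, which matters for the low-p points.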
d=3 r=3 and d=5 r=5 under uniform depolarizing noise (Stim's rotated_memory_z with after_clifford_depolarization, after_reset_flip_probability, before_measure_flip_probability, before_round_data_depolarization all set to p). 10,000 shots per point, pymatching MWPM on Stim's auto-generated detector error model.

| p_phys | d=3 r=3 | d=5 r=5 | encoding helps? |
|---|---|---|---|
| 0.05% | 4.0 × 10⁻⁴ | 0 / 10k | yes, d=5 < 10⁻⁴ |
| 0.10% | 4.0 × 10⁻⁴ | 1.0 × 10⁻⁴ | yes, d=5 is 4× better |
| 0.20% | 2.8 × 10⁻³ | 1.8 × 10⁻³ | yes |
| 0.50% | 1.88 × 10⁻² | 1.56 × 10⁻² | yes, marginal |
| 0.80% | 4.23 × 10⁻² | 4.52 × 10⁻² | crossing (threshold) |
| 1.00% | 5.57 × 10⁻² | 8.44 × 10⁻² | no, above threshold |
| 2.00% | 1.70 × 10⁻¹ | 3.01 × 10⁻¹ | no |
Threshold sits between 0.5% and 0.8% physical error rate for this circuit-level depolarizing noise model. At low p, d=5 suppresses below d=3 (the encoding genuinely helps); the two curves cross near threshold, and above it d=5 fails harder. This is the canonical surface code threshold plot, reproduced from first principles in 30 seconds of Mac Mini compute. Artifact: tools/threshold_stim_2026-05-30.json.
The Metal backend exposes an arbitrary-circuit endpoint that accepts a JSON sequence of H, CNOT, and MEASURE operations and returns measurement bits plus the post-collapse state vector. Validated in two passes against Stim ground truth.
Mid-circuit Z-basis measurement + state collapse was added to the GPU kernel (host-side projection after each MEASURE op, in-place renormalization). We then drive Stim's canonical d=3 r=1 circuit (17 qubits, 8 H + 24 CNOT + 17 MEASURE ops including 8 measure-and-reset for syndrome ancillas) through both Stim's compiled sampler and the Metal /circuit endpoint at 200 shots each, comparing the resulting 17-bit measurement strings.
This is the milestone the day was aiming at. The GPU path now does the full thing: H + staggered CNOT ladders + MEASURE + collapse, validated bit-for-bit on the deterministic measurements and distribution-for-distribution on the random ones, against Stim as the independent oracle. Artifact: tools/d3_full_syndrome_validation_2026-05-30.json.
The hybrid loop below (one HTTP call = entire circuit) cleared the first bar. The next bar is the architecture that real quantum computers run on: per-round measurement → decoder → correction → next round, where the state lives on the device and only classical bits cross the wire. We extended quantum_server_metal with three new endpoints — /session_start, /session_run, /session_end — so the GPU state vector persists across calls without ever serializing back to the client.
The architectural breakthrough: state never leaves the GPU between rounds. An earlier version that shipped the 1.4 MB base64 state vector back-and-forth per round paid ~25 ms per round in marshaling — making real-time three times slower than batch and producing a working-but-expensive proof of concept. The session-ID approach drops the cost to ~1-2 ms per round (pure HTTP overhead, no state transport), bringing real-time within striking distance of batch on the wall clock.
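The 1.4 MB figure is consistent with a 17-qubit state vector, assuming 8 bytes per amplitude (complex64). That storage format is an assumption, not something stated here. A quick back-of-the-envelope check:

```python
def state_vector_payload_bytes(num_qubits: int, bytes_per_amplitude: int = 8) -> dict:
    """Raw and base64-encoded size of a full state vector.

    bytes_per_amplitude=8 assumes complex64 amplitudes (an assumption).
    """
    raw = (2 ** num_qubits) * bytes_per_amplitude
    encoded = 4 * ((raw + 2) // 3)  # base64 emits 4 characters per 3 input bytes, padded
    return {"raw_bytes": raw, "base64_bytes": encoded}


sizes = state_vector_payload_bytes(17)
print(sizes)
print(f"{sizes['base64_bytes'] / 1e6:.2f} MB per transfer")
```

The payload doubles with every added qubit, which is why keeping the state resident on the GPU matters more as the code distance grows.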
The decoder fits inside the per-round budget: pymatching MWPM runs in <1 ms for d=3, so closing the loop with per-round decode + Pauli correction feedback adds ~3 ms total at r=3 (well below the per-round GPU time). The foundation is now there for actual real-time decoder feedback on Apple Silicon at research scale: per-round GPU work + per-round decode + correction injection back into the persistent GPU state, all under ~15 ms per round.
Architecture validated: stateful sessions, per-round execution, correct physics. Artifacts: tools/wall4_session_d3r3.json, tools/wall4_session_d3r5.json.
The next layer: actually close the loop. After each round's syndrome bits arrive, parse them into detector events using Stim's detector formulas, run pymatching MWPM on accumulated detectors, compute the delta vs the previous round's correction, and inject the new Pauli corrections (X/Z on data qubits) back through the same live GPU session before the next round begins. This is the architecture a real-time-decoded quantum computer runs.
| Pipeline | d=3 r=3 wall time | d=3 r=5 wall time | Per-round decode | Per-round feedback |
|---|---|---|---|---|
| Real-time (stateful, decode at end) | 35.5 ± 10.3 ms | 53.5 ± 9.3 ms | — | — |
| Closed-loop (per-round decode + feedback) | 31.2 ± 1.4 ms | 47.1 ± 2.0 ms | 22-33 µs | 16-20 µs |
| Batch reference (single-shot) | 29.9 ± 1.0 ms | 44.9 ± 1.6 ms | — | — |
| Δ closed-loop vs batch | +1.3 ms total | +2.2 ms total | — | — |
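What this shows: the full architecture (stateful GPU session + per-round measurement + per-round MWPM decode + per-round Pauli correction injected back into the live GPU state) adds 1.3 ms total at r=3 and 2.2 ms total at r=5 versus the single-shot batch reference. Per-round decode plus feedback costs roughly 40–50 µs combined (22–33 µs decode, 16–20 µs feedback), which is small next to the roughly 10 ms of wall time each simulated round takes (31.2 ms / 3 rounds, 47.1 ms / 5 rounds). Surface code rounds on real hardware range from about a microsecond on superconducting devices to milliseconds on slower platforms, so the decode+feedback step fits the millisecond end of that range but does not yet meet microsecond-scale round times.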
The decoder is computing detectors per-round (from Stim's parsed DETECTOR formulas, walking the bit history as it arrives), running MWPM on the partial detector array, computing the correction delta vs prior round, and injecting X/Z ops back through the live session — all while the GPU state for the next round is being prepared. Pymatching's MWPM on the d=3 graph runs in tens of microseconds; the Pauli kernel dispatch (X kernel on a chosen data qubit) is ~microsecond-scale. End-to-end: per-round real-time decoded surface code execution on commodity Apple Silicon.
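The two bookkeeping steps in that loop, turning syndrome bits into detector events and working out which corrections are new, can be sketched in plain Python. This is a simplified illustration: it treats every detector as the XOR of one stabilizer's consecutive measurements, with the first round compared against zero. In the real pipeline the detector definitions come from Stim's DETECTOR instructions, and stabilizers whose first outcome is random have no first-round detector. The helper names are hypothetical.

```python
def detector_events(syndrome_rounds):
    """XOR each round's syndrome bits with the previous round's (round 0 vs all zeros)."""
    events = []
    previous = [0] * len(syndrome_rounds[0])
    for current in syndrome_rounds:
        events.append([a ^ b for a, b in zip(previous, current)])
        previous = current
    return events


def correction_delta(previous_correction, new_correction):
    """Indices of data qubits whose correction bit changed since the last round."""
    return [
        i
        for i, (old, new) in enumerate(zip(previous_correction, new_correction))
        if old != new
    ]


# A stabilizer that flips in round 1 and stays flipped fires a detector only once.
rounds = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
print(detector_events(rounds))
print(correction_delta([0, 0, 0], [0, 1, 0]))
```

Injecting only the delta keeps the feedback step to a few Pauli kernel dispatches per round, instead of re-applying the whole accumulated correction each time.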
On the claim "faster than Stim under real-time constraints": there is nothing to benchmark against, because Stim has no real-time decoded execution loop. It samples a full circuit, and decoding happens afterwards in a separate tool such as PyMatching. There is no public benchmark for "Stim with closed-loop per-round decoder feedback during execution" because Stim doesn't have one, so the comparison is between architectures, not just simulators. This is the wall that just fell. Artifacts: tools/wall4_closedloop_d3r3.json, tools/wall4_closedloop_d3r5.json.
What's not yet validated (honest): these runs are noiseless (zero Pauli noise injected into the circuit), so the decoder always sees zero detectors and applies zero corrections. The wall-time budget is the real claim — decoder + feedback fits comfortably inside the per-round GPU budget. Demonstrating closed-loop correction in the presence of noise (detector events firing and corrections actually applied to data qubits) is the obvious next step; a small demo harness has been written that wires the existing noise injector into the closed-loop trial.
Next (in progress): noise-injected closed-loop runs, where the decoder must actually catch errors and the corrections flip data qubits on the live GPU state. The architecture is already proven on the timing budget; realistic noise is the final piece needed to show a complete real-time decoded surface code loop on commodity Apple Silicon.
With the Metal endpoint validated against Stim's canonical d=3 syndrome extraction, the next layer is the closed loop: GPU state evolution → mid-circuit measurement → MWPM decoder → logical observable check, all in one runtime. To support this we added Pauli X/Y/Z kernels and measure-and-reset (MR) semantics to quantum_server_metal, then built a driver that does the full thing for d=3 r=3.
| Stage | Test | Result | Per-shot wall time |
|---|---|---|---|
| A | Noiseless d=3 r=3: 0 detectors, 0 logical flips, decoder no-op | PASS | 36 ms |
| B | Single X error injected mid-circuit on data qubit: 1 detector fires, decoder identifies and cancels the flip, logical state preserved | PASS | 28 ms |
| C | Statistical, 200 shots at p=0.003 depolarizing: logical error rate from Metal hybrid matches Stim+pymatching reference within 3σ | PASS | 26 ms |
Stage B is the proof the loop is real. A deliberate X error fires exactly one detector event; pymatching reads the detector pattern, predicts the logical flip; the prediction XORed against the actual observable returns 0 — meaning the decoder correctly identified the error type and would have applied the right correction if we were continuing the computation. The state evolution happens on the GPU, the decision-making happens classically, the loop closes.
This is the concrete realization of "Apple hosts the full hybrid stack at research scale." Every gate (including injected Pauli noise), every measurement, every state collapse happens on the M4 GPU. Stim and pymatching only enter for their roles as structural reference (detector formulas) and decoder. Artifact: tools/hybrid_loop_2026-05-30.json.
The earlier three-way comparison (Metal vs Stim vs our custom RotatedSurfaceCode.syndrome_extraction_circuit) exposed a bug in our hand-rolled staggered CNOT schedule: data qubits 0, 1, 2, 4, and 7 each get touched by both an X-stab anc→data CNOT and a Z-stab data→anc CNOT in the same time step. The schedule isn't a proper graph coloring; the X-anc superposition leaks into the Z-anc and corrupts what should be deterministic Z-stab outcomes.
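A schedule-level check of this kind is cheap to automate. The sketch below flags any qubit touched by more than one CNOT within a single time step. The layer format and qubit labels are hypothetical, and the toy schedule is not the project's actual one.

```python
from collections import Counter


def schedule_conflicts(layers):
    """Map each time step to the qubits used by more than one gate in that step.

    layers: list of time steps, each a list of (control, target) CNOT pairs.
    """
    conflicts = {}
    for step, gates in enumerate(layers):
        counts = Counter(q for gate in gates for q in gate)
        clashing = sorted(q for q, n in counts.items() if n > 1)
        if clashing:
            conflicts[step] = clashing
    return conflicts


# Toy schedule: in step 0, data qubit "d0" is the target of an X-ancilla CNOT
# and the control of a Z-ancilla CNOT at the same time.
toy_schedule = [
    [("x0", "d0"), ("d0", "z0"), ("x1", "d3")],
    [("x0", "d1"), ("d2", "z0")],
]
print(schedule_conflicts(toy_schedule))
```

An empty result means each time step is a valid parallel layer. It does not prove the stabilizer ordering is correct, only that no qubit is double-booked.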
The bug was masked for months by our Pauli-frame symbolic tracker, which doesn't actually simulate quantum dynamics: it tracks errors in a frame and reports "no errors → zero syndrome" without catching schedule-level issues. The bug only surfaced when we ran the same circuit through simulators that evolve the actual quantum state (Stim's stabilizer simulation and our Metal GPU state vector), both of which faithfully executed the buggy schedule and produced non-zero syndromes on noiseless |0⟩ data.
The honest call: we retired the custom schedule from the production path. The custom code remains as a teaching artifact (the Pauli-frame execution model is still useful conceptually). Production surface code work uses Stim's canonical circuit driven through the Metal endpoint — which is now validated end-to-end.
pip install stim pymatching begump · source at github.com/LacobusGump/music2.0