Every idea we killed. What we tried, how we tested it, why it broke. The failures are where the answers hide.
We test every claim against adversarial input, ablation, known ground truth, and the opposite hypothesis. What survives is on the research pages. What didn't is here. Organized by topic so you don't walk the same dead ends.
PROTEIN PATHOGENICITY
Three-T profiler beats single T_fold scorer
Built T_amyloid (aggregation) and T_polymer (polymerization) alongside T_fold (structural). Combined verdict: max of three T values drives pathogenic/benign call.
79% accuracy vs 83% for T_fold alone. T_amyloid and T_polymer added false positives. 86% FP rate on random mutations. The extra signals were noise, not signal. T_fold alone is the product. The others ship as informational flags only.
15 structural signals beat 2 signals (K × conservation)
Built 15 physics-based scoring signals: surface patches, charge networks, aggregation, flexibility, secondary structure, metal binding, catalytic sites, packing, salt bridges. Each caught specific variants the others missed.
Scored by balanced accuracy on 87 benign + 1,557 pathogenic variants, 15 signals gave 78.3%. Two signals (K × conservation) gave 85.7%. Every "clever" signal we added caught one more pathogenic variant but also flagged more benign variants. The fills were noise. The groove was always K and conservation.
Gene prior (training label frequency) as pathogenicity signal
Used the fraction of known pathogenic variants per gene (p53 = 100%, BRCA1 = 53%) to weight the score. p53 at prior=1.0 gave 99.5% recall.
Circular reasoning. The prior was computed FROM the training labels. At prior=1.0, every variant in every gene gets called pathogenic. Specificity drops to 0%. Replaced with auto-computed protein intolerance from conservation data — not from training labels.
"96.9% accuracy" as a meaningful metric
Reported 96.9% accuracy on 1,594 ClinVar variants.
The dataset was 97.7% pathogenic. A classifier that says "pathogenic" to everything gets 97.7%. Our 96.9% was WORSE than that trivial baseline. Accuracy is meaningless on imbalanced datasets. Switched to balanced accuracy ((recall + specificity) / 2).
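The trap is easy to reproduce. A minimal sketch with hypothetical labels (same 97.7% class skew, not the actual ClinVar rows):

```python
# Hypothetical 97.7%-pathogenic label set; a classifier that always says
# "pathogenic" (1) looks great on raw accuracy, terrible on balanced accuracy.
labels = [1] * 977 + [0] * 23          # 1 = pathogenic, 0 = benign
preds  = [1] * 1000                    # trivial all-pathogenic classifier

def balanced_accuracy(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    recall = tp / labels.count(1)       # sensitivity on pathogenic
    specificity = tn / labels.count(0)  # sensitivity on benign
    return (recall + specificity) / 2

raw = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
print(raw)                             # 0.977 — looks impressive
print(balanced_accuracy(labels, preds))  # 0.5 — chance level
```

Raw accuracy rewards the degenerate classifier; balanced accuracy pins it to chance.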
Ensemble random walk folds improve K accuracy
Ran 30-50 random walk protein folds with different seeds, averaged the contact maps. Theory: systematic bias of one seed averages out across many.
The averaging smoothed out useful bias along with noise. The single fold at seed=42 happened to have correlated errors — wrong in ways that accidentally helped. The ensemble removed both the good and bad errors. Single fold was better.
Machine (Kuramoto oscillators) predicts pathogenicity
Mapped protein to coupled oscillators (one per residue, contacts = coupling edges). Mutated one oscillator's frequency. Measured global order parameter R before and after. Theory: pathogenic mutations destabilize R.
Proteins are too coupled. 393 oscillators with ~1000 edges — changing one oscillator's frequency barely moves global R (ΔR < 0.01). The network absorbs single-site perturbation. Also tried: neighborhood subgraph, multi-channel (4D oscillators), compressed clusters at prime N values. All gave ΔR below noise floor. Phase synchronization ≠ functional disruption.
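The kinematic bound alone shows why. A sketch of the global order parameter: even flipping one of 393 oscillators completely out of phase moves R by at most 2/N, before any relaxation dynamics pull it back.

```python
import cmath

def order_parameter(phases):
    # Kuramoto global order parameter: R = |(1/N) * sum_j exp(i * theta_j)|
    return abs(sum(cmath.exp(1j * t) for t in phases) / len(phases))

N = 393                      # residue count from the text
phases = [0.0] * N           # fully synchronized network: R = 1
r_before = order_parameter(phases)

phases[0] = cmath.pi         # worst case: one oscillator fully out of phase
r_after = order_parameter(phases)

# A single-site perturbation can shift R by at most 2/N ~ 0.005
delta = r_before - r_after
```

Any realistic frequency perturbation moves R far less than this worst-case phase flip, which is already below the 0.01 noise floor we saw.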
Conservation shape (close vs distant orthologs) as signal
Split orthologs into close (mammals) and distant (fish/invertebrates). Measured slope: recently constrained positions (high close, low distant) vs anciently constrained (flat). Theory: FP benign are recently constrained, TP pathogenic are anciently constrained.
The slope signal was real in the raw analysis (gap +0.012) but too noisy when multiplied into the score. Ortholog alignment quality degrades at high evolutionary distance. The BLOSUM local alignment at 15-residue windows can't reliably score distant species. Needs full MSA or HMMER to be useful.
BLOSUM substitution score as pathogenicity multiplier
BLOSUM62 separates FP from TP on average (gap 0.515 — biggest separator found). Used as continuous sigmoid weight: foreign substitutions (BLOSUM << 0) get full score, conservative (BLOSUM > 0) get discounted.
Many pathogenic variants have moderate BLOSUM scores (R→K = +2, D→E = +2). The discount killed them. BLOSUM separates the MEANS but the distributions overlap completely in the -2 to +1 range. Works on average, fails on individuals.
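What the discount looked like, schematically. The sigmoid shape and constants here are illustrative, not the shipped weights:

```python
import math

def blosum_weight(score, midpoint=0.0, steepness=1.0):
    # Hypothetical sigmoid discount: foreign substitutions (score << 0)
    # keep full weight; conservative ones (score > 0) get discounted.
    return 1.0 / (1.0 + math.exp(steepness * (score - midpoint)))

# R->K and D->E are pathogenic in many variants, yet BLOSUM62 scores them +2:
w_conservative = blosum_weight(+2)   # heavily discounted -> missed pathogenics
w_foreign      = blosum_weight(-4)   # drastic swaps keep nearly full weight
```

With these illustrative constants, a pathogenic R→K at BLOSUM +2 loses roughly 88% of its score before any other signal is consulted.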
PDB interface residues separate benign FP from pathogenic TP
Downloaded crystal complex structures for 12 proteins. Extracted exact interface residues (within 5Å of partner chain). Theory: FP benign variants cluster at interaction interfaces where context matters.
Only 5% of FP and 9% of TP are at crystallographic interfaces. Both groups are overwhelmingly NOT at interfaces. The "seats" aren't where the FP problem lives. The tolerance is intrinsic to the protein, not dependent on partners.
Fragility index (multi-substitution variance) separates tolerant from fragile positions
For each position, computed ΔK for all 19 possible substitutions. High variance = fragile (many changes hurt). Low variance = robust (only specific changes hurt). Theory: FP benign are at robust positions.
FP fragility = 0.303, TP fragility = 0.301. Identical. Both FP and TP positions are equally fragile on average. The difference isn't in how many substitutions hurt — it's in something else entirely.
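The index itself, for reference. The ΔK values below are toy numbers, not computed from a structure:

```python
from statistics import pvariance

def fragility(delta_k_by_substitution):
    # Variance of |dK| across the 19 possible substitutions at one position.
    # High variance = a few substitutions hurt badly; low = uniform response.
    assert len(delta_k_by_substitution) == 19
    return pvariance(delta_k_by_substitution)

# Toy positions (hypothetical dK magnitudes, not real data):
robust  = [0.01] * 19                  # every substitution tolerated alike
fragile = [0.9] * 5 + [0.1] * 14       # five specific substitutions hurt
```

On real FP and TP positions this spread came out identical (0.303 vs 0.301), which is what killed the signal.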
Contact redundancy (network rerouting capacity) separates FP from TP
Measured how connected a position's neighbors are to each other (without going through the position). High redundancy = network can reroute = robust = benign.
FP redundancy 0.369, TP redundancy 0.404. Pathogenic positions have HIGHER redundancy. Opposite of hypothesis. Highly connected regions are both important AND well-networked — redundancy doesn't mean tolerance.
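Redundancy here is essentially a local clustering coefficient: the fraction of neighbor-pair contacts that survive if the position is deleted. A sketch on a toy contact graph:

```python
from itertools import combinations

def redundancy(node, adjacency):
    # Fraction of the node's neighbor pairs that contact each other directly,
    # i.e. routes that still exist after the node is removed.
    nbrs = adjacency[node]
    pairs = list(combinations(nbrs, 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if b in adjacency[a])
    return linked / len(pairs)

# Toy undirected contact graph (hypothetical, not a real protein):
adjacency = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}
r0 = redundancy(0, adjacency)   # 2 of 3 neighbor pairs stay connected
```

The failure was that pathogenic positions scored HIGHER on this, so rerouting capacity tracks importance, not tolerance.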
Groove fit (K relative to neighborhood) separates FP from TP
Measured whether each position's contact count matches its local neighborhood's average. Theory: FP benign variants "fit the groove" (K matches neighbors), TP pathogenic disrupt the local pattern.
Benign groove fit 1.010, pathogenic groove fit 1.009. Identical. Both sit perfectly in the pocket. The difference between tolerated and pathogenic isn't whether they fit the local pattern — it's something global.
94.7% balanced accuracy on pathogenicity scoring
Reported 94.7% balanced accuracy on 1,594 ClinVar variants with a 6-signal scorer (K, conservation, ΔK, propagation, functional context, gnomAD). Identified and fixed a 1-indexing bug. Tuned threshold and feature weights on the evaluation data.
The 94.7% was inflated by three compounding errors. (1) The threshold was optimized on the same data used for evaluation — in-sample fitting. (2) AUC-proportional feature weights were computed from the test set. (3) A gene-level confounder: p53 variants (91% of pathogenic data) have systematically higher scores than AR variants, so the scorer was mostly classifying "is this p53?" not "is this variant pathogenic?" Under strict leave-one-gene-out cross-validation with predeclared weights, the honest within-gene AUC is 0.74. Still matches SIFT with zero training, but not the headline we published.
Products of features (stiff × cons, damage × func × chem) improve accuracy
Multiplied structural channels together: stiffness × conservation, damage × functional_proximity × chemistry. Greedy forward selection chose the best product combinations. AUC reached 0.875 on the evaluation set.
0.875 collapsed to 0.610 under leave-one-gene-out cross-validation. The products amplify gene-level confounders: they look great when the channels align in-sample, then explode out-of-sample. Any step that selects features, tunes weights, or chooses thresholds from the evaluation data leaks the answer. Banned: greedy selection, AUC-derived weights, multiplicative stacking on full data.
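The protocol that replaced all of it: leave-one-gene-out, every threshold frozen on the training genes. A stdlib sketch with hypothetical scores (a real pipeline swaps in the actual scorer):

```python
def balanced_acc(pairs, threshold):
    # pairs: list of (score, label); label 1 = pathogenic
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    pos = sum(1 for _, y in pairs if y == 1) or 1
    neg = sum(1 for _, y in pairs if y == 0) or 1
    return (tp / pos + tn / neg) / 2

def leave_one_gene_out(data):
    # data: {gene: [(score, label), ...]}. All tuning happens on the
    # training genes; the held-out gene only sees the frozen threshold.
    results = {}
    for held_out in data:
        train = [p for g, ps in data.items() if g != held_out for p in ps]
        thresholds = sorted({s for s, _ in train})
        best = max(thresholds, key=lambda t: balanced_acc(train, t))
        results[held_out] = balanced_acc(data[held_out], best)
    return results

# Hypothetical per-gene scores (not our real data):
data = {
    "TP53": [(0.9, 1), (0.8, 1), (0.3, 0), (0.2, 0)],
    "AR":   [(0.6, 1), (0.5, 1), (0.4, 0), (0.1, 0)],
}
results = leave_one_gene_out(data)
```

A score scale that only separates classes within one gene shows up immediately as a collapsed fold, which is exactly how the 0.875 became 0.610.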
Deeper MSA (more sequences) improves tolerance signal
Compared shallow MSA (~2,000 sequences from UniRef, capped) vs deep MSA (~10,000 from ColabFold uncapped) vs environmental sequences (BFD/MGnify). Theory: more evolutionary data = more precise substitution tolerance.
Deep was identical to shallow. Environmental was WORSE (0.567 vs 0.648). The first ~2,000 sequences already saturate the column frequencies. Additional metagenomic sequences add distant homologs that blur the tolerance signal with noise from proteins under different functional constraints. More data ≠ better data.
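The saturation is just binomial statistics. Treating sequences as independent draws (an overestimate of the real information, since homologs are correlated), the standard error of a column frequency:

```python
import math

def column_freq_se(p, n_seqs):
    # Standard error of an MSA column amino-acid frequency estimate under
    # an independence assumption; real MSAs are correlated, so the true
    # benefit of extra sequences is even smaller than this suggests.
    return math.sqrt(p * (1 - p) / n_seqs)

se_2k  = column_freq_se(0.5, 2_000)    # ~0.011
se_10k = column_freq_se(0.5, 10_000)   # ~0.005
```

Going from 2,000 to 10,000 sequences shrinks an error bar that is already ~1% by barely a factor of two, which is why deep matched shallow.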
Coevolutionary coupling (mutual information / DCA) predicts pathogenicity
Computed pairwise mutual information between MSA columns for each residue and its spatial neighbors. High MI = co-evolved = allosteric hub = important. Theory: mutations at high-MI positions should be more pathogenic.
Mean AUC 0.302 — ANTI-predicts. Pathogenic variants hit positions with LOW mutual information. High-MI positions are the tightly co-evolved structural core — evolution keeps them locked together. The allosteric “control knobs” for gain-of-function are exactly the positions that AREN’T co-evolved. MI measures structural constraint, not functional vulnerability.
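The MI we computed, in sketch form (toy four-sequence columns, not a real MSA):

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    # MI between two alignment columns:
    #   sum over (a,b) of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    n = len(col_a)
    pa = Counter(col_a)
    pb = Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        mi += (c / n) * math.log2((c * n) / (pa[a] * pb[b]))
    return mi

coupled     = mutual_information("AACC", "LLVV")  # perfectly co-varying: 1 bit
independent = mutual_information("ACAC", "LLVV")  # no co-variation: 0 bits
```

High MI marks the locked-together structural core; the positions pathogenic variants actually hit scored low, hence the anti-prediction.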
Exact Fiedler damage (Δλ₂ by node removal) beats the approximation
Computed the exact change in algebraic connectivity when each residue is removed from the weighted contact graph. Theory: the real number should be more precise than the perturbation theory estimate.
0.369 AUC vs 0.817 for the approximation. The approximation (|v₂|² × degree / λ₂) is BETTER because it combines two signals: WHERE in the topology (Fiedler participation) and HOW CONNECTED locally (degree). The exact Δλ₂ conflates these. The “wrong” formula captures the right physics — a node matters when it’s both topologically critical AND locally connected.
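The approximation, exercised on a toy path graph where the Laplacian eigenpairs are known in closed form (so no eigensolver is needed; a real contact graph needs one):

```python
import math

def fiedler_scores(n):
    # Approximate Fiedler damage score for each node of a path graph P_n:
    #   score_j = |v2_j|^2 * degree_j / lambda_2
    # Path-graph Laplacian eigenpairs have closed forms:
    #   lambda_2 = 2 * (1 - cos(pi/n)),  v2_j = cos(pi * (j + 0.5) / n)
    lam2 = 2 * (1 - math.cos(math.pi / n))
    v2 = [math.cos(math.pi * (j + 0.5) / n) for j in range(n)]
    deg = [1 if j in (0, n - 1) else 2 for j in range(n)]
    return [(v2[j] ** 2) * deg[j] / lam2 for j in range(n)]

scores = fiedler_scores(7)
```

Note how the score is participation times degree: the path's center sits exactly at the v2 zero crossing and scores nothing, while well-connected nodes deep inside one half score highest.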
Structural coupling alone predicts double mutation interaction type
Two mutations in the same protein: if both high-K and structurally close (<15Å), predict synergistic. If far apart, predict additive. Tested on 5 known double mutations (TP53, BRCA1, EGFR, BRAF, PIK3CA).
40% accuracy (2/5). Catches same-region synergy (TP53 175+248 at 12.7Å) but misses cross-domain interactions entirely. BRCA1 61+1775 are 50.8Å apart in different domains — both pathogenic but not structurally coupled. EGFR T790M+L858R is compensatory (resistance undoes activation) — needs functional direction, not just distance. Double mutations need allosteric pathway modeling and gain/loss-of-function annotation, not just K.
COMPUTE & PHYSICS
500 contacts = 500 independent Landauer bits in protein folding
Calculated Landauer cost of protein folding as 500 contacts × 1.85 bits/contact = 925 bits. Predicted TΔS = 1,597 kJ/mol. Claimed proteins pay 22× above Landauer minimum.
14.5× overcounting. Contacts are not independent constraints. Many contacts are redundant (if A touches B and C, and B touches C, only 2 of 3 are independent). Correct model: count BITS PER RESIDUE, not per contact. Core residues (~30%) lose ~2.0 bits, intermediate (~40%) lose ~0.5 bits, surface (~30%) lose ~0.1 bits. Average: 0.67 bits/residue. Lysozyme (129 residues) = 87 bits. Predicted TΔS = 150 kJ/mol. Measured: ~150 kJ/mol. Match: 1.0×. The fix proved the claim MORE strongly — Landauer matches measured conformational entropy exactly when you count independent constraints correctly.
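The corrected arithmetic, end to end. T = 300 K is assumed here; the 0.67 bits/residue average is taken as given from the count above:

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K
N_A = 6.02214076e23         # Avogadro's number, 1/mol
T   = 300.0                 # K (assumed near-ambient temperature)

bits_per_residue = 0.67     # weighted average from the corrected count
n_residues = 129            # lysozyme

bits = bits_per_residue * n_residues             # ~87 bits per molecule
t_delta_s = bits * K_B * T * math.log(2) * N_A   # Landauer cost, J/mol
t_delta_s_kj = t_delta_s / 1000                  # ~150 kJ/mol
```

87 bits at kT ln 2 per bit lands on ~150 kJ/mol, matching the measured conformational entropy.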
57.71 TFLOPS on Mac Mini M4
Measured GPU FMA throughput with 4096 FMAs per element across 4 independent chains. Counted ops as 4096 × 4 chains × 2 (half2) × 2 (mul+add) = 65,536 per element. Added ANE int8 concurrent.
4× op counting error. The "4 chains" were already included in the 4096 FMA count. Correct ops = 4096 × 2 (half2) × 2 (mul+add) = 16,384. Also verified: ILP (1/2/4 independent chains) gives identical throughput on M4 GPU — no pipeline latency to hide. Real GPU peak: 3.5T fp16. Real combined: ~18T. The trampoline dispatch architecture is real (improves workload throughput) but doesn't change the silicon's peak TOPS.
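The accounting error in two lines:

```python
# FMA kernel op accounting. The benchmark issued 4096 FMA instructions per
# element; each FMA on a half2 does 2 lanes x 2 ops (mul + add).
fmas_per_element = 4096

wrong = fmas_per_element * 4 * 2 * 2   # double-counts the 4 chains: 65,536
right = fmas_per_element * 2 * 2       # chains already inside 4096:  16,384

inflation = wrong / right              # the 4x headline error
```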
Tetrahedral pipeline (3-stage split) beats monolithic kernel
Split computation into 3 stages (load → compute → reduce) with separate encoders per stage. Theory: cache stays warm between stages, each stage runs on data the previous stage loaded.
0.50× of monolithic. Three separate makeComputeCommandEncoder() calls per round creates overhead that dominates the benefit. The GPU pipeline flushes between encoders. The implicit pipeline in a single encoder is faster than explicit staging.
Golden spiral chunk sizing beats uniform chunks
Sized GPU dispatch chunks by golden angle spacing instead of uniform. Theory: different-sized dispatches fill different pipeline stages.
0.99×. The GPU processes each dispatch independently regardless of size. Pipeline stages fill the same way whether the dispatch is 357K or 577K threads. The spiral is structural, not temporal.
half4 (4-wide) beats half2 (2-wide) vectors
Packed 4 fp16 values per register instead of 2. Theory: wider SIMD = more ops per instruction.
Slightly worse (-1%). The M4 GPU compiler already optimizes half2 operations. Wider packing adds register pressure without improving throughput. The ALU width is fixed.
CPU NEON contributes meaningful TFLOPS alongside GPU + ANE
Ran CPU matrix multiplication concurrent with GPU trampoline dispatch.
~0.00 TFLOPS measured. Naive Python/C matmul is too slow to register. Accelerate/BLAS might help but CPU is better used for orchestration than computation on M4.
NMC cathode Pareto frontier has zero cobalt
Screened 66 NMC compositions. Showed 4 named compositions on the Pareto frontier, all with zero or minimal cobalt. Claimed this matches industry trend.
Full Pareto analysis: 17 of 28 Pareto-optimal compositions INCLUDE cobalt. The zero-cobalt result was an artifact of showing only named compositions (811, 622, 532, 111) instead of the full frontier. Under alternative stability weights (cobalt more stabilizing), 19/21 Pareto points include cobalt. Industry confirms: cobalt is being reduced but Samsung SDI NMC 622 and Panasonic NCA both still use it. The stability weights are assumed, not measured.
Perovskite screening uses correct bandgaps
Screened 1,352 perovskite compositions. Reported FASnI3 bandgap as 1.85 eV, scored it zero. Claimed FAPbI3 wins from "pure physics."
Literature FASnI3 bandgap is 1.41 eV, not 1.85 eV. Our formula was wrong for Sn-based perovskites. FASnI3 is actually CLOSER to the optimal 1.34 eV than FAPbI3 (1.48 eV). FAPbI3 wins because Sn oxidizes in air (stability filtering saves us), not because our bandgap model is accurate. Also missing: mixed compositions (FA0.95Cs0.05PbI3) hold actual efficiency records but weren't in our search space.
CTNNB1 is the master hub of cancer signaling
Built 9-pathway network with 19 cross-pathway edges. CTNNB1 had degree=10, K=1.000. Called it "the master hub." Knockout cascaded through 5 pathways, 17 proteins.
We added 5 crosstalk edges TO CTNNB1 (from SMAD3, SMAD4, YAP1, TAZ, GLI1), doubling its degree from 5 to 10. We made it the hub by construction. In STRING database, TP53 has ~6,000 interaction partners vs CTNNB1's ~1,200. TP53 is the real master hub. With TP53 crosstalk edges instead (8 known), TP53 would have degree 13, beating CTNNB1. Our ranking reflects our edge selection, not biology.
29/30 threat detection validates the security engine
Designed 21 combination signatures matching 30 attack scenarios. Achieved 29/30 detection, 0/10 false positives.
Self-referential: we designed the signatures and test cases simultaneously. The 29/30 is a self-fulfilling prophecy. 0/10 FP used 10 hand-picked normal scenarios, not real workloads. Five novel attack patterns (living-off-the-land, slow exfil, insider threat, encrypted C2, subtle supply chain) would all evade. Any attacker who knows the required features can avoid them. V1 weighted sum with tuned per-category thresholds was never tested as an alternative.
ANE classifier pipeline accelerates our screening engines
Apple Neural Engine does 15 TOPS int8. Built tiny classifiers (6-10 features, 2-layer, <1KB) for mutation scoring, battery screening, perovskite screening. Theory: ANE dispatch beats CPU for classification step.
Our models are too small (<1KB). ANE dispatch overhead (~50μs) exceeds model compute time. The bottleneck is feature computation (contact map lookup, conservation query, propagation), not the final score formula. ANE helps only in batch mode with precomputed features (>1000 samples). For single-variant online scoring, CPU is faster. The right tool for the job, not the fastest tool available.
NEUROSCIENCE / AUTISM MODELING
Toy random matrix model captures ASD/typical separation
Built random matrix model of neural connectivity. Theory: random coupling at different densities would show measurable ASD vs typical separation via spectral properties.
Random matrices don't capture the biology. ASD overcoupling is not uniform random coupling — it's selective pruning failure. The toy model couldn't reproduce the separation because the relevant structure (hierarchical pruning) was absent from the model.
Global R separates ASD from typical at K=1.0
Measured global order parameter R (Kuramoto synchronization) on simulated neural networks at K=1.0. Theory: ASD networks should have different R than typical networks due to overcoupling.
At K=1.0, both ASD and typical topologies synchronize completely. Global R saturates near 1.0 for both. The signal is drowned. K=1.0 is above the synchronization threshold for both network types. Would need K near the critical point to see separation, but the critical K depends on the topology — circular reasoning.
Harmonic ratio separates ASD from typical networks
Computed the ratio of harmonic frequencies in the oscillator dynamics on ASD vs typical topologies. Theory: overcoupled networks produce different harmonic structure.
t-statistic = -1.58, variance too high. The harmonic ratio fluctuates too much between runs for meaningful separation. The signal-to-noise ratio is insufficient. Even when the means differ, the distributions overlap completely.
Ego feedback model differentiates topologies
Added ego-network feedback: each node adjusts its coupling based on its local neighborhood density. Theory: ASD (dense neighborhoods) and typical (sparse neighborhoods) would diverge under self-reinforcing feedback.
The feedback didn't differentiate topologies. Both dense and sparse neighborhoods converge to the same attractor under the ego feedback rule. The local feedback loop was too simple to capture the multi-scale nature of pruning.
Module walls form and persist after drives removed
Applied external driving forces to create modular structure (simulating developmental pruning). Removed drives. Theory: the modular walls would persist — structural memory from transient coupling.
Module walls form during driving but don't persist after drives are removed. The system relaxes back to its natural topology. The Kuramoto model has no structural memory — coupling is fixed, only phases evolve. Real neural pruning physically removes synapses (structural change). Phase dynamics can't simulate structural deletion.
PHYSICS
Cochlea is a golden spiral
Hypothesized the inner ear's shape follows the golden ratio, connecting music perception to φ.
Manoussaki et al. 2006 measured real cochleae. The spiral is logarithmic but NOT golden. The ratio varies between species and doesn't converge on φ. Beautiful idea, empirically wrong.
Dark matter as Landauer heat from cosmic computation
If the universe computes, every bit erasure costs kT ln(2). The accumulated heat would look like missing mass.
Energy conservation violation. Landauer heat doesn't create new mass-energy — it converts existing energy to heat. The heat is already counted in the energy budget. Dark matter requires new mass, not redistributed energy.
Rainbow operators V1 through V4 for prime distribution
Five different Schrödinger-type operators designed to have eigenvalues at zeta zero positions. Each used a different potential function.
All five failed. Schrödinger operators process information additively (superposition). Prime decomposition is multiplicative (Euler product). The natural language is scattering/transfer, not eigenstates. Direct quantization using the multiplicative structure works; operator approaches don't.
SUSY as Kuramoto phase transition
Hypothesized supersymmetry breaking maps to the Kuramoto synchronization transition at K = K*.
Unit error. Kuramoto K is dimensionless (~1.868). SUSY breaking scale is in GeV (~10³). The analogy confuses coupling strength with energy. There's no dimensional bridge.
137 = Spin(16) + SU(3) + U(1)
Decomposed 137 into gauge group dimensions: 120 + 8 + 1 + 8 = 137.
Cherry-picked. Excluded SU(2) (dimension 3) which breaks the sum. Including all Standard Model gauge groups gives 120 + 8 + 3 + 1 = 132, not 137. Numerology, not physics.
K × N = 256 for all N
K* × 137 ≈ 256. Hypothesized this product is constant across all oscillator counts.
Only works at N = 137. At other N values, K* changes but K*×N varies from 34 to 256. The relationship is specific to N = 137, not universal.
Mass spectrum follows α^n ladder
Particle masses as powers of the fine structure constant: m ∝ α^n for integer n.
94% of random bases do equally well. The "fit" is an artifact of having many particles and many powers to choose from. Monte Carlo simulation showed the pattern is not statistically significant.
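The null test in sketch form. The base range and random seed here are placeholders; the point is the procedure, not the 94% figure, which came from our full runs:

```python
import math
import random

def ladder_residual(masses, base):
    # How well m ~ base**n fits with the best integer n per mass:
    # mean squared distance of log(m)/log(base) from its nearest integer.
    total = 0.0
    for m in masses:
        x = math.log(m) / math.log(base)
        total += (x - round(x)) ** 2
    return total / len(masses)

# A few known particle masses in GeV (electron, muon, tau, proton, Z, W):
masses = [0.000511, 0.1057, 1.777, 0.938, 91.19, 80.38]
alpha = 1 / 137.035999

random.seed(0)
trial_bases = [random.uniform(0.001, 0.1) for _ in range(1000)]
r_alpha = ladder_residual(masses, alpha)
frac_as_good = sum(
    1 for b in trial_bases if ladder_residual(masses, b) <= r_alpha
) / len(trial_bases)
```

frac_as_good is the fraction of random bases that fit at least as well as α; anything far from zero means the ladder is not significant.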
K predicts fluid behavior
Applied K/R/E/T to 2D and 3D Navier-Stokes simulations. Measured K and R across forcing configurations.
K DESCRIBES but doesn't PREDICT. R = 1/φ is not universal in turbulence (R varies 0.02 to 0.81 depending on forcing). K×Re is not constant. K doesn't predict spectral exponent. The framework measures fluids accurately but has no predictive power beyond Reynolds number alone. The GPU solver (82M pts/sec) ships. The K/R framework for fluids does not.
Every failure is documented because they save time for anyone walking the same paths. And because the wrong answers often contain the right questions. The BLOSUM failure showed us that evolutionary substitution tolerance IS the strongest separator — we just can't use it as a multiplier. The Machine failure showed us that phase synchronization and functional disruption are different regimes — which told us exactly where the 19% accuracy gap lives.
Jim McCandless, beGump LLC. The best ideas are the ones that survive trying to kill them.