Mutation Scanner

Variant pathogenicity + GoF/LoF regime detection — 1,594 variants, 15 proteins, zero training
JIM’S OVERSIMPLIFICATION

Some amino acids are bridges holding two parts of a protein together. Mutate the bridge, the protein falls apart. But some mutations do not break anything — they flip switches. Knowing whether the mutation is breaking a wall or flipping a switch is the whole game. We built a detector that figures out which ruler to use before scoring.

WHAT THIS DOES

You have a genetic mutation. You want to know: does it matter? The scanner answers by asking two things. First: how structurally important is this position? (Fiedler damage — how much the protein's connectivity collapses when you pull one node.) Second: has evolution preserved this position? If every species from fish to humans has the same amino acid here, it is probably important.
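A minimal sketch of the first question, assuming "Fiedler damage" means the relative drop in algebraic connectivity (the second-smallest eigenvalue of the graph Laplacian) of a residue contact graph when one node is removed. The scanner's exact definition and normalization are not stated here, so the function names and the toy graph are illustrative only.

```python
import numpy as np

def fiedler_value(adj: np.ndarray) -> float:
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    return float(eigenvalues[1])

def fiedler_damage(adj: np.ndarray, node: int) -> float:
    """Relative drop in algebraic connectivity when `node` is removed.

    Assumed definition: (before - after) / before. 1.0 means the graph
    falls apart; values near 0 mean the node was not load-bearing.
    """
    before = fiedler_value(adj)
    if before <= 1e-12:
        raise ValueError("graph is already disconnected")
    keep = [i for i in range(adj.shape[0]) if i != node]
    after = fiedler_value(adj[np.ix_(keep, keep)])
    return (before - after) / before

# Toy contact graph: triangles (0, 1, 2) and (4, 5, 6) joined through node 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = np.zeros((7, 7))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

bridge = fiedler_damage(adj, 3)  # removing the bridge disconnects the graph
corner = fiedler_damage(adj, 0)  # removing a triangle corner does not
print(f"bridge damage: {bridge:.3f}, corner damage: {corner:.3f}")
```

In this toy graph the bridge node scores a damage of 1.0 because its removal splits the graph in two, while the corner node scores lower. That is the "pull one node" intuition in miniature.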

But there is a deeper problem. Loss-of-function mutations follow one rule (structural damage predicts disease). Gain-of-function mutations follow the opposite rule (disease mutations hit control sites, not load-bearing walls). If you do not know which regime you are in, your scores are meaningless. The regime detector addresses this: it assigns the correct regime to 8 of 10 known-regime genes, with zero training data.

HOW WELL IT WORKS

0.74 AUC across 6 disease proteins (leave-one-gene-out). That is on par with SIFT (2001, 0.69–0.74) and well below AlphaMissense (0.94–0.96 AUC, trained on 100M sequences). The value is not raw accuracy: every score is traceable to a physical mechanism. Fiedler network damage alone achieves 0.82 AUC. One number. No training. Pure graph theory.


THE NUMBERS

Regime-aware scorer:
Within-gene AUC: 0.74 (mean across 6 proteins, leave-one-gene-out)

Per-gene:
EGFR: 0.92 | RET: 0.85 | BRAF: 0.75 | p53: 0.66 | MTHFR: 0.63 | AR: 0.60
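As a quick arithmetic check, the six per-gene values average to 0.735, which rounds to the reported 0.74. The sketch below also shows one standard, rank-based way to compute a within-gene AUC. The name `rank_auc` and the toy scores are illustrative and are not taken from the package.

```python
from statistics import mean

def rank_auc(scores, labels):
    """AUC as the probability that a pathogenic variant (label 1)
    outscores a benign one (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Per-gene AUCs exactly as reported above.
per_gene_auc = {
    "EGFR": 0.92, "RET": 0.85, "BRAF": 0.75,
    "p53": 0.66, "MTHFR": 0.63, "AR": 0.60,
}
mean_auc = mean(per_gene_auc.values())
print(f"mean within-gene AUC: {mean_auc:.3f}")

# Toy example (made-up scores): three of the four pathogenic/benign
# pairs are ranked correctly.
toy_auc = rank_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(f"toy AUC: {toy_auc:.2f}")
```

Averaging per-gene AUCs, rather than pooling all variants into one AUC, keeps a gene with many variants from dominating the headline number.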

Novel feature:
Fiedler damage: 0.82 AUC as single feature under LOGO cross-validation

Speed:
Online: 30 variants/sec | Precomputed: 216,211 variants/sec

Previous claims (94.7%, 83.4%) were inflated by gene-level confounders and in-sample weight optimization, discovered and corrected in Session 23. The current 0.74 is validated under strict leave-one-gene-out with no learned weights.

WHAT IT CATCHES

Variant | Disease | Score | Mechanism
KRAS G12D | Lung/pancreatic cancer | 0.545 | GTPase P-loop disruption
p53 R175H | Cancer (#1 hotspot) | 0.254 | Metal site + charge loss
HBB E6V | Sickle cell | 0.570 | Surface hydrophobic patch
SOD1 A4V | ALS | 0.180 | Buried packing change

Benign variants correctly identified:

p53 P72R (AF=0.72): 0.000 — gnomAD filter
BRCA1 P871L (AF=0.36): 0.000 — gnomAD filter
HBB E6D (conservative): 0.000 — same charge, same size

VS THE FIELD

Tool | AUC | Training data
SIFT (2001) | 0.69–0.74 | Conservation
PolyPhen-2 (2010) | 0.75–0.81 | Conservation + structure
CADD v1.6 (2019) | 0.82–0.87 | Genome-wide meta-predictor
REVEL (2016) | 0.90–0.94 | ClinGen-calibrated ensemble
AlphaMissense (2023) | 0.94–0.96 | AlphaFold + 100M sequences
GUMP (2026) | 0.74 | Fiedler damage + MSA + physics

GoF/LoF REGIME DETECTOR

The two-ruler problem: a universal scorer that treats all genes the same will score GoF genes backwards. The detector identifies which ruler to use before scoring.

Loss-of-Function (LoF):
  Pathogenic mutations break the structural core.
  Damage correlates with pathogenicity.
  Examples: TP53, BRCA1, ATM, FBN1

Gain-of-Function (GoF):
  Pathogenic mutations hijack functional control sites.
  High-coupling, LOW-damage sites = pathogenic.
  Examples: BRAF, EGFR, AR, PIK3CA

Gene | Detected | Expected | Conf | K-Dmg Corr | n
TP53 | LoF | LoF | 0.48 | +0.056 | 1,331
ATM | LoF | LoF | 1.00 | +0.165 | 32
FBN1 | LoF | LoF | 1.00 | +0.011 | 34
EGFR | GoF | GoF | 1.00 | -0.392 | 23
AR | GoF | GoF | 0.81 | +0.053 | 57
PIK3CA | GoF | GoF | 1.00 | -0.913 | 6
BRAF | GoF | GoF | 0.29 | +0.046 | 8

Accuracy: 8/10 correct on known-regime genes

BRCA1 misclassified: LoF through surface disruption looks like GoF to a structural-only detector (conf 0.40). Genuine limit.

THE SIGNALS

Verdict (2 signals):
  1. K — contact degree at position (size-normalized)
  2. Conservation — BLOSUM62 ortholog alignment (61-157 species)
  Filter: gnomAD population frequency (AF ≥ 1% → benign)
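The population-frequency filter is the simplest piece to illustrate. Below is a minimal sketch, assuming the filter overrides the structural score with 0.000 whenever the allele frequency reaches 1%, which is consistent with the P72R and P871L entries listed earlier. The function name, the handling of missing frequencies, and the raw score used for the common variant are assumptions for illustration.

```python
AF_BENIGN_THRESHOLD = 0.01  # AF >= 1% is treated as benign, as stated above

def apply_gnomad_filter(raw_score, allele_frequency):
    """Return 0.0 for common variants; otherwise pass the raw score through.

    `allele_frequency` may be None when the variant is absent from the
    population index (an assumed convention, not the package's).
    """
    if allele_frequency is not None and allele_frequency >= AF_BENIGN_THRESHOLD:
        return 0.0
    return raw_score

# A common variant (AF = 0.72) is zeroed regardless of its raw score.
# The raw score 0.31 is made up for this example.
common = apply_gnomad_filter(0.31, 0.72)

# A variant with no recorded frequency keeps its score.
rare = apply_gnomad_filter(0.545, None)

print(common, rare)
```

Applying the filter before any structural reasoning means a common variant can never be called pathogenic, whatever its contact degree or conservation.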

GoF/LoF detection (3 signals):
  1. K-damage correlation (strongest, weight 3x)
  2. Control knob fraction (high-K, low-damage pathogenic sites)
  3. Damage ratio (pathogenic vs random position damage)
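The first signal can be sketched as a plain Pearson correlation between contact degree K and damage across a gene's pathogenic positions, with the sign read as one vote. This covers a single signal only. How the scanner weights and combines all three is not shown here, and by sign alone AR and BRAF (both slightly positive) would be read as LoF, so the other signals clearly matter. The function names and toy values are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # no variation, so no usable signal
    return cov / sqrt(vx * vy)

def correlation_vote(k_values, damage_values):
    """Vote from the K-damage correlation alone (assumed reading):
    negative means high-contact sites do little damage, so GoF;
    otherwise LoF."""
    r = pearson(k_values, damage_values)
    return ("GoF" if r < 0 else "LoF"), r

# Toy pathogenic sites: high K paired with low damage (control knobs).
gof_vote, gof_r = correlation_vote([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.5])

# Toy pathogenic sites: damage rises with K (load-bearing core).
lof_vote, lof_r = correlation_vote([0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.4, 0.9])

print(gof_vote, round(gof_r, 3), lof_vote, round(lof_r, 3))
```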

REPRODUCIBLE

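pip install begump

from gump.foldwatch import profile_mutation, detect_regime

# Score a mutation.
# KRAS is not defined in this snippet; it is assumed here to be a
# variable holding the KRAS protein sequence as a string.
r = profile_mutation(KRAS, 12, 'G', 'D', 'KRAS')
print(r['verdict'])  # PATHOGENIC
print(r['T_fold'])   # 0.545

# Detect GoF vs LoF regime.
# `variants` and `sequence` are not defined in this snippet and must be
# supplied by the caller (test variants ship with the package).
result = detect_regime('TP53', variants, sequence=sequence)
print(result['regime'])  # 'LoF' or 'GoF'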

HONEST LIMITS

Irreducible from sequence alone:
  Gain-of-function detail | DNA-contact mutations | Tetramer destabilization | Epistasis

GoF/LoF limits:
  BRCA1 misclassified | Small variant counts (BRAF: 8, PIK3CA: 6)
  Mixed-regime genes (RET: both MEN2A and Hirschsprung)

What IS solid:
   Fiedler damage: 0.82 AUC, single feature, zero training
   GoF/LoF: 8/10 correct, pure physics + structure
   Regime detection explains the AUC jump from 0.587 to 0.74

Jim McCandless, beGump LLC. All computation on Mac Mini M4, 16GB, 35W. No cloud. Test variants, ortholog data, gnomAD index, and validation script included with the package.
