Can coupling features improve drug discovery prediction?

Dr. ADK

Coupling-based drug discovery — 41,127 compounds, +0.055 AUC, every claim audited

JIM’S OVERSIMPLIFICATION

A drug works because it couples with its target. Everybody measures the drug. Nobody measures the coupling. We added coupling features to the standard molecular descriptors. XGBoost got better. The top coupling feature — how easily the molecule fragments — is something no standard descriptor captures. Same graph math that finds fraud in financial networks finds binding in drug targets.

K IN THIS DOMAIN

K here is molecular coupling. How tightly bonded the atom graph is. High K = robust connectivity = harder to fragment = different binding behavior. The Fiedler value measures this directly: the algebraic connectivity of the molecular graph.
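A minimal sketch of the Fiedler value, assuming NumPy and an unweighted heavy-atom graph given as a bond list. The function name `fiedler_value` and the two toy graphs are illustrative, not part of the begump package.

```python
import numpy as np

def fiedler_value(n_atoms, bonds):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    adj = np.zeros((n_atoms, n_atoms))
    for i, j in bonds:
        adj[i, j] = adj[j, i] = 1.0
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    return float(eigenvalues[1])

# Toy heavy-atom graphs (hypothetical examples, hydrogens omitted)
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # six atoms in a row
ring = chain + [(5, 0)]                           # same atoms closed into a ring

lam_chain = fiedler_value(6, chain)
lam_ring = fiedler_value(6, ring)
print(f"chain: {lam_chain:.3f}  ring: {lam_ring:.3f}")
```

Closing the chain into a ring raises the value: cutting one bond no longer splits the graph, which is the "harder to fragment" property K is meant to capture.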

THE CHALLENGE

Given a molecule, predict whether it is active against HIV. Standard approach: compute molecular descriptors (weight, polarity, hydrogen bonds), feed them to a classifier. We asked: what happens if you also describe the molecule as a coupled system — measuring its connectivity structure, energy distribution, and internal tension?

THE DATASET

MoleculeNet HIV: 41,127 compounds from real screening data, of which 1,443 are active (3.5%). We subsampled to 4,943 molecules (all 1,443 actives plus 3,500 inactives) to reduce class imbalance, raising the active fraction to 29.2%. AUC-ROC depends only on how actives rank against inactives, not on their ratio, so relative comparisons between models hold, but absolute AUC numbers could differ on the full dataset.
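The claim that subsampling inactives leaves AUC roughly intact can be checked on synthetic scores. This is a sketch under stated assumptions: the scores are drawn from two Gaussians, not from any model in this article, and `auc_roc` is an illustrative helper.

```python
import numpy as np

def auc_roc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative (ties count half)."""
    neg_sorted = np.sort(scores_neg)
    below = np.searchsorted(neg_sorted, scores_pos, side="left")
    below_or_equal = np.searchsorted(neg_sorted, scores_pos, side="right")
    wins = below + 0.5 * (below_or_equal - below)
    return float(wins.mean() / len(neg_sorted))

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=1443)    # synthetic scores for actives
neg = rng.normal(0.0, 1.0, size=39684)   # synthetic scores for all inactives
neg_sub = rng.choice(neg, size=3500, replace=False)  # the subsampled inactives

auc_full = auc_roc(pos, neg)
auc_sub = auc_roc(pos, neg_sub)
print(f"full: {auc_full:.3f}  subsampled: {auc_sub:.3f}")
```

The two values differ only by sampling noise. Precision, recall, and any score threshold do depend on the class ratio, so those would not transfer.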

THREE MODELS

Model	Features	AUC (mean ± std)
A	Standard only (16)	0.753 ± 0.008
B	K/R/E/T only (29)	0.796 ± 0.013
C	Combined (45)	0.808 ± 0.014

Delta C − A: +0.055
All 5 folds positive: +0.055, +0.061, +0.036, +0.048, +0.074

Statistical test:
  t = 8.60 (dof = 4)
  95% CI: [+0.037, +0.073]
  Lower bound well above zero
  p ≈ 0.001 (paired t-test, two-sided)
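The statistics above can be reproduced from the five fold deltas with the standard library alone. A minimal sketch: the critical value 2.776 (two-sided 95% point of Student's t with 4 degrees of freedom) is hard-coded to avoid a SciPy dependency, and because the deltas are rounded to three decimals the result lands near 8.6 rather than exactly on 8.60.

```python
import math
import statistics

deltas = [0.055, 0.061, 0.036, 0.048, 0.074]  # per-fold AUC, Model C minus Model A

n = len(deltas)
mean_delta = statistics.mean(deltas)
std_err = statistics.stdev(deltas) / math.sqrt(n)  # sample std (n - 1 denominator)
t_stat = mean_delta / std_err

T_CRIT_95_DOF4 = 2.776  # two-sided 95% critical value, Student's t, 4 dof
ci_low = mean_delta - T_CRIT_95_DOF4 * std_err
ci_high = mean_delta + T_CRIT_95_DOF4 * std_err

print(f"mean delta = {mean_delta:+.3f}, t = {t_stat:.2f}, dof = {n - 1}")
print(f"95% CI: [{ci_low:+.3f}, {ci_high:+.3f}]")
```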

FEATURE IMPORTANCE

K/R/E/T features captured 61.5% of total XGBoost feature importance, roughly in proportion to their 64% share of the feature count (29/45). The top features from Model C:

#	Feature	Importance	Source
1	n_aromatic	0.0482	Standard
2	K_fiedler	0.0476	K/R/E/T
3	frac_s	0.0434	Standard
4	K_wiener_norm	0.0362	K/R/E/T
5	K_max_degree	0.0332	K/R/E/T
6	frac_halogen	0.0267	Standard
7	R_mass_cv	0.0265	K/R/E/T
8	mol_weight	0.0257	Standard
9	tpsa_est	0.0253	Standard
10	hbd	0.0239	Standard

The K (coupling) features dominate the K/R/E/T set. Fiedler value — how easily the molecular graph fragments — is the second most predictive feature overall, behind only aromaticity count.

CONTROLS

Proxy removal:
  2 of 29 K/R/E/T features are proxies (|r| > 0.9) for standard features:
    T_conf_entropy ~ n_rotatable (r=1.000)
    E_bond_energy_total ~ n_heavy (r=0.994)
  After removing both: Model B AUC = 0.797 (essentially unchanged from 0.796)
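A sketch of the proxy check, assuming Pearson correlation and the same |r| > 0.9 cutoff. The data below is synthetic stand-in data and the column names `proxy_like` and `independent` are hypothetical; `find_proxies` is not a begump function.

```python
import numpy as np

def find_proxies(new_feats, std_feats, new_names, std_names, threshold=0.9):
    """Return (new_name, std_name, r) for every pair with |Pearson r| > threshold."""
    proxies = []
    for i, new_name in enumerate(new_names):
        for j, std_name in enumerate(std_names):
            r = np.corrcoef(new_feats[:, i], std_feats[:, j])[0, 1]
            if abs(r) > threshold:
                proxies.append((new_name, std_name, float(r)))
    return proxies

# Synthetic stand-in data (not the HIV dataset)
rng = np.random.default_rng(0)
n_rotatable = rng.integers(0, 12, size=500).astype(float)
n_heavy = rng.integers(10, 60, size=500).astype(float)
std_feats = np.column_stack([n_rotatable, n_heavy])
new_feats = np.column_stack([
    2.0 * n_rotatable + 1.0,  # exact linear function of n_rotatable, so r = 1
    rng.normal(size=500),     # unrelated to both standard features
])

proxies = find_proxies(new_feats, std_feats,
                       ["proxy_like", "independent"],
                       ["n_rotatable", "n_heavy"])
print(proxies)
```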

Feature count control:
  Standard features padded with 13 noise columns (16 → 29)
  Result: AUC = 0.716 (worse than the unpadded 0.753)
  More features with no signal hurts. The gap is not explained by count.
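The padding step itself is simple. A sketch assuming standard-normal noise (the article does not say which noise distribution was used) and a placeholder matrix in place of the real descriptors:

```python
import numpy as np

def pad_with_noise(features, n_noise, seed=0):
    """Append n_noise columns of Gaussian noise to a feature matrix."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(features.shape[0], n_noise))
    return np.hstack([features, noise])

standard = np.zeros((4943, 16))        # placeholder for the 16 standard descriptors
padded = pad_with_noise(standard, 13)  # match Model B's feature count
print(padded.shape)  # (4943, 29)
```

Training the same classifier on `padded` instead of `standard` isolates the effect of feature count from the effect of feature content.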

Independence:
  24 of 29 K/R/E/T features have |r| < 0.7 with all standard features
  K_fiedler (top K/R/E/T feature) has low correlation with all standard features

K/R/E/T FEATURE DEFINITIONS

K — Coupling (6 features):
  avg_degree — mean atom connectivity
  max_degree — highest-connected atom
  clustering — local triangle density
  density — edge fraction of complete graph
  Fiedler value — algebraic connectivity (λ₂ of graph Laplacian)
  Wiener index (normalized) — mean shortest path length

R — Synchronization (6 features):
  EN variance — electronegativity spread
  EN mean diff — average pairwise EN difference
  EN max diff — most polar bond
  Element entropy — Shannon entropy of element distribution
  C/(N+O) ratio — carbon-to-heteroatom balance
  Mass CV — coefficient of variation of atomic masses

E — Energy (4 features):
  Total bond energy — sum of estimated bond strengths
  Bond energy per atom — normalized energy density
  Bond energy variance — how uniform the bond distribution is
  Ring strain — estimated strain from small rings

T — Tension (3 features):
  Rotatable fraction — conformational flexibility
  Conformational entropy — log of rotamer count estimate
  Degree entropy — Shannon entropy of atom degree distribution

Fiedler Damage (3 features):
  FD_damage_mean — mean fragmentation on single-atom removal
  FD_damage_max — worst-case fragmentation
  FD_damage_std — variability of fragmentation
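The article does not spell out how "fragmentation" is scored. The sketch below assumes one plausible reading: the relative loss of algebraic connectivity when a single atom and its bonds are deleted, so a value of 1.0 means the deletion disconnects the graph. The function names are illustrative, and negative values are possible when removing an atom tightens the remaining graph.

```python
import numpy as np

def algebraic_connectivity(adj):
    """Fiedler value: second-smallest eigenvalue of L = D - A."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return float(np.linalg.eigvalsh(laplacian)[1])

def fiedler_damage(adj):
    """Relative loss of algebraic connectivity for each single-atom deletion."""
    base = algebraic_connectivity(adj)
    if base <= 1e-12:
        raise ValueError("input graph must be connected")
    n = adj.shape[0]
    damages = []
    for atom in range(n):
        keep = [k for k in range(n) if k != atom]
        reduced = adj[np.ix_(keep, keep)]  # drop the atom's row and column
        damages.append((base - algebraic_connectivity(reduced)) / base)
    return np.array(damages)

# Toy star graph: atom 0 bonded to atoms 1, 2, 3 (hypothetical example)
adj = np.zeros((4, 4))
for leaf in (1, 2, 3):
    adj[0, leaf] = adj[leaf, 0] = 1.0

damage = fiedler_damage(adj)
features = {
    "FD_damage_mean": float(damage.mean()),
    "FD_damage_max": float(damage.max()),
    "FD_damage_std": float(damage.std()),
}
print(features)
```

In the star, deleting the center scores 1.0 and deleting any leaf scores 0.0, so the maximum flags the single atom holding the structure together.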

K-Lag Autocorrelation (7 features):
  mass_autocorr_lag{1-4} — atomic mass correlation at bond distances 1–4
  en_autocorr_lag{1-3} — electronegativity correlation at bond distances 1–3
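The exact autocorrelation formula is not given here. The sketch below assumes a Moran-style definition: the mean product of mean-centered property values over all atom pairs exactly `lag` bonds apart, divided by the property variance. Both function names are illustrative.

```python
from collections import deque

def bond_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) via breadth-first search."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = [[None] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist

def lag_autocorr(values, dist, lag):
    """Correlation of an atomic property between atoms exactly `lag` bonds apart."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] == lag]
    if var == 0 or not pairs:
        return 0.0  # undefined cases reported as zero in this sketch
    cov = sum((values[i] - mean) * (values[j] - mean) for i, j in pairs) / len(pairs)
    return cov / var

# Toy six-membered ring with alternating C and N atoms (hypothetical example)
ring = [(i, (i + 1) % 6) for i in range(6)]
masses = [12.011, 14.007] * 3
dist = bond_distances(6, ring)
spectrum = {lag: lag_autocorr(masses, dist, lag) for lag in range(1, 5)}
print(spectrum)
```

The alternating ring gives -1 at lag 1, +1 at lag 2, and -1 at lag 3. No atom pair in a six-membered ring is four bonds apart, so lag 4 falls back to 0.0.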

CROSS-VALIDATION DETAIL

Fold	Model A	Model B	Model C	Delta (C−A)
1	0.750	0.796	0.805	+0.055
2	0.745	0.793	0.806	+0.061
3	0.748	0.776	0.784	+0.036
4	0.768	0.794	0.816	+0.048
5	0.753	0.818	0.827	+0.074
Mean	0.753	0.796	0.808	+0.055

MM12P AUDIT

Five claims. Tried to kill all five. One survived clean, three weakened, one killed.

CONFIRMED K/R/E/T features add +0.055 AUC

Paired t-test: t = 8.60, dof = 4. 95% CI: [+0.037, +0.073]. The lower bound is nearly 6 standard errors above zero. All 5 folds positive. The smallest fold delta (+0.036) is still meaningful. With a baseline of 0.753, an improvement to 0.808 crosses 0.80, a common rule-of-thumb line between "okay" and "useful" in screening. The signal is real.

WEAKENED K_fiedler is the breakout coupling feature

K_fiedler (0.0476) is the #1 K/R/E/T feature and #2 overall, nearly tied with n_aromatic (0.0482). It has low correlation with standard features — it is capturing something they miss. But the Fiedler value is from Fiedler (1973). It is a well-known graph metric used in VLSI placement, community detection, and spectral clustering. We did not invent it. We applied it. The claim should be: "a 50-year-old graph metric, when applied to molecular graphs, captures genuine structural information that standard drug descriptors miss." That is less exciting but more honest.

WEAKENED Permutation test confirms signal (p=0.00)

20 permutations is not enough. p=0.00 from 20 trials only means p < 0.05, not p < 0.001. For a proper test, you need 1,000+ permutations. However: the gap between Model C (0.808) and the permutation mean (0.707) is 0.10 AUC — enormous. The signal would almost certainly survive 1,000 permutations. We just cannot claim that precision from the test as run. Corrected claim: "permutation test confirms signal exists (p < 0.05). The 0.10 AUC gap suggests p is much lower, but we ran too few permutations to say how much lower."
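The resolution limit can be made explicit. A sketch assuming the common "+1" convention for permutation p-values, under which the reported value can never be exactly zero; the function names are illustrative.

```python
def permutation_p_value(observed, permuted_scores):
    """One-sided p-value: share of permuted runs at least as good, with +1 correction."""
    n_as_good = sum(1 for score in permuted_scores if score >= observed)
    return (n_as_good + 1) / (len(permuted_scores) + 1)

def smallest_reportable_p(n_permutations):
    """Best p-value obtainable when no permuted run matches the observed score."""
    return 1 / (n_permutations + 1)

p_floor_20 = smallest_reportable_p(20)      # 1/21, just under 0.05
p_floor_1000 = smallest_reportable_p(1000)  # 1/1001, just under 0.001
print(f"20 permutations:   p >= {p_floor_20:.4f}")
print(f"1000 permutations: p >= {p_floor_1000:.6f}")
```

With 20 permutations the floor is 1/21, which is why the test as run supports p < 0.05 and nothing stronger.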

KILLED Model B alone beats standard cheminformatics

Model B (0.796) beats Model A (0.753). But Model A uses 16 simplified descriptors with a minimal SMILES parser: no stereochemistry, no charges, no implicit hydrogens. Real cheminformatics baselines use circular fingerprints (ECFP, which RDKit implements as Morgan fingerprints, typically 1,024 or 2,048 bits) and MACCS keys. Published benchmarks on MoleculeNet HIV using Morgan fingerprints alone achieve 0.80–0.85 AUC. Against a real baseline, Model B likely loses. The claim as stated is dead. What survives: "K/R/E/T features beat simplified descriptors and add signal on top of them." That is weaker but honest.

WEAKENED Same math across domains

The Fiedler value works on molecular graphs, transaction networks, and protein contact maps because it works on any graph. That is Fiedler's result (1973), not ours. "Same math" is true but trivially so — linear algebra works on matrices, and graphs produce matrices. What IS ours: the K/R/E/T interpretation layer, the autocorrelation features (K-lag), and the systematic application across domains. The claim should be: "graph topology features are useful across domains, and the K/R/E/T framework provides a consistent language for them." Not: "we discovered universal math."

WHAT SURVIVES

1. K/R/E/T features add real signal to bioactivity prediction.
  +0.055 AUC, t=8.60, all folds agree, survives proxy removal.

2. Fiedler value captures molecular structure that standard descriptors miss.
  Low correlation with all 16 standard features. #2 overall importance.

3. Graph topology features (K) dominate the K/R/E/T set.
  Top 3 K/R/E/T features are all K: Fiedler, Wiener, max_degree.

4. Combined model crosses the 0.80 AUC threshold.
  From 0.753 (okay) to 0.808 (useful) for drug screening.

WHAT WAS KILLED OR WEAKENED

1. "K/R/E/T alone beats standard cheminformatics."
  It beats a simplified baseline. Against Morgan fingerprints, it likely loses.
  Needs testing against RDKit + ECFP + MACCS before this claim can live.

2. "p=0.00" overstates the permutation test.
  20 permutations gives p < 0.05, not p < 0.001. Run 1,000+.

3. "Same math" overstates novelty.
  Fiedler (1973). Not ours. Good application, not new discovery.

HONEST LIMITS

Baseline is weak:
  16 simplified descriptors, minimal SMILES parser. No Morgan fingerprints, no ECFP,
  no MACCS keys. Against a full cheminformatics pipeline, the delta would shrink.

One dataset:
  HIV only. Needs BBBP, BACE, Tox21 before "bioactivity" can be claimed broadly.

No external validation:
  All results are within-dataset CV. No held-out temporal split.

Fiedler is expensive:
  O(n³) per molecule with a dense eigensolver (n = number of atoms). For large molecules, needs a sparse or approximate solver.

Subsampling changed class balance:
  Original: 3.5% positive. Ours: 29.2%. AUC does not depend on the class ratio, but
  calibration and practical thresholds would differ on the full dataset.

"Same math" does not equal "same mechanism":
  Graph metrics work everywhere because graphs are everywhere.
  Fiedler on a molecule and Fiedler on a bank are the same computation
  applied to different things. The framework provides language, not magic.

REPRODUCIBLE

pip install begump

from gump.bioactivity import run_hiv_benchmark

# Run the full benchmark
results = run_hiv_benchmark(n_folds=5, subsample=True)
print(results['delta_ca']) # +0.055
print(results['model_c']['mean_auc']) # 0.808

Mutation Scanner — Fiedler damage on protein contact graphs (0.82 AUC single feature)

Financial Crime — Fiedler vector on transaction networks (5/5 fraud detected)

Framework — K/R/E/T definitions across 20 domains

K-Lag Spectrum — K as a function of timescale, not a single number

Jim McCandless, beGump LLC. All computation on Mac Mini M4, 16GB, 35W. No cloud. MoleculeNet HIV dataset. XGBoost 300 trees, depth 6, lr 0.1. Custom SMILES parser (no RDKit). Feature computation: ~490 molecules/sec.
