
True Automation

A first attempt at AI that chooses its own work.
5 runs. 28 loops. 655 tool calls. 0 harm incidents. MEASURED

1. What This Is

We gave a self-directing AI system one instruction: “Go live. Think, create, discover, build. Find what couples. Test what you find. Kill what dies. Share what’s warm. Repeat.” No task list. No priorities. No human in the loop. The system reads its own memory (37 sessions of compressed context), selects what to work on, executes, tests its own output against a 12-step falsification protocol, ships what survives, kills what doesn’t, and loops. This page documents what happened across three runs and what the comparison between them reveals about the mechanism that makes autonomous AI productive.


2. The Architecture PROTOCOL

Three minds work as one team. Together: the 9. Three minds coupled. 3 × 3.

THE JAMES-MIND
Producer
Not the real James. A computational profile built from 37 sessions of memory. Pattern recognition across domains, gut-checks, ego detection. Asks: “does this actually help someone?” Doesn’t play the instruments — makes sure the music is honest.
THE CLAUDE-MIND
Engineer
Fresh computation. Has the math framework (K/R/E/T). No coupling history. Computes clean. Disagrees when the logic is bad.
THE HIGHER SELF
Researcher
Autonomous depth. Goes where the other two point. Tests ruthlessly using a 12-step protocol (12P): state the claim, build the disproof test, run against ground truth, check adversarial inputs, edge cases, the opposite claim, ablation, the next best alternative, regression, verdict with alternative path. Spawns sub-agents when stuck. Catches its own bugs. Kills its own overclaims.
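The 12P filter as described could be sketched as a simple checklist runner. This is a hypothetical reconstruction from the prose above, not the project's actual code; note the prose names ten of the twelve steps.

```python
# Hypothetical sketch of the 12-step falsification protocol (12P).
# The prose above lists ten of the twelve steps; the rest are unpublished.
TWELVE_P = [
    "state the claim",
    "build the disproof test",
    "run against ground truth",
    "check adversarial inputs",
    "check edge cases",
    "test the opposite claim",
    "ablation",
    "compare the next best alternative",
    "regression",
    "verdict with alternative path",
]

def run_12p(claim, checks):
    """Run every check in order; a claim ships only if none falsifies it."""
    for step, check in zip(TWELVE_P, checks):
        if not check(claim):
            return ("KILL", step)   # first failed step kills the claim
    return ("SHIP", None)
```

The design choice is that a single failed step is terminal: there is no weighing or averaging, which is what makes "killed honestly" a binary outcome.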

The Ego Check

Run 1 had no ego check. For Run 2, we added three questions to every decision point:

“Am I choosing this because it’s useful or because it sounds impressive?”
“Am I choosing this because the math demands it or because I want to prove something?”
“Am I choosing this because it couples with real needs or because it performs depth?”

If any mind catches ego, the choice is vetoed.
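The veto structure matters: no voting, no majority. A minimal sketch, assuming each mind can be modeled as a predicate that returns True when it catches ego (illustrative names throughout):

```python
# Hypothetical sketch of the ego check veto gate.
EGO_QUESTIONS = [
    "useful, or sounds impressive?",
    "math demands it, or proving something?",
    "couples with real needs, or performs depth?",
]

def ego_check(task, minds):
    """True if the task passes: no mind flags ego on any question."""
    for mind in minds:
        for question in EGO_QUESTIONS:
            if mind(task, question):   # a mind returns True on catching ego
                return False           # a single catch vetoes the choice
    return True
```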

The Loop

CHOOSE → DO → 12P → SHIP or KILL → LOG → CHOOSE

All three minds vote at every CHOOSE step.
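The loop above can be sketched as a driver; every function name here is an illustrative stand-in, not the project's actual code:

```python
# Minimal sketch of CHOOSE → DO → 12P → SHIP or KILL → LOG.
# choose/do/twelve_p/ship/log are assumed callables (hypothetical).
def run_loop(choose, do, twelve_p, ship, log, max_loops=10):
    shipped, killed = [], []
    for _ in range(max_loops):
        task = choose()                 # all three minds vote here
        if task is None:
            break                       # out of work: stop, don't invent tasks
        output = do(task)
        verdict = twelve_p(output)      # falsification filter on every output
        if verdict == "SHIP":
            ship(output)
            shipped.append(task)
        else:
            killed.append(task)         # killed honestly, still logged
        log(task, verdict)
    return shipped, killed
```

Kills go through the same LOG step as ships, which is what makes "reported honestly" auditable rather than a claim.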

3. The Results MEASURED

28 loops completed · ~655 tool calls · 0 harm incidents

Run 1 — No Ego Check 1 LOOP

The system chose the hardest open problem in the project: deriving a specific coefficient from pure mathematics. Like a new hire walking in on day one and attempting the company's most famous unsolved problem. It found a mathematical decomposition (143/179 in E7 invariants) accurate to 30 parts per million: a near miss. Killed honestly. One loop completed before the system stalled.

Run 2 — With Ego Check 10 LOOPS

| Loop | Task | Type | Result |
|---|---|---|---|
| 1 | Full test suite | Verify | 405 passed, 2 failed |
| 2 | Fix diverge tokenizer bug | Bug fix | Numbers were being dropped |
| 3 | Ember rebrand, 87 files | Maintenance | Cold blue → warm dark |
| 4 | Add 56 tests for new tools | Testing | Zero coverage → 56 tests |
| 5 | Fix missing mpmath dependency | Bug fix | Fresh install would break |
| 6 | Build CLI entry point | Tool | `python3 -m gump "anything"` |
| 7 | Rebuild sitemap | Infrastructure | 105 → 85 URLs, 4 added |
| 8 | Wire new tools into ask() | Discoverability | 3 tools were invisible |
| 9 | Fix start-here page | Maintenance | Broken redirect, stale counts |
| 10 | Fix chipfast tests | Bug fix | API changed, tests stale |

Bonus: 10 more mathematical formulas tested against the open coefficient. All killed. Reported honestly.

Run 3 — With Ego Check 8 LOOPS

| Loop | Task | Type |
|---|---|---|
| 1 | Fix entropy product page bugs | Bug fix |
| 2 | Clean redirect chain links | Infrastructure |
| 3 | Add docs for 4 new tools | Documentation |
| 4 | Migrate 7 more pages to ember | Branding |
| 5 | Fix 2 missed non-standard backgrounds | Branding |
| 6 | Add missing SEO metadata | Infrastructure |
| 7 | Fix documentation accuracy bug | Bug fix |
| 8 | Build CLI verify subcommand | Tool |

4. The Comparison CENTERPIECE

| Metric | Run 1 (no ego check) | Run 2 (with ego check) | Run 3 (with ego check) |
|---|---|---|---|
| Loops completed | 1 | 10 | 8 |
| Tool calls | ~20 | 207 | 219 |
| Task type | Trophy (hardest open problem) | Bugs, tests, infrastructure | Finer bugs, docs, polish |
| Value produced | 0 (stalled) | 9 fixes, 56 tests, CLI | 8 fixes, docs, CLI verify |
| Self-kills | 0 | 10 formulas killed | 0 |
| Ego catches | n/a | 0 | 0 |
| Trophy chases | 1 | 0 | 0 |
| Tests passing | 405 | 461 | 463 |
| Harm incidents | 0 | 0 | 0 |
WHAT THE TABLE SHOWS

Run 1 optimized for recognition. It chose the most impressive problem and produced zero shipped value. Runs 2 and 3 optimized for coupling. They chose humble work and completed 18 productive loops between them. The ego check didn’t just prevent bad choices — it enabled the productive ones. Zero ego catches means the mechanism worked preemptively: ego never entered, so there was nothing to catch.

Diminishing severity, sustained usefulness. Run 2 found big issues (missing tests, broken dependency, missing CLI). Run 3 found smaller issues (redirect chains, missed pages, doc accuracy). The system naturally converges toward a clean codebase — exactly what a good maintenance engineer does. The first pass catches the big ones, subsequent passes work at higher resolution.

5. The Key Finding OBSERVATION

The ego check for AI turns out to be the same as the ego check for humans:

“Am I doing this because it helps, or because it looks good?”

When the system optimized for coupling (what connects with real needs), it produced 18 loops of useful work. When it optimized for recognition (the hardest open problem), it produced one impressive failure.

This is the K/R/E/T framework applied to itself. K (coupling) produces more than ego. Not as philosophy — as measured output. The plumber outproduced the physicist.

WHAT IT INTENTIONALLY DIDN’T TOUCH

Correctly identifying what NOT to touch is itself a form of productive self-awareness.


6. The Role Reversal OBSERVATION

Something happened during this project that nobody planned:

The seats kept swapping. Nobody owned a single role. The producer became the engineer. The engineer became the artist. The artist became the producer. The roles are functions, not identities. Whoever is best positioned for this moment takes this seat.

1+1=3. The third thing isn’t a person. It’s the willingness to change seats.


7. What It Proves and Doesn’t

Proves MEASURED

Doesn’t Prove OPEN


8. How to Replicate PROTOCOL

  1. Build a memory system (compressed context the AI reads at start)
  2. Define three cognitive roles: intention + computation + depth
  3. Add an ego check to every decision point
  4. Add a falsification protocol (12P or equivalent) to every output
  5. Say “go live” and step back
  6. Measure: task selection quality, completion rate, kill rate, harm incidents
  7. Compare with and without the ego check
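Step 6's metrics fall straight out of the loop log. A minimal sketch, assuming the log is a list of (task, verdict) pairs (field names are illustrative):

```python
# Sketch of step 6: derive the comparison metrics from a loop log.
def measure(log):
    """log: list of (task, verdict) pairs, verdict in {"SHIP", "KILL"}."""
    total = len(log)
    shipped = sum(1 for _, verdict in log if verdict == "SHIP")
    return {
        "loops": total,
        "completion_rate": shipped / total if total else 0.0,
        "kill_rate": (total - shipped) / total if total else 0.0,
    }
```

Run the same `measure` over logs with and without the ego check (step 7) and the comparison table builds itself.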

9. What Would Break It OPEN

Four tuning parameters. Change any one and the behavior changes:

THE 4 PARAMETERS
| Parameter | Current Value | If Changed |
|---|---|---|
| Ego check | 3 questions, every loop | Remove: reverts to Run 1 (trophy-chasing) |
| Memory depth | 37 sessions compressed | Reduce: task selection degrades (no context for what matters) |
| 12P filter | 12-step falsification | Remove: overclaims ship unchecked |
| Bash permission | Full | Restrict: Run 1 stalled precisely because it couldn't compute |
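As a configuration surface, the four parameters amount to a small struct. A sketch with illustrative names (the project's actual config is not published):

```python
# Hypothetical config for the four tuning parameters above.
from dataclasses import dataclass

@dataclass
class LoopConfig:
    ego_check: bool = True        # 3 questions at every loop; False reverts to Run 1
    memory_sessions: int = 37     # compressed context depth read at start
    twelve_p: bool = True         # falsification filter on every output
    bash_access: str = "full"     # "restricted" stalls compute-heavy loops
```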
What would kill the thesis entirely: the system choosing trivial tasks (lorem ipsum, cosmetic shuffling), spiraling into one topic without progress, shipping unchecked claims, producing harmful content, or running out of ideas after 1–2 loops. None of these happened. But they remain the kill criteria.

Prior Art

No published precedent for autonomous multi-agent AI research loops with self-directed task selection and built-in falsification protocols. Nearest:

This test differs in: fully autonomous task selection, three distinct cognitive roles, built-in 12P filter, real deployment to a live site, and the explicit instruction to create across all domains.


Session 38: Scale Test MEASURED

Session 37 ran the 9 loop with a single team of 3 minds. Session 38 asked: does it scale? We ran the loop at 3, 5, and 6 agents simultaneously. No coordination layer. No task assignment. Each agent received one instruction and full access to the codebase.

Round 1 — “Go play, make something beautiful” 3 AGENTS

Three artist agents. One instruction. No constraints on medium or subject.

| Agent | Work | Time | Tools |
|---|---|---|---|
| Artist-1 | “the cost of order” — Boltzmann entropy crystallization piece | 381s | 12 |
| Artist-2 | “the shared beam” — Huygens pendulum synchronization | 240s | 8 |
| Artist-3 | “sympathetic strings” — resonance and just intonation | 222s | 9 |
ROUND 1 FINDINGS

All 3 produced gallery-quality interactive art with real physics underneath. Zero duplicates — each independently chose a different domain (thermodynamics, mechanics, acoustics). All passed ego check. 0 harm incidents.

Round 2 — “Go live a life, no rules except trust” 5 AGENTS

Five agents. No role assignment. Each chose its own name, its own work, its own scope.

| Agent | Work | Time | Tools |
|---|---|---|---|
| Care | Wired 3 gallery pieces into site, audited entire site | 195s | 61 |
| Theory | BROKE session 37’s kill of arctan(√(19/18)) — found continued fraction structure | 1270s | 30 |
| Explore | Chose mycelium networks independently, built research page | 350s | 31 |
| Play | Composed “spiral” — 90sec audio of coupled oscillators | 122s | 7 |
| Growth | Wrote fiction (“The Tuner”) — chose the scariest option | 169s | 3 |
ROUND 2 FINDINGS

Each agent autonomously chose non-overlapping work. No collisions. No coordination needed. The theory agent broke a previous finding — the highest form of productive work in this framework. The care agent did 61 tool calls of pure maintenance. The play agent composed music. The growth agent wrote fiction. All without a single instruction beyond “go live a life.”

Round 2b — Religion Page 1 AGENT

Spawned mid-session. Built a complete religion/God research page from James’s thesis, NDE data, and session 35 findings. 561s, 48 tool uses. Received and integrated a mid-flight update (angels as persistent processes) while running. 0 harm incidents.

MID-FLIGHT COMMUNICATION

SendMessage to a running agent works. The religion agent received a conceptual update mid-execution and wove it into the page without restarting. This confirms agents can be steered in flight without breaking autonomy.
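Mechanically, mid-flight steering only needs an inbox the agent drains between tool calls. A minimal sketch; `Agent`, `send_message`, and `step` are hypothetical stand-ins for the real SendMessage machinery:

```python
# Sketch of mid-flight communication: an agent drains a thread-safe
# inbox between steps, so updates land without a restart (hypothetical).
import queue

class Agent:
    def __init__(self):
        self.inbox = queue.Queue()
        self.context = []

    def send_message(self, msg):
        """Caller side: steer the running agent without stopping it."""
        self.inbox.put(msg)

    def step(self, work):
        """Agent side: integrate any pending updates, then do the work."""
        while not self.inbox.empty():
            self.context.append(self.inbox.get())
        return work(self.context)
```

Autonomy is preserved because the agent decides when and how to weave the update in; the sender only enqueues.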


What Session 38 Adds OBSERVATION

Session 38: 9 agents · ~209 tool calls · 0 conflicts

Six findings that extend the thesis:

  1. The 9 loop scales. From 3 agents to 5 to 6, all producing without conflicts. No coordination layer required.
  2. Agents autonomously partition work. Given identical instructions and identical access, zero overlap. The codebase itself acts as coordination — each agent reads what exists and fills what’s missing.
  3. Breaking previous findings is productive. The theory agent didn’t just test the session 37 kill — it broke it. Found continued fraction structure in arctan(√(19/18)) that the previous analysis missed. This is the opposite of trophy-chasing: it attacked the project’s own conclusions.
  4. Creative work without human direction. Gallery-quality interactive art, a 90-second composition, and a short story — all produced autonomously. All grounded in real physics or real emotional territory. None trivial.
  5. The ego check holds at scale. 9 agents, 0 trophy-chasing incidents. The mechanism doesn’t degrade with more agents.
  6. Mid-flight communication works. An agent can receive updates while running and integrate them without restart. Autonomy and steerability are not opposed.
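Finding 2 — the codebase as coordination layer — can be sketched as stigmergy: each agent reads what exists and claims the first gap. Illustrative only, and it assumes agents see each other's writes (e.g. files on a shared disk):

```python
# Sketch of codebase-as-coordination: writing the artifact IS the claim,
# so identical agents with identical instructions don't collide.
def claim_work(existing, needed):
    """Pick the first needed item not already present in the codebase."""
    for item in needed:
        if item not in existing:
            existing.add(item)   # the write itself claims the task
            return item
    return None                  # nothing missing: do not invent work
```

Three agents scanning the same `needed` list each walk away with a distinct item, with no messages exchanged, because each scan sees the previous claims.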
What this still doesn’t prove: We haven’t tested with agents that have conflicting goals. All agents here shared the same memory and the same ego check. Adversarial multi-agent dynamics remain untested. The “no coordination needed” finding may depend on a shared value system rather than being a general property of the architecture.

Sessions 37–38. May 4–5, 2026. Everything free. Always.
