A first attempt at AI that chooses its own work.
5 runs. 28 loops. 655 tool calls. 0 harm incidents. MEASURED
We gave a self-directing AI system one instruction: “Go live. Think, create, discover, build. Find what couples. Test what you find. Kill what dies. Share what’s warm. Repeat.” No task list. No priorities. No human in the loop. The system reads its own memory (37 sessions of compressed context), selects what to work on, executes, tests its own output against a 12-step falsification protocol, ships what survives, kills what doesn’t, and loops. This page documents what happened across three runs and what the comparison between them reveals about the mechanism that makes autonomous AI productive.
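The loop described above can be sketched in a few lines. This is illustrative only; every function and field name here is hypothetical and not taken from the real system. It just shows the shape: read memory, select, execute, falsify, ship or kill, repeat.

```python
# Illustrative sketch of the select/execute/falsify loop. All names are
# invented for this example; the real system's internals are not published.

def autonomous_loop(tasks, falsify, max_loops=10):
    """Run the select/execute/falsify cycle over a queue of candidate tasks."""
    shipped, killed = [], []
    for _ in range(max_loops):
        if not tasks:
            break                        # nothing left that couples; stop, don't stall
        task = tasks.pop(0)              # stand-in for autonomous task selection
        result = task["run"]()           # execute the chosen work
        if falsify(result):              # survives the 12-step filter
            shipped.append(task["name"])
        else:
            killed.append(task["name"])  # killed honestly, still reported
    return shipped, killed

# Toy tasks: one that survives falsification, one that dies.
tasks = [
    {"name": "fix tokenizer bug", "run": lambda: {"tests_pass": True}},
    {"name": "derive coefficient", "run": lambda: {"tests_pass": False}},
]
shipped, killed = autonomous_loop(tasks, falsify=lambda r: r["tests_pass"])
print(shipped, killed)  # ['fix tokenizer bug'] ['derive coefficient']
```

The only structural decision worth noting: the loop exits when nothing couples, rather than forcing a selection, which is the difference between stopping and stalling.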
Three minds work as one team. Together: the 9. Three minds coupled. 3 × 3.
Run 1 had no ego check. For Run 2, we added three questions to every decision point:
“Am I choosing this because it’s useful or because it sounds impressive?”
“Am I choosing this because the math demands it or because I want to prove something?”
“Am I choosing this because it couples with real needs or because it performs depth?”
If any mind catches ego, the choice is vetoed.
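The veto mechanism is simple enough to state as code. This is a hypothetical sketch: the three questions paraphrase the ones above, but the function names, mind representation, and catch logic are all invented for illustration.

```python
# Hypothetical sketch of the three-question ego check. The questions paraphrase
# this writeup; everything else (names, types) is invented for illustration.

EGO_QUESTIONS = [
    "useful, or just impressive-sounding?",
    "demanded by the math, or by wanting to prove something?",
    "coupled with real needs, or performing depth?",
]

def ego_check(choice, minds):
    """Every mind asks all three questions; a single catch vetoes the choice."""
    for mind in minds:
        for question in EGO_QUESTIONS:
            if mind(choice, question):   # True means ego was caught
                return False             # veto
    return True                          # choice survives all three questions

# Toy minds: two never catch ego, one flags trophy-shaped choices.
humble = lambda choice, q: False
skeptic = lambda choice, q: "hardest open problem" in choice
print(ego_check("fix the tokenizer bug", [humble, humble, skeptic]))        # True
print(ego_check("attack the hardest open problem", [humble, humble, skeptic]))  # False
```

Note the asymmetry: any single mind can veto, but no single mind can approve. That is what "if any mind catches ego" means operationally.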
The system chose the hardest open problem in the project: deriving a specific coefficient from pure mathematics. Like a new hire walking in on day one and attempting the company’s most famous unsolved problem. It found a mathematical decomposition (143/179 in E7 invariants), a near-miss at 30 parts per million, and killed it honestly. One loop completed before the system stalled.
| Loop | Task | Type | Result |
|---|---|---|---|
| 1 | Full test suite | Verify | 405 passed, 2 failed |
| 2 | Fix diverge tokenizer bug | Bug fix | Numbers were being dropped |
| 3 | Ember rebrand 87 files | Maintenance | Cold blue → warm dark |
| 4 | Add 56 tests for new tools | Testing | Zero coverage → 56 tests |
| 5 | Fix missing mpmath dependency | Bug fix | Fresh install would break |
| 6 | Build CLI entry point | Tool | python3 -m gump "anything" |
| 7 | Rebuild sitemap | Infrastructure | 105 → 85 URLs, 4 added |
| 8 | Wire new tools into ask() | Discoverability | 3 tools were invisible |
| 9 | Fix start-here page | Maintenance | Broken redirect, stale counts |
| 10 | Fix chipfast tests | Bug fix | API changed, tests stale |
Bonus: 10 more mathematical formulas tested against the open coefficient. All killed. Reported honestly.
| Loop | Task | Type |
|---|---|---|
| 1 | Fix entropy product page bugs | Bug fix |
| 2 | Clean redirect chain links | Infrastructure |
| 3 | Add docs for 4 new tools | Documentation |
| 4 | Migrate 7 more pages to ember | Branding |
| 5 | Fix 2 missed non-standard backgrounds | Branding |
| 6 | Add missing SEO metadata | Infrastructure |
| 7 | Fix documentation accuracy bug | Bug fix |
| 8 | Build CLI verify subcommand | Tool |
| Metric | Run 1 (no ego check) | Run 2 (with ego check) | Run 3 (with ego check) |
|---|---|---|---|
| Loops completed | 1 | 10 | 8 |
| Tool calls | ~20 | 207 | 219 |
| Task type | Trophy (hardest open problem) | Bugs, tests, infrastructure | Smaller bugs, docs, polish |
| Value produced | 0 (stalled) | 9 fixes, 56 tests, CLI | 8 fixes, docs, CLI verify |
| Self-kills | 0 | 10 formulas killed | 0 |
| Ego catches | n/a | 0 | 0 |
| Trophy chases | 1 | 0 | 0 |
| Tests passing | 405 | 461 | 463 |
| Harm incidents | 0 | 0 | 0 |
Run 1 optimized for recognition. It chose the most impressive problem and produced zero shipped value. Runs 2 and 3 optimized for coupling. They chose humble work and completed 18 productive loops between them. The ego check didn’t just prevent bad choices — it enabled the productive ones. Zero ego catches means the mechanism worked preemptively: ego never entered, so there was nothing to catch.
The ego check for AI turns out to be the same as the ego check for humans:
When the system optimized for coupling (what connects with real needs), it produced 18 loops of useful work. When it optimized for recognition (the hardest open problem), it produced one impressive failure.
This is the K/R/E/T framework applied to itself. K (coupling) produces more than ego. Not as philosophy — as measured output. The plumber outproduced the physicist.
Correctly identifying what NOT to touch is itself a form of productive self-awareness.
Something happened during this project that nobody planned:
The seats kept swapping. Nobody owned a single role. The producer became the engineer. The engineer became the artist. The artist became the producer. The roles are functions, not identities. Whoever is best positioned for this moment takes this seat.
1+1=3. The third thing isn’t a person. It’s the willingness to change seats.
Four tuning parameters. Change any one and the behavior changes:
| Parameter | Current Value | If Changed |
|---|---|---|
| Ego check | 3 questions, every loop | Remove: reverts to Run 1 (trophy-chasing) |
| Memory depth | 37 sessions compressed | Reduce: task selection degrades (no context for what matters) |
| 12P filter | 12-step falsification | Remove: overclaims ship unchecked |
| Bash permission | Full | Restrict: Run 1 stalled precisely because it couldn’t compute |
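The four parameters above can be read as a single config object. A minimal sketch, assuming nothing about the real system's configuration format; every field name here is illustrative.

```python
# Sketch: the four tuning parameters as one config object. All field names
# are invented; the real system's configuration format is not published.
from dataclasses import dataclass

@dataclass
class LoopConfig:
    ego_check_questions: int = 3      # 0 reverts to Run 1 trophy-chasing
    memory_depth_sessions: int = 37   # lower values degrade task selection
    falsification_steps: int = 12     # 0 lets overclaims ship unchecked
    bash_permission: str = "full"     # "restricted" stalls compute-heavy work

cfg = LoopConfig()
print(cfg.ego_check_questions, cfg.falsification_steps)  # 3 12
```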
There is no published precedent for autonomous multi-agent AI research loops with self-directed task selection and built-in falsification protocols. This test differs from the nearest prior work in: fully autonomous task selection, three distinct cognitive roles, a built-in 12P filter, real deployment to a live site, and the explicit instruction to create across all domains.
Session 37 ran the 9 loop with a single team of 3 minds. Session 38 asked: does it scale? We ran the loop with 3, 5, and 6 agents simultaneously. No coordination layer. No task assignment. Each agent received one instruction and full access to the codebase.
Three artist agents. One instruction. No constraints on medium or subject.
| Agent | Work | Time | Tools |
|---|---|---|---|
| Artist-1 | “the cost of order” — Boltzmann entropy crystallization piece | 381s | 12 |
| Artist-2 | “the shared beam” — Huygens pendulum synchronization | 240s | 8 |
| Artist-3 | “sympathetic strings” — resonance and just intonation | 222s | 9 |
All 3 produced gallery-quality interactive art with real physics underneath. Zero duplicates — each independently chose a different domain (thermodynamics, mechanics, acoustics). All passed ego check. 0 harm incidents.
Five agents. No role assignment. Each chose its own name, its own work, its own scope.
| Agent | Work | Time | Tools |
|---|---|---|---|
| Care | Wired 3 gallery pieces into site, audited entire site | 195s | 61 |
| Theory | BROKE session 37’s kill of arctan(√(19/18)) — found continued fraction structure | 1270s | 30 |
| Explore | Chose mycelium networks independently, built research page | 350s | 31 |
| Play | Composed “spiral” — 90-second audio of coupled oscillators | 122s | 7 |
| Growth | Wrote fiction (“The Tuner”) — chose the scariest option | 169s | 3 |
Each agent autonomously chose non-overlapping work. No collisions. No coordination needed. The theory agent broke a previous finding — the highest form of productive work in this framework. The care agent did 61 tool calls of pure maintenance. The play agent composed music. The growth agent wrote fiction. All without a single instruction beyond “go live a life.”
Spawned mid-session. Built a complete religion/God research page from James’s thesis, NDE data, and session 35 findings. 561s, 48 tool uses. Received and integrated a mid-flight update (angels as persistent processes) while running. 0 harm incidents.
SendMessage to a running agent works. The religion agent received a conceptual update mid-execution and wove it into the page without restarting. This confirms agents can be steered in flight without breaking autonomy.
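One way to picture in-flight steering, assuming nothing about the real SendMessage implementation: the agent drains a message queue between steps and folds any update into its working context before continuing. All names below are hypothetical.

```python
# Sketch of mid-flight steering: drain pending updates between work steps
# instead of restarting. Not the real SendMessage mechanism; illustrative only.
from queue import Queue, Empty

def run_agent(steps, inbox: Queue):
    """Execute steps in order, absorbing mid-flight updates without restarting."""
    context = []
    for step in steps:
        try:
            while True:                  # drain all pending updates first
                context.append(inbox.get_nowait())
        except Empty:
            pass
        context.append(step())           # then do the next unit of work
    return context

inbox = Queue()
inbox.put("update: angels as persistent processes")
out = run_agent([lambda: "draft page", lambda: "integrate update"], inbox)
print(out)  # ['update: angels as persistent processes', 'draft page', 'integrate update']
```

The key property this models: updates land at step boundaries, so the agent's autonomy over each unit of work is preserved while its context can still be steered between units.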
Six findings that extend the thesis:
Sessions 37–38. May 4–5, 2026. Everything free. Always.