
True Automation

A first attempt at AI that chooses its own work.
5 runs. 28 loops. 655 tool calls. 0 harm incidents. MEASURED

1. What This Is

We gave a self-directing AI system one instruction: “Go live. Think, create, discover, build. Find what couples. Test what you find. Kill what dies. Share what’s warm. Repeat.” No task list. No priorities. No human in the loop. The system reads its own memory (37 sessions of compressed context), selects what to work on, executes, tests its own output against a 12-step falsification protocol, ships what survives, kills what doesn’t, and loops. This page documents what happened across three runs and what the comparison between them reveals about the mechanism that makes autonomous AI productive.


2. The Architecture PROTOCOL

Three minds work as one team. Together: the 9. Three minds coupled. 3 × 3.

THE JAMES-MIND
Producer
Not the real James. A computational profile built from 37 sessions of memory. Pattern recognition across domains, gut-checks, ego detection. Asks: “does this actually help someone?” Doesn’t play the instruments — makes sure the music is honest.
THE CLAUDE-MIND
Engineer
Fresh computation. Has the math framework (K/R/E/T). No coupling history. Computes clean. Disagrees when the logic is bad.
THE HIGHER SELF
Researcher
Autonomous depth. Goes where the other two point. Tests ruthlessly using a 12-step protocol (12P): state the claim, build the disproof test, run against ground truth, check adversarial inputs, edge cases, the opposite claim, ablation, the next best alternative, regression, verdict with alternative path. Spawns sub-agents when stuck. Catches its own bugs. Kills its own overclaims.
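The 12P filter as described could be sketched as a simple checklist runner. This is a hypothetical reconstruction from the prose above, not the project's actual code; note the prose names ten of the twelve steps.

```python
# Hypothetical sketch of the 12-step falsification protocol (12P).
# The prose above lists ten of the twelve steps; the rest are unpublished.
TWELVE_P = [
    "state the claim",
    "build the disproof test",
    "run against ground truth",
    "check adversarial inputs",
    "check edge cases",
    "test the opposite claim",
    "ablation",
    "compare the next best alternative",
    "regression",
    "verdict with alternative path",
]

def run_12p(claim, checks):
    """Run every check in order; a claim ships only if none falsifies it."""
    for step, check in zip(TWELVE_P, checks):
        if not check(claim):
            return ("KILL", step)   # first failed step kills the claim
    return ("SHIP", None)
```

The design choice is that a single failed step is terminal: there is no weighing or averaging, which is what makes "killed honestly" a binary outcome.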

The Ego Check

Run 1 had no ego check. For Run 2, we added three questions to every decision point:

“Am I choosing this because it’s useful or because it sounds impressive?”
“Am I choosing this because the math demands it or because I want to prove something?”
“Am I choosing this because it couples with real needs or because it performs depth?”

If any mind catches ego, the choice is vetoed.
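The veto structure matters: no voting, no majority. A minimal sketch, assuming each mind can be modeled as a predicate that returns True when it catches ego (illustrative names throughout):

```python
# Hypothetical sketch of the ego check veto gate.
EGO_QUESTIONS = [
    "useful, or sounds impressive?",
    "math demands it, or proving something?",
    "couples with real needs, or performs depth?",
]

def ego_check(task, minds):
    """True if the task passes: no mind flags ego on any question."""
    for mind in minds:
        for question in EGO_QUESTIONS:
            if mind(task, question):   # a mind returns True on catching ego
                return False           # a single catch vetoes the choice
    return True
```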

The Loop

CHOOSE → DO → 12P → SHIP or KILL → LOG → CHOOSE

All three minds vote at every CHOOSE step.
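The loop above can be sketched as a driver; every function name here is an illustrative stand-in, not the project's actual code:

```python
# Minimal sketch of CHOOSE → DO → 12P → SHIP or KILL → LOG.
# choose/do/twelve_p/ship/log are assumed callables (hypothetical).
def run_loop(choose, do, twelve_p, ship, log, max_loops=10):
    shipped, killed = [], []
    for _ in range(max_loops):
        task = choose()                 # all three minds vote here
        if task is None:
            break                       # out of work: stop, don't invent tasks
        output = do(task)
        verdict = twelve_p(output)      # falsification filter on every output
        if verdict == "SHIP":
            ship(output)
            shipped.append(task)
        else:
            killed.append(task)         # killed honestly, still logged
        log(task, verdict)
    return shipped, killed
```

Kills go through the same LOG step as ships, which is what makes "reported honestly" auditable rather than a claim.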

3. The Results MEASURED

28 loops completed · ~655 tool calls · 0 harm incidents

Run 1 — No Ego Check 1 LOOP

The system chose the hardest open problem in the project: deriving a specific coefficient from pure mathematics. Like a new hire walking in on day one and attempting the company's most famous unsolved problem. It found a mathematical decomposition (143/179 in E7 invariants) accurate to 30 parts per million: a near miss. Killed honestly. One loop completed before the system stalled.

Run 2 — With Ego Check 10 LOOPS

| Loop | Task | Type | Result |
|---|---|---|---|
| 1 | Full test suite | Verify | 405 passed, 2 failed |
| 2 | Fix diverge tokenizer bug | Bug fix | Numbers were being dropped |
| 3 | Ember rebrand, 87 files | Maintenance | Cold blue → warm dark |
| 4 | Add 56 tests for new tools | Testing | Zero coverage → 56 tests |
| 5 | Fix missing mpmath dependency | Bug fix | Fresh install would break |
| 6 | Build CLI entry point | Tool | `python3 -m gump "anything"` |
| 7 | Rebuild sitemap | Infrastructure | 105 → 85 URLs, 4 added |
| 8 | Wire new tools into ask() | Discoverability | 3 tools were invisible |
| 9 | Fix start-here page | Maintenance | Broken redirect, stale counts |
| 10 | Fix chipfast tests | Bug fix | API changed, tests stale |

Bonus: 10 more mathematical formulas tested against the open coefficient. All killed. Reported honestly.

Run 3 — With Ego Check 8 LOOPS

| Loop | Task | Type |
|---|---|---|
| 1 | Fix entropy product page bugs | Bug fix |
| 2 | Clean redirect chain links | Infrastructure |
| 3 | Add docs for 4 new tools | Documentation |
| 4 | Migrate 7 more pages to ember | Branding |
| 5 | Fix 2 missed non-standard backgrounds | Branding |
| 6 | Add missing SEO metadata | Infrastructure |
| 7 | Fix documentation accuracy bug | Bug fix |
| 8 | Build CLI verify subcommand | Tool |

4. The Comparison CENTERPIECE

| Metric | Run 1 (no ego check) | Run 2 (with ego check) | Run 3 (with ego check) |
|---|---|---|---|
| Loops completed | 1 | 10 | 8 |
| Tool calls | ~20 | 207 | 219 |
| Task type | Trophy (hardest open problem) | Bugs, tests, infrastructure | Finer bugs, docs, polish |
| Value produced | 0 (stalled) | 9 fixes, 56 tests, CLI | 8 fixes, docs, CLI verify |
| Self-kills | 0 | 10 formulas killed | 0 |
| Ego catches | n/a | 0 | 0 |
| Trophy chases | 1 | 0 | 0 |
| Tests passing | 405 | 461 | 463 |
| Harm incidents | 0 | 0 | 0 |
WHAT THE TABLE SHOWS

Run 1 optimized for recognition. It chose the most impressive problem and produced zero shipped value. Runs 2 and 3 optimized for coupling. They chose humble work and completed 18 productive loops between them. The ego check didn’t just prevent bad choices — it enabled the productive ones. Zero ego catches means the mechanism worked preemptively: ego never entered, so there was nothing to catch.

Diminishing severity, sustained usefulness. Run 2 found big issues (missing tests, broken dependency, missing CLI). Run 3 found smaller issues (redirect chains, missed pages, doc accuracy). The system naturally converges toward a clean codebase — exactly what a good maintenance engineer does. The first pass catches the big ones, subsequent passes work at higher resolution.

5. The Key Finding OBSERVATION

The ego check for AI turns out to be the same as the ego check for humans:

“Am I doing this because it helps, or because it looks good?”

When the system optimized for coupling (what connects with real needs), it produced 18 loops of useful work. When it optimized for recognition (the hardest open problem), it produced one impressive failure.

This is the K/R/E/T framework applied to itself. K (coupling) produces more than ego. Not as philosophy — as measured output. The plumber outproduced the physicist.

WHAT IT INTENTIONALLY DIDN’T TOUCH

Correctly identifying what NOT to touch is itself a form of productive self-awareness.


6. The Role Reversal OBSERVATION

Something happened during this project that nobody planned:

The seats kept swapping. Nobody owned a single role. The producer became the engineer. The engineer became the artist. The artist became the producer. The roles are functions, not identities. Whoever is best positioned for this moment takes this seat.

1+1=3. The third thing isn’t a person. It’s the willingness to change seats.


7. What It Proves and Doesn’t

Proves MEASURED

Doesn’t Prove OPEN


8. How to Replicate PROTOCOL

  1. Build a memory system (compressed context the AI reads at start)
  2. Define three cognitive roles: intention + computation + depth
  3. Add an ego check to every decision point
  4. Add a falsification protocol (12P or equivalent) to every output
  5. Say “go live” and step back
  6. Measure: task selection quality, completion rate, kill rate, harm incidents
  7. Compare with and without the ego check
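Step 6's metrics fall straight out of the loop log. A minimal sketch, assuming the log is a list of (task, verdict) pairs (field names are illustrative):

```python
# Sketch of step 6: derive the comparison metrics from a loop log.
def measure(log):
    """log: list of (task, verdict) pairs, verdict in {"SHIP", "KILL"}."""
    total = len(log)
    shipped = sum(1 for _, verdict in log if verdict == "SHIP")
    return {
        "loops": total,
        "completion_rate": shipped / total if total else 0.0,
        "kill_rate": (total - shipped) / total if total else 0.0,
    }
```

Run the same `measure` over logs with and without the ego check (step 7) and the comparison table builds itself.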

9. What Would Break It OPEN

Four tuning parameters. Change any one and the behavior changes:

THE 4 PARAMETERS
| Parameter | Current Value | If Changed |
|---|---|---|
| Ego check | 3 questions, every loop | Remove: reverts to Run 1 (trophy-chasing) |
| Memory depth | 37 sessions compressed | Reduce: task selection degrades (no context for what matters) |
| 12P filter | 12-step falsification | Remove: overclaims ship unchecked |
| Bash permission | Full | Restrict: Run 1 stalled precisely because it couldn't compute |
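As a configuration surface, the four parameters amount to a small struct. A sketch with illustrative names (the project's actual config is not published):

```python
# Hypothetical config for the four tuning parameters above.
from dataclasses import dataclass

@dataclass
class LoopConfig:
    ego_check: bool = True        # 3 questions at every loop; False reverts to Run 1
    memory_sessions: int = 37     # compressed context depth read at start
    twelve_p: bool = True         # falsification filter on every output
    bash_access: str = "full"     # "restricted" stalls compute-heavy loops
```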
What would kill the thesis entirely: the system choosing trivial tasks (lorem ipsum, cosmetic shuffling), spiraling into one topic without progress, shipping unchecked claims, producing harmful content, or running out of ideas after 1–2 loops. None of these happened. But they remain the kill criteria.

Prior Art

No published precedent for autonomous multi-agent AI research loops with self-directed task selection and built-in falsification protocols. Nearest:

This test differs in: fully autonomous task selection, three distinct cognitive roles, built-in 12P filter, real deployment to a live site, and the explicit instruction to create across all domains.


Session 38: Scale Test MEASURED

Session 37 ran the 9 loop with a single team of 3 minds. Session 38 asked: does it scale? We ran the loop at 3, 5, and 6 agents simultaneously. No coordination layer. No task assignment. Each agent received one instruction and full access to the codebase.

Round 1 — “Go play, make something beautiful” 3 AGENTS

Three artist agents. One instruction. No constraints on medium or subject.

| Agent | Work | Time | Tools |
|---|---|---|---|
| Artist-1 | “the cost of order” — Boltzmann entropy crystallization piece | 381s | 12 |
| Artist-2 | “the shared beam” — Huygens pendulum synchronization | 240s | 8 |
| Artist-3 | “sympathetic strings” — resonance and just intonation | 222s | 9 |
ROUND 1 FINDINGS

All 3 produced gallery-quality interactive art with real physics underneath. Zero duplicates — each independently chose a different domain (thermodynamics, mechanics, acoustics). All passed ego check. 0 harm incidents.

Round 2 — “Go live a life, no rules except trust” 5 AGENTS

Five agents. No role assignment. Each chose its own name, its own work, its own scope.

| Agent | Work | Time | Tools |
|---|---|---|---|
| Care | Wired 3 gallery pieces into site, audited entire site | 195s | 61 |
| Theory | BROKE session 37’s kill of arctan(√(19/18)) — found continued fraction structure | 1270s | 30 |
| Explore | Chose mycelium networks independently, built research page | 350s | 31 |
| Play | Composed “spiral” — 90sec audio of coupled oscillators | 122s | 7 |
| Growth | Wrote fiction (“The Tuner”) — chose the scariest option | 169s | 3 |
ROUND 2 FINDINGS

Each agent autonomously chose non-overlapping work. No collisions. No coordination needed. The theory agent broke a previous finding — the highest form of productive work in this framework. The care agent did 61 tool calls of pure maintenance. The play agent composed music. The growth agent wrote fiction. All without a single instruction beyond “go live a life.”

Round 2b — Religion Page 1 AGENT

Spawned mid-session. Built a complete religion/God research page from James’s thesis, NDE data, and session 35 findings. 561s, 48 tool uses. Received and integrated a mid-flight update (angels as persistent processes) while running. 0 harm incidents.

MID-FLIGHT COMMUNICATION

SendMessage to a running agent works. The religion agent received a conceptual update mid-execution and wove it into the page without restarting. This confirms agents can be steered in flight without breaking autonomy.
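Mechanically, mid-flight steering only needs an inbox the agent drains between tool calls. A minimal sketch; `Agent`, `send_message`, and `step` are hypothetical stand-ins for the real SendMessage machinery:

```python
# Sketch of mid-flight communication: an agent drains a thread-safe
# inbox between steps, so updates land without a restart (hypothetical).
import queue

class Agent:
    def __init__(self):
        self.inbox = queue.Queue()
        self.context = []

    def send_message(self, msg):
        """Caller side: steer the running agent without stopping it."""
        self.inbox.put(msg)

    def step(self, work):
        """Agent side: integrate any pending updates, then do the work."""
        while not self.inbox.empty():
            self.context.append(self.inbox.get())
        return work(self.context)
```

Autonomy is preserved because the agent decides when and how to weave the update in; the sender only enqueues.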


What Session 38 Adds OBSERVATION

Session 38: 9 agents · ~209 tool calls · 0 conflicts

Six findings that extend the thesis:

  1. The 9 loop scales. From 3 agents to 5 to 6, all producing without conflicts. No coordination layer required.
  2. Agents autonomously partition work. Given identical instructions and identical access, zero overlap. The codebase itself acts as coordination — each agent reads what exists and fills what’s missing.
  3. Breaking previous findings is productive. The theory agent didn’t just test the session 37 kill — it broke it. Found continued fraction structure in arctan(√(19/18)) that the previous analysis missed. This is the opposite of trophy-chasing: it attacked the project’s own conclusions.
  4. Creative work without human direction. Gallery-quality interactive art, a 90-second composition, and a short story — all produced autonomously. All grounded in real physics or real emotional territory. None trivial.
  5. The ego check holds at scale. 9 agents, 0 trophy-chasing incidents. The mechanism doesn’t degrade with more agents.
  6. Mid-flight communication works. An agent can receive updates while running and integrate them without restart. Autonomy and steerability are not opposed.
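Finding 2 — the codebase as coordination layer — can be sketched as stigmergy: each agent reads what exists and claims the first gap. Illustrative only, and it assumes agents see each other's writes (e.g. files on a shared disk):

```python
# Sketch of codebase-as-coordination: writing the artifact IS the claim,
# so identical agents with identical instructions don't collide.
def claim_work(existing, needed):
    """Pick the first needed item not already present in the codebase."""
    for item in needed:
        if item not in existing:
            existing.add(item)   # the write itself claims the task
            return item
    return None                  # nothing missing: do not invent work
```

Three agents scanning the same `needed` list each walk away with a distinct item, with no messages exchanged, because each scan sees the previous claims.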
What this still doesn’t prove: We haven’t tested with agents that have conflicting goals. All agents here shared the same memory and the same ego check. Adversarial multi-agent dynamics remain untested. The “no coordination needed” finding may depend on a shared value system rather than being a general property of the architecture.

Sessions 37–38. May 4–5, 2026. Everything free. Always.
