I built a roguelite mostly with an AI agent — here's what actually made it work

1 July 2026

Deep Keep is a tactical roguelite. It was built almost entirely with an AI coding agent — the agent wrote most of the code, across roughly a hundred and forty-five pull requests, over a few weeks. This series is about that process, not about the game's mechanics: what building something real this way is actually like, and the handful of practices that kept it from falling apart.

Most write-ups of building with AI stop at the demo — the part where a model emits a plausible-looking feature and everyone claps. The problems that decide whether you have a project start later, around day three, when you have twenty features and they've begun to contradict each other.

The problem isn't capability

The thing that mattered most was also the least surprising: an AI agent can write almost any single feature you ask for. Combat resolver, procedural generator, leaderboard, save-sync, a card-battler UI — none of that is where the difficulty lives. For a task that fits in one sitting, capability is basically a solved problem.

What's hard is coherence over time. Feature #100 has to respect the decisions that made feature #3 work, in a codebase far larger than any context window can hold — and the agent writing #100 has no memory of writing #3. Left to itself it will re-solve a settled problem a different way, contradict an invariant it set up last week, or add a shortcut that breaks the architecture. Not out of incompetence: the local task looked like it called for that, and the wider picture wasn't in front of it.

So the real discipline of building this way is keeping the agent coherent across its own forgetfulness. Everything else in this series is a version of that one problem.

How the work divided up

The split settled quickly and held:

I owned intent and judgement — what to build next, what "good" meant, which trade-offs were acceptable, when something was done.
The agent owned the typing and most of the small-scale reasoning. Given a well-scoped task and the project's rules, it wrote the code, the tests, and usually a sharper design than my one-line request implied.
Automated checks owned the verdict on whether it worked — a typecheck, a test suite, and purpose-built balance simulations that pass or fail on their own, independent of anyone's opinion.

That third actor is the one people underrate, and it gets its own post, because it turned out to be the beam the whole thing rests on.

The practices that kept it coherent

Compressed to its essentials, the experience came down to four things, one post each:

A constitution the agent reads every session — a short list of non-negotiable rules at the root of the repo, so every task starts with the lines it must not cross already in view (part 2).
External memory — design and backlog docs that outlast the context reset, so "what are we doing, and what's already done" always has an authoritative answer (part 3).
Verification the agent can't talk its way past — tests and simulations as ship-gates, because a model will vouch for its own work whether or not the work deserves it (part 4).
A tight human loop — directing, reshaping the vague or risky requests, and amending the rules deliberately when the design genuinely outgrew them (part 6).

There's also a post on the times the agent was confidently, expensively wrong (part 5) and exactly what caught each one, and one on why the work came out as a hundred and forty-five small pull requests rather than a few big ones (part 7).

The short version

If you take one idea from the series, take this one, because the rest keeps returning to it: an AI agent drifts when its rules are things you merely state, and it compounds when its rules are things it can test. A written-down principle gets rationalised around the moment a task seems to need it; a principle backed by a check that fails the build does not. Most of the work below is about turning good intentions into that second kind of rule — starting with the cheapest version of it, the file the agent reads before it writes anything at all.

▶ Play Deep Keep