145 pull requests: shipping with an agent in small, verifiable steps

1 July 2026

The whole game arrived as roughly a hundred and forty-five pull requests — not a few sweeping commits, but a long stream of small, self-contained changes, each a feature or a fix or a polish pass, each reviewed and merged on its own. That cadence wasn't an accident, and it's the last piece of how building with an agent works.

Why small steps suit this collaborator

It comes back to the fact under everything in this series: the agent has no memory across sessions. Small pull requests are the natural cadence for a collaborator like that.

A small PR is a complete unit of work the agent can own without remembering a three-week arc — one goal, one diff, plus the rules and the docs. It's also something the gates can actually vet and something I can actually read; a five-thousand-line change from an agent is unreviewable, which puts you back to trusting its say-so, the one thing this whole approach avoids. And when a small change goes wrong, you revert one small thing rather than untangling a megachange. The throughput an agent gives you is real, but you don't spend it on bigger steps — you spend it on more small ones.

Big features shipped in phases

Even the largest features arrived in pieces. The biggest addition to the game — a whole new progression pillar of persistent equipment — didn't land as one heroic PR. It shipped as a sequence: first the data model and a hand-authored starter set behind the existing systems, then generated drops, then the enemies designed to counter it, then the random-magnitude rolls. Each phase was independently shippable, independently gated, and left the game working.

That phasing is how you get an agent to build something too big for one session without losing the plot. You and it break the ambition into a staircase of individually verifiable steps, and it climbs one per session; the design log holds the staircase so no session forgets which step is next.

The forcing function I didn't plan for

One bit of discipline did quiet, heavy work across all of those steps, and it's the most transferable idea here. The game is deterministic, and a finished run can be replayed to reproduce its exact outcome — which is how the leaderboard verifies scores instead of trusting them. To keep that property, there's a rule: any change to how the game resolves bumps a replay version, and any change that could invalidate old recorded runs regenerates the reference data.

That reads like bookkeeping. In practice it forced the agent to reason, on every change, about a question it would otherwise skate past — does this touch the deterministic core? You can't casually perturb the rules when perturbing them visibly breaks replay and trips the check. It turned "be careful about the core," a vibe an agent can't reliably hold, into a concrete, unavoidable step. The best guardrails aren't warnings; they're mechanisms that make the risky thing hard to do by accident.

The whole thing, in one place

Seven posts, one argument. Building a real project with an AI coding agent worked — not because the model is magic, but because of a small stack of practices that each answer a specific weakness of the collaborator:

It forgets the rules → a constitution it reads every session.
It forgets the state → external memory it reads first.
It's confident when wrong → gates that run the system and decide.
It has no taste and shouldn't set its own limits → a human directing and amending.
It has no long-term view → small, phased, individually verifiable steps.

Under all of it is the line this series keeps returning to: give the agent invariants it can test, not just ones you state. Determinism made the system checkable, the gates made every "it's fine" answerable, and the constitution and the docs kept a hundred and forty-five sessions coherent in a codebase no context window could hold. The agent wrote most of the code; the scaffolding is what made the code add up to a game. If you're about to build something ambitious this way, build the scaffolding first — it's the difference between throughput you can trust and a fast pile of plausible mush.

▶ Play Deep Keep