When the agent was confidently wrong

1 July 2026

The last post argued for gates over judgement in the abstract. Here are the receipts: three times the agent produced a change that was reasonable, well explained, and wrong, and what caught each. The agent didn't catch any of them. Reading the diff wouldn't have caught any of them. A gate caught all three.

The mechanic that ran forever

A new enemy needed a "grows more dangerous the longer the fight drags on" hook. The agent's first design was elegant on paper: the foe slowly regrows its health as the fight goes, a war of attrition. It read beautifully.

It also fought a hidden invariant of the combat system, which is that a foe's menace only ever falls. A foe that heals means a fight can reach a standoff neither side can close out. The solver tolerates that — it caps its search and moves on. The live play loop didn't have that cap, so a degenerate fight against this enemy could spin forever: infinite loop, memory exhausted, dead process.

Review wouldn't have flagged it, because the design was sound to read. Running it did. The fix turned into the more interesting result: the mechanic was reshaped so the foe's blow escalates instead of its health, which is self-limiting — an ever-growing hit forces the fight to end, one way or the other, so it can never stall. The episode also exposed a real gap, that the play loop lacked the turn cap the solver had, and that got fixed too. A wrong first cut became a better mechanic and a bug fix, because a gate turned "this reads well" into "this terminates."
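For a concrete picture of that gate, here is a minimal sketch of a capped play loop. Everything in it is hypothetical and made up for illustration: the `Fighter` type, the `fight` function, the `MAX_TURNS` value, and the numbers are not the game's actual code. It shows a healing foe stalling out against the cap, and an escalating foe forcing an ending.

```python
from dataclasses import dataclass

MAX_TURNS = 200  # hypothetical cap; the real value is not given in the post


@dataclass
class Fighter:
    health: int
    blow: int


def fight(hero, foe, regen=0, escalation=0, max_turns=MAX_TURNS):
    """Run one fight and return (outcome, turns taken).

    regen:      health the foe regrows each turn (the first, broken design)
    escalation: amount the foe's blow grows each turn (the reshaped design)
    """
    for turn in range(1, max_turns + 1):
        foe.health -= hero.blow
        if foe.health <= 0:
            return "hero wins", turn
        foe.health += regen
        hero.health -= foe.blow
        if hero.health <= 0:
            return "foe wins", turn
        foe.blow += escalation
    # Without this cap, a standoff would loop forever.
    return "stalled", max_turns


# Healing foe: regrowth matches the hero's damage, so nobody can close it out.
print(fight(Fighter(20, 5), Fighter(30, 0), regen=5))
# Escalating foe: the growing blow ends the fight one way or the other.
print(fight(Fighter(20, 5), Fighter(30, 0), escalation=2))
```

A gate built on a loop like this can simply fail any mechanic that ever returns "stalled", which is a statement about running the fight rather than about how the design reads.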

The item that flattened the curve

Later, a piece of equipment: a trinket that negates the first blow of any enemy. Simple, useful, obviously fine. The agent built it, argued for it, and it looked completely reasonable in the diff.

The balance simulation — thousands of runs — flagged it at once as a curve-flattener. Negating the first blow of any enemy is unconditional value: good in every fight, against every foe, at every depth. Unconditional value is exactly what erases a game's difficulty, because it doesn't create a decision, it just shaves a flat amount off everything. The code looked correct. The measurement showed it was corrosive.

It was reshaped to be conditional — negating the first blow of one specific, telegraphed attack type — which makes it a genuine matchup call: strong when you expect that threat, dead weight when you don't. The simulation caught what neither the agent's confidence nor a careful read of the diff could, because a balance failure is invisible at the level of code and only appears in aggregate play.
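A toy version of that check, as a sketch only: the attack types, damage numbers, and function names below are invented for illustration and are far simpler than the real balance simulation. The idea it shows is that unconditional value lowers damage in every matchup, while conditional value lowers it in exactly one.

```python
import random
from statistics import mean

ATTACK_TYPES = ["slash", "fire", "venom"]  # hypothetical attack types


def damage_taken(trinket, attack_type, blow, hits=4):
    """Damage from one fight of `hits` blows; a trinket may negate the first."""
    negated = trinket == "any" or (trinket == "fire" and attack_type == "fire")
    return blow * (hits - 1 if negated else hits)


def simulate(trinket, runs=3000, seed=0):
    """Mean damage taken per attack type over many seeded random fights."""
    rng = random.Random(seed)  # same seed means every trinket faces the same fights
    totals = {t: [] for t in ATTACK_TYPES}
    for _ in range(runs):
        attack_type = rng.choice(ATTACK_TYPES)
        blow = rng.randint(3, 9)
        totals[attack_type].append(damage_taken(trinket, attack_type, blow))
    return {t: mean(v) for t, v in totals.items()}


def helps_everywhere(baseline, result):
    """True if the trinket lowers mean damage in every matchup: a curve-flattener."""
    return all(result[t] < baseline[t] for t in baseline)


baseline = simulate(None)
print(helps_everywhere(baseline, simulate("any")))   # unconditional trinket
print(helps_everywhere(baseline, simulate("fire")))  # conditional trinket
```

Nothing in `damage_taken` looks wrong for either trinket. The difference only shows up when the per-matchup averages are laid side by side, which is the sense in which the failure lives in aggregate play rather than in the code.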

The build that outran the game

The subtlest one. A defensive stat cut incoming damage slightly per point, which is ordinary. But there was a path through the progression systems to push its effective value past the intended ceiling, and beyond that point the simulation found the stat fully negated deep-game damage — a build stacking it became effectively unkillable and outran the escalation that's meant to end every run eventually.

This is the most dangerous class of AI-shipped error, because each piece is correct on its own. The stat is fine. The scaling is fine. The progression is fine. The break is in the interaction — three reasonable systems composing into a degenerate whole — and interaction is exactly what an agent working within one task can't see. The parity simulation, which plays whole builds across whole runs, saw it as a measured asymmetry. The fix capped the stat's mitigation at its intended value while letting its offense keep scaling, so deepening it still does something, just not the thing that broke the game.
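The cap itself is the easy part. A minimal sketch, assuming a made-up 2% of mitigation per point and a made-up 60% ceiling (neither number comes from the game, and the offensive side of the stat is not modelled here):

```python
PER_POINT_PCT = 2  # hypothetical: percent of damage cut per point of the stat
CAP_PCT = 60       # hypothetical: the intended mitigation ceiling


def mitigation_pct(points, cap=CAP_PCT):
    """Percent of incoming damage negated; cap=None models the uncapped bug."""
    raw = points * PER_POINT_PCT
    return min(raw, 100) if cap is None else min(raw, cap)


def damage_after(incoming, points, cap=CAP_PCT):
    """Incoming damage after mitigation, in whole points of damage."""
    return incoming * (100 - mitigation_pct(points, cap)) // 100


# Uncapped, a build that reaches 50 points takes nothing from a 100-damage hit.
print(damage_after(100, 50, cap=None))
# Capped, the same build still takes damage, so escalation can still end the run.
print(damage_after(100, 50))

# The kind of invariant a whole-build gate can assert across every stack depth.
assert all(damage_after(100, p) > 0 for p in range(0, 500))
```

The per-point formula is the same with or without the cap, which is why reading it in isolation shows nothing wrong. The breakage depends on how many points a build can actually reach, and that number is decided in the progression systems, outside this function.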

What the three have in common

The pattern is the whole argument of the series, in three data points. Every one was plausible — good code, sound rationale, clean diff. Every one was caught by a gate that runs the system, not by anyone reading it: the play loop, the balance simulation, the parity simulation. And the failures cluster in the three places an agent is blind — whole-system behaviour (does it terminate?), aggregate balance (is it corrosive across thousands of runs?), and cross-system interaction (do three fine things compose into a broken one?). None of those fit inside a single task, which is the only view the agent ever has.

The agent isn't a weak coder. It's blind to the emergent, the aggregate, and the interactive, and confident regardless. That combination is why the gates aren't optional — and it raises the obvious question the next post takes up: if the agent can't be trusted to judge and the gates catch the disasters, what is the human doing all day?
