For nine months, I read every line of code Claude Code wrote for me. Every commit, every diff, every rename. Not because I didn’t trust the model. The cost of one bad commit shipping into a codebase I’ve maintained for a decade was higher than the cost of reading.

Then Opus 4.5 came out, and over a single weekend in November 2025, that habit quietly became unnecessary.

I didn’t relax my standards. The model crossed a threshold. I still reviewed, but the rhythm changed from judgment to spot-check. Nine months of accumulated caution released in about three days. I knew this was a signal, not a release note. Signals don’t mean “the AI got better.” They mean the thing you have to redesign next isn’t the prompt; it’s the process.

The paradox

Around the same time, PIPI onboarded a new engineer. By every textbook rule, our sprint velocity should have gone up: more hands, better tooling, every engineer now using agent coding daily.

It went the other way. PRs piled up. Review queues got longer. Sprint completion dropped.

It took me a few weeks to see why. Giving every engineer an AI accelerator only speeds up writing. Writing has to be reviewed. Reviewing requires someone with context. The people with context are the same small group we had before. Upstream, the spec is still fuzzy. Downstream, the signal is still scattered across Teams, Linear, Sentry, GitHub. The only thing AI had accelerated was the middle, and the middle was never where the bottleneck lived.

The bottleneck didn’t get solved. It moved. Upstream to spec clarity and context loss. Downstream to review capacity and signal return. Individual agents got faster; the loop got more congested.

That was the moment I knew tools-for-engineers wasn’t the answer. Not for us. Not at our size. We needed something that addressed the whole loop, not just the part in the middle that was already easy.

Why our stack was secretly ready

PIPI has spent ten years on Ruby, Rails, and open source. We didn’t make that choice for AI. We made it because we liked shipping small, readable systems that outlive their authors. When Gen-AI arrived, I noticed something I hadn’t expected: every property of that stack turned out to be the exact terrain LLMs work best on.

Convention over configuration means an agent doesn’t need to read two thousand lines of config to understand project structure. It reads the conventions of Rails and it’s already oriented. The agent saves tokens; I save the debugging hour that starts with “why is this wired up this way?”
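A minimal sketch of what “already oriented” means in practice. This is an illustration, not Rails internals: given only a model name, the conventional locations of everything related to it are derivable, so an agent never has to read a routing or wiring file to find them.

```ruby
# Hypothetical illustration: in a conventional Rails app, an agent can
# derive every file location for a domain concept from its name alone.
# Names here are illustrative, not our codebase.
def conventional_paths(model) # e.g. "invoice"
  {
    model:      "app/models/#{model}.rb",
    controller: "app/controllers/#{model}s_controller.rb",
    test:       "test/models/#{model}_test.rb",
    migration:  "db/migrate/*_create_#{model}s.rb",
  }
end

conventional_paths("invoice")
# => { model: "app/models/invoice.rb",
#      controller: "app/controllers/invoices_controller.rb", ... }
```

No config is consulted anywhere in that function; the name is the configuration. That is the token saving: the agent spends context on the change, not on discovering the layout.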

DRY means an agent changing a concept changes it in one place. No silent drift between A and B. No “the fix worked in the controller but the helper still has the old signature.” Agents are especially bad at repairs that require holding two parallel truths at once. DRY pays that tax for them.
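The “one place” property can be sketched concretely. This is a hypothetical example, not code from our app: a pricing rule defined once, so a controller and a helper both reference the same truth and an agent editing it cannot leave a stale copy behind.

```ruby
# Hypothetical sketch: one definition of the pricing rule.
# Both the controller and the view helper call Pricing.total,
# so changing the rule touches exactly one place.
module Pricing
  TAX_RATE = 0.1

  def self.total(subtotal)
    (subtotal * (1 + TAX_RATE)).round(2)
  end
end

Pricing.total(100) # => 110.0
```

The duplicated alternative, a tax constant copy-pasted into the helper, is exactly the shape of bug described above: the fix works in the controller while the helper still carries the old number.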

KISS means problems are already decomposed into small, pure units. Token budget stays on the part that matters. The agent isn’t re-reading a 600-line god class to change three lines.

I didn’t refactor into this because AI was coming. I wrote this way for a decade because it was the discipline I believed in. And when AI arrived, I discovered the stack was already AI-ready. There’s a compounding effect here that I don’t think most teams have named yet: ten years of engineering taste is what determines whether AI works for you tomorrow. A team with messy conventions, copy-pasted logic, and clever-but-illegible code isn’t going to be rescued by a better model. They’ll just accelerate their own entropy.

What we’re actually building

The line I keep coming back to:

We are building an agent that can take software from spec to launch autonomously, and from day one it must inherit the engineering wisdom we’ve accumulated over a decade: review rituals, quality bars, test culture, deployment discipline. The output must be indistinguishable in quality from what we would ship ourselves. Otherwise there is no win in using AI.

That last sentence is the red line. Equivalent quality, or no game.

This is what separates Themis from two dominant framings.

It isn’t a Codex-style autonomous agent that kicks humans out of the loop. The value of a PR isn’t just the code. Half the value is what the process surfaces along the way: the questions, the pushback, the “wait, why are we doing it this way at all.” Remove the human and you remove the point.

It isn’t a Copilot-style accelerator that only speeds up writing. We already established that accelerating writing alone makes the loop slower, not faster.

Themis is a third thing: an agent embedded inside our existing engineering discipline, not beside it. It runs through our CI. It respects our review rules. It reads our skills files. It opens PRs that follow our template. Not because those are shackles, but because those are our quality. Strip them away and you have something that writes code fast and ships regressions fast.

The mission statement we’ve settled on: let teams decide the contract. Themis doesn’t force you to choose between full autonomy and human-in-everything. Each gate (spec review, implementation, code review, deploy) is a dial the team owns. Trust accrues gate by gate, not in a single binary leap.
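The gate-as-dial idea can be sketched as configuration. The gate names come from the list above; the dial values and function are illustrative assumptions, not Themis’s actual API.

```ruby
# Hypothetical sketch of per-gate autonomy dials.
# Dial values are illustrative, not Themis's real settings.
GATES = {
  spec_review:    :human_approves,   # a person signs off before work starts
  implementation: :autonomous,       # the agent writes code unattended
  code_review:    :agent_then_human, # agent reviews first, human spot-checks
  deploy:         :human_approves,   # production stays behind a person
}

def autonomous?(gate)
  GATES.fetch(gate) == :autonomous
end

autonomous?(:implementation) # => true
autonomous?(:deploy)         # => false
```

Turning one dial, say moving code_review from :agent_then_human to :autonomous, is how trust accrues gate by gate rather than in a single binary leap.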

The weekend it clicked

February 2026. I spent a Saturday wiring Themis into our PR review flow for the first time. Sunday evening, I triggered a real review on an actual open PR and went to bed.

Monday morning, one of my colleagues dropped a message in our team chat:

“OMG, after a weekend we are going to lose our jobs.”

The joke had two layers. The surface layer was surprise. The agent had actually read the diff, spotted a subtle issue, left a review comment with the right tone, asked the right question. That by itself is noteworthy.

The deeper layer was recognition. This wasn’t “using an AI tool” anymore. This was working alongside a teammate who happened to be software. The category had shifted while we weren’t looking.

I won’t over-analyze the moment. I’ll just mark it: the weekend we went from building Themis to using Themis.

The bigger surprise

I thought I was building an engineering productivity tool. The biggest return came from outside engineering.

Once MCP and the Skills system were in place, our ops team plugged Themis into Teams, our internal databases, Google Drive. What they got back wasn’t 2x or 3x. It was closer to 10x–20x. Daily reports that used to take half a morning now take one conversation. SLA compliance checks that used to be Excel macros and spreadsheet archaeology became a single natural-language query.

Our country director said it in public, and I still laugh when I remember the exact phrasing:

“My gray-hair trend finally found its cure.”

That line reframed the product for me. AI itself isn’t the product. Making engineering capability accessible to everyone in a small company is the product. Engineers already had tools. The genuinely underserved users were always the people who live in Excel and Teams, translating their questions into tickets for engineers who don’t have time to answer them. Give them direct access to data, context, and structured action through the same agent that reads code, and the economics of a small team change.

This was the moment I stopped thinking of Themis as a dev tool.

Own the loop

Back to the November signal. What I knew that weekend was that a threshold had been crossed. What I’ve learned in the months since is that the signal itself isn’t the opportunity. The opportunity is whether you can turn the signal into a system.

AI that writes code is not rare. AI that completes a real loop, spec to launch, inside a real team’s real quality standard is rare. That is the actual engineering problem of 2026.

Every team will end up building some version of Themis. It won’t always be called that. The model underneath will keep changing; our specific implementation will probably look quaint by 2027. But the shape of the question doesn’t go away. Every distributed small team will face the same choice: keep using engineers as context-porters between systems, or accept that this work belongs to a machine, and then go design the shape of that machine.

We decided to own that loop ourselves. Shaped by our taste, enforced by our standards, governed gate by gate by the team that has to live with the output. Equivalent quality, or no game.

That’s why we built Themis.