ESSAY · 5 min · 2026.06.03

Two years ago, I'd have filed a ticket

A customer messaged me with a question I couldn’t answer with the product as it shipped:

In addition to running your reviewer on our current PRs, we wanted to backtest it — replay an old PR at the commit it was opened at, before the human comments and the fix-up commits, and see what your reviewer would have caught. Is there a way to run it against a list of past commits?

It’s a good question. They already knew how those PRs turned out. They wanted to point a new tool at history they understood and compare it against what their current setup caught. That’s a backtest, and it’s the right instinct — you don’t evaluate something new on cases where you don’t know the answer.

Two years ago my honest reply would have been: great idea, let me file a feature request. And then it would have sat in a queue behind everything else, and the customer would have moved on.

Instead I built it that afternoon. It’s a small CLI called pr-backtest, and the customer is running it themselves.

What it does

Point it at a pull request URL. It recreates that PR — all of its commits, from the PR’s merge-base up to its head — as a brand-new PR in a repo you control. A review bot then reviews the recreated PR as if the code were written this morning. Your real main is never touched.

The default is the interesting part: it recreates the PR as it was when it was opened — only the commits that existed before the review comments drove a round of fixes. That’s the high-value case. You’re not asking “would the bot agree with the final merged code?” You’re asking “would the bot have caught what the humans caught, before anyone fixed it?”

Why this generalizes

I built it to backtest Macroscope, but nothing about the idea is specific to us. If you have anything that operates on pull requests — a reviewer, a linter, a policy check, a classifier, a new model — you can evaluate it against a previously-run PR whose outcome you already know.

That’s the whole trick. A live PR is a single uncontrolled sample. A historical PR is a labeled test case: you know what was wrong, you know what got flagged, you know what shipped. Replaying it lets you compare the thing you’re testing now against the thing you had at the time, on identical input. It turns “this feels better” into “here’s what it caught that the old setup missed.”

The part I’m actually proud of

It is not an impressive piece of engineering. It’s a few thousand lines that shell out to git and call the GitHub API. There are people writing infrastructure this week that would make it look like a toy, and I’m not going to pretend otherwise.

What I’m proud of is that I didn’t build it like a salesperson hacking something together. I built it the way I’ve watched our engineers build things — and that’s the only reason it’s something a customer can actually run.

I work next to a team that is unusually disciplined about how code gets shipped, and I’ve spent two years absorbing it by osmosis. So instead of vibe-coding a script until it worked once on my machine, I borrowed their process wholesale:

A spec before any code. Every non-trivial change started as a written spec — the why, the exact decision, the rules that pin down behavior, and testable acceptance criteria. Ambiguity caught in a spec costs a sentence. Ambiguity caught in code costs a rewrite.
Independent review, on purpose. The diff went through separate single-purpose passes — security, correctness, developer experience — each run on its own so no lens diluted another. A reviewer thinking only about attack surface catches things a general “looks good to me” never will.
An adversarial teardown. A pass whose only job was to read the code the way a skeptical stranger would before trusting it: dead branches, an abstraction with one caller, a clever line that should be three boring ones. It found the cleanup the feature-focused reviews skipped.
Tests that actually bite. 350+ of them, but the count isn’t the point — the point is that I flipped each fix off to confirm the test went red. A test that still passes with the bug reintroduced is theater.
A review bot on every PR. The backstop: a fresh set of eyes that didn’t know the plan. It caught a real regression the other layers missed.

None of those steps was expensive, because AI made each one nearly free. The winning move wasn’t to skip them to go faster. It was to run all of them.

Two decisions worth a look

A couple of things turned out to be more interesting than “a CLI that opens a PR,” and they’re where the borrowed discipline shows up most:

The “as opened” default degrades honestly. Recreating a PR exactly as it was opened relies on commit dates as a proxy for when things were pushed — a best-effort heuristic. When it can’t be recovered (the branch was rebased after opening), the tool doesn’t guess. It falls back to the full PR and says so, in the plan label and a one-line note. A heuristic default is fine if it’s transparent and degrades safely.

A careful reader auditing for safety finds nothing. The tool only ever talks to GitHub — no telemetry, no third-party calls, enforced in code by a request hook that throws on any other host. Your token is never put in a URL git would write to disk and never passed on a command line where ps could see it; it’s delivered through a tiny askpass helper and scrubbed from every output path, including the error handler. None of that is novel. All of it is the difference between a script and a thing you’d hand to a customer.

The actual point

The story here isn’t that a salesperson can write code now. Plenty can. It’s that the person closest to the customer — the one who heard the question in the customer’s own words and understood why they were asking — can now close the loop themselves, before the moment passes.

The distance between “I understand this problem” and “here’s a defensible thing that solves it” collapsed. What’s left is whether you care enough to walk it, and whether you’ve paid enough attention to the people who build for a living to do it without leaving a mess.

I think that’s a real edge. Not the code. The fact that someone who sits with customers all day can take their own engineering team’s standards and ship something to a real user the same afternoon they hear the problem.

The repo is here: github.com/govambam/pr-backtest. If you’ve got something that runs on PRs and you want to know how it would’ve done on the ones you already understand, point it at one and find out.