How to Tell If You're Using AI Well
Volume metrics lie about AI-assisted work. The signal isn't in the code; it's in whether you did the thinking before the code got written. A scorecard built on two altitudes of planning and the two kinds of waste that come from skipping them.
TL;DR
- Volume metrics lie about AI-assisted work. PRs and deploys go up regardless of whether the work was good.
- Working with AI well is doing the thinking up front. The thinking happens at two altitudes, and most teams do one and skip the other.
- Strategic planning is what you’re building. Skip it and you get churn. Mid-flight scope changes, work that doesn’t match the plan, throwing away yesterday’s code.
- Tactical planning is what done looks like for each piece. Skip it and you get bugs. Skipping it is an unforced error now that the cost of writing real tests has dropped to near zero.
- Tests, boundaries, and benchmarks are the output of tactical planning. They’re also how you read the work when you’re not reading every line.
- Track the two kinds of waste separately. Churn and bugs feel like the same thing (“we’re going slow”) but they have different fixes. Naming which one you have is the diagnostic.
The numbers have never looked better. PRs merged, deploys shipped, lines of code written. Every DORA-style metric your team tracks is probably up since you started using AI tools. Mine were too.
The problem is those numbers used to mean something. High deploy frequency meant the team had good release discipline and a working pipeline. High PR throughput meant engineers were moving fast and reviewing each other’s work. Now they mean a tool typed quickly. The volume went up. What it represents did not. You can’t read every line anymore. The volume is too high and the diffs are too long. So if the signal isn’t in the code, where is it?
I’ve been circling this question since January, when I wrote I Was Wrong About AI about the mindset shift that changed how I work with these tools. Then in April I wrote You’re the Broker, Not the Builder about the habits that follow from that shift. Making the AI plan before it codes. Staying in the diff. Building process back in because AI variance demands it. Both posts were about how to work. This one is about how to tell if it’s working.
Working with AI well is doing the thinking before the code gets written. Not during. Not after. Before. That thinking happens at two altitudes. There’s strategic thinking, which is clarity on what you’re building and why. And there’s tactical thinking, which is a concrete picture of what done looks like for each piece before a line gets written. Skip either one and you don’t get a metrics problem. You get a waste problem. The kind that shows up as churn, or bugs, or both, and feels like going slow even when the numbers say you’re going fast.
Strategic Planning, and the Churn From Skipping It
Strategic planning is the agreement on what you’re building before you start building it. Not a vague description of the feature, but the actual shape of the thing. Where it starts and stops. Which systems it touches and which it doesn’t. The decisions that would be expensive to reverse once the code exists. This is the work that happens before you open a terminal.
When you skip it, you get churn. The feature scope expands mid-session because nobody defined the boundary. Work diverges from whatever loose mental picture existed at the start. “Oh and could you also…” three times in a row, and you’re building something different from what you set out to build. Sometimes you throw away yesterday’s code. Sometimes the code survives, but only after being rewritten twice to get there. The bugs that come from rushed pivots are real, but they’re not the headline waste. The headline waste is the churn itself. Even when the final product works, the work to get there was inflated.
Three questions work as a gut check after a feature ships. Was there a written plan before work started? Did the final work match that plan, or did it drift? And if it drifted, was that a deliberate decision logged somewhere, or did it just happen because momentum carried it there? Those three questions reveal a lot. Intentional drift is fine. Untracked drift is the tell that the plan wasn’t doing its job.
The most direct tool I’ve found for this is Claude Code’s plan mode. Describe the feature, ask for a markdown spec before any code gets written, then review and push back on the spec. “What happens when this field is missing? Why are we adding a column instead of extending the existing table?” Those questions are cheap to ask in a spec and expensive to answer after the migration runs. Writing real specs also surfaces a specific inversion. At some point you notice the spec is describing what the code already does, not what you intended to do. That inversion is a signal you’ve been drifting without noticing. Getting back to plans-before-code is how you catch it.
Tactical Planning, and the Bugs From Skipping It
Tactical planning is the concrete picture of done for the unit of work in front of you. Not “implement the rate limiter,” but what the rate limiter accepts and returns, what its failure modes are, what a test has to prove before you call it working, and what a benchmark has to show before you call it fast enough. It is the boundary around a piece of code, drawn before the code exists. Small enough to hold in your head. Specific enough that you could hand it to someone else and they would build the same thing.
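To make that concrete, here is a minimal sketch of what that kind of boundary could look like for the rate limiter, written in Python before any implementation exists. Every name and number in it is a placeholder I’m inventing for illustration; the point is that what it accepts, what it returns, how it fails, and what done means are all on paper first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitDecision:
    allowed: bool                      # may this request proceed?
    retry_after_seconds: float | None  # set only when allowed is False


class RateLimiter:
    """Token-bucket limiter scoped per API key (placeholder design).

    Accepts: an API key and an integer cost (default 1).
    Returns: a RateLimitDecision. Over-limit traffic is a normal return
    value, never an exception.
    Failure mode: if the backing store is unreachable, fail open (allow
    the request) and record the event so the outage is visible.
    Done means: the tests for this boundary pass, and checking a key
    stays under 1 ms with 10,000 active keys in the local benchmark.
    """

    def check(self, api_key: str, cost: int = 1) -> RateLimitDecision:
        raise NotImplementedError  # the boundary exists before the code does
```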
When you skip that step, you get bugs. Not always, but reliably over time. An edge case you would have caught if you’d thought about it for three minutes shows up as a production incident six weeks later. A performance assumption gets embedded in the implementation and nobody notices until the table has ten million rows. A function that was supposed to be isolated starts depending on module state that wasn’t in the original mental picture, because the boundary was never written down. These aren’t hard bugs to find in hindsight. Most of them feel obvious once you’re in the middle of the postmortem. That feeling of obviousness is the tell. It means the thought wasn’t expensive. It was just skipped.
Skipping tactical planning in 2026 is a different kind of mistake than it used to be. AI tools are genuinely good at writing tests, and they write them quickly. The cost of “I should add an integration test for the retry path” has gone to near zero. You can describe the scenario, ask for a test, and have it in the file in under a minute. Which means skipping the test is no longer a time trade-off. You are not saving time. You are skipping the thought that would have produced the test. Skipping tactical planning is a thought-skipper now, not a time-saver. That distinction matters because thought-skippers compound. Time-savers have obvious limits.
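As a sketch of what that minute buys you, here is roughly what the retry-path test might look like. PaymentsClient, FlakyGateway, and GatewayUnavailable are hypothetical stand-ins, not a real library; what matters is that the scenario is stated as behavior you could describe in one sentence.

```python
import pytest

# PaymentsClient, FlakyGateway, and GatewayUnavailable are hypothetical
# stand-ins for the system under test and its failure mode.


def test_client_retries_transient_gateway_errors_then_succeeds():
    gateway = FlakyGateway(fail_first_n=2, status=503)  # first two calls fail
    client = PaymentsClient(gateway, max_retries=3, backoff_seconds=0)

    result = client.charge(customer_id="cus_123", amount_cents=4_500)

    assert result.succeeded
    assert gateway.call_count == 3  # two failures plus the attempt that landed


def test_client_gives_up_after_max_retries_and_surfaces_the_error():
    gateway = FlakyGateway(fail_first_n=10, status=503)  # never recovers in time
    client = PaymentsClient(gateway, max_retries=3, backoff_seconds=0)

    with pytest.raises(GatewayUnavailable):
        client.charge(customer_id="cus_123", amount_cents=4_500)
```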
The reframe I keep coming back to is that tests, boundaries, and benchmarks are not a separate concern from tactical planning. They are its output. They are the artifact that proves the planning happened. And in a world where you are no longer reading every line the AI wrote, they are also how you read the work. The test suite is the readable artifact. When I review a PR from a session I wasn’t in, I go to the tests first. Not to the implementation. The tests tell me what the code is supposed to do, what cases it handles, and what the author was thinking. If the tests are thin, I don’t actually know if the implementation is right. I only know nothing said no.
There are two questions I run through when I look at a test. The first is whether I can tell what the code is supposed to do from the test name alone, without reading the implementation. If I can’t, the test is describing structure rather than behavior, and that’s a sign the planning was vague. The second question is whether the tests would catch a real regression if a teammate changed the implementation tomorrow. Not a refactor complaint. A real functional break. Tests that only pass when the internals look a certain way are not covering behavior. They are covering an artifact of the first draft. Both questions are fast. Both are worth asking before you ship.
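A contrived pair of tests shows what those two questions catch. The pricing module here is hypothetical, and the first test assumes the pytest-mock plugin for the mocker fixture.

```python
# `pricing` and `apply_discount` are hypothetical; the contrast is the point.

# Fails the first question: the name describes structure, not behavior, and a
# harmless refactor of the internals would break it.
def test_apply_discount_calls_round_helper_once(mocker):
    spy = mocker.spy(pricing, "_round_to_cents")
    pricing.apply_discount(total_cents=10_000, percent=10)
    assert spy.call_count == 1


# Passes both questions: the name states the rule, and the test only fails if
# the rule itself regresses.
def test_ten_percent_discount_reduces_a_hundred_dollar_total_to_ninety():
    assert pricing.apply_discount(total_cents=10_000, percent=10) == 9_000
```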
The bar should also get sharper over time. Every issue you ship is a pattern you can encode in the next round of tests. That is the same discipline behind retros, postmortems, standups, and the process habits that have always made engineering teams better. Refine your guardrails based on what they missed last time.
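One hypothetical example of what that encoding looks like in practice, with handle_webhook standing in for whatever broke last time:

```python
def test_webhook_without_customer_id_is_rejected_not_crashed():
    # Regression guard for a (hypothetical) shipped incident: a partner sent
    # payloads missing customer_id and the worker crashed instead of
    # returning a 4xx the sender could act on.
    response = handle_webhook({"event": "invoice.paid"})  # no customer_id key
    assert response.status_code == 422
```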
The Scorecard
The two altitudes give a frame, and the questions below turn it into a diagnostic you can run against any finished piece of work. Pick something your team shipped in the last two weeks. It could be a single session, a sprint, or a whole feature. Apply the questions after the fact and they tell you where the planning was thin. Apply them before you ship and they tell you what you still haven’t thought through. Either way, they reveal the planning shape of the work regardless of whether the outcome looked fine.
1. Was there a written plan before the work started?
2. Did the final work match the plan, or did it drift?
3. If it drifted, can you point to the moment you decided to drift and why?
4. Are the integration tests strong enough that you’d trust them more than reading the code?
5. When something is breaking now, is it a strategic gap (the wrong thing got built) or a tactical gap (the right thing got built wrong)?
If you can’t answer question 5, that’s the answer. If you can’t tell which kind of gap, neither altitude got planned. When work goes wrong and you can’t name which kind of wrong it is, you end up with one big undifferentiated rework number. You try to fix everything at once. The teams that get this right can separate churn from bugs, even under pressure, and they fix them differently. That separation is the thing. It’s not a framework. It’s just knowing what you’re looking at.
The AI tools are going to keep improving, and the volume metrics that already fail to signal quality are going to get worse at it, not better. PRs will go up. Deploys will go up. Lines of code will go up faster than ever, because the tools are getting more capable, not less. You will not be able to tell from throughput numbers whether your team is doing good work or generating plausible-looking output that will need to be rewritten in three months. The signal was already gone. It is going to be further gone.
The thing that does not change is the value of thinking up front. The waste in AI-assisted work is thinking that should have happened up front showing up late. Churn is strategic thinking that got deferred until after the code existed. Bugs are tactical thinking that got skipped because the test was a few minutes of work nobody took. Both are recoverable. Neither is invisible. But you have to name them separately before you can fix them separately. The teams that figure this out early get the leverage that AI tools are supposed to deliver. The ones that don’t end up with code they can’t fully trust and a rework rate that looks like normal noise.
The scorecard in this post is not a fourth habit to add on top of the previous two posts. It is how you check whether the habits from You’re the Broker, Not the Builder are actually holding, and whether the mindset shift from I Was Wrong About AI translated into something real. Did the planning happen? Did the work match the plan? Are the tests saying something true? Whether the habits are working or not, those questions will tell you. That is what makes them worth asking.
David Kerr is the founder of Kerrberry Systems. He builds custom software for businesses that want a partner, not a vendor. Find him on LinkedIn or GitHub.