How to Be Wrong

Suppose your manager asks you next week to demonstrate that the AI coding tools your company just purchased are worth the subscription cost. What would you measure? Lines of code generated? Tickets closed? A survey asking whether developers feel more productive? Each of those answers is wrong in a specific way. This lesson describes the most common measurement mistakes currently being made in published research and inside organizations. At least one of them is probably something you, your team, or your employer is doing right now.

Counting Lines of Code Generated

Timing Artificial Tasks

Measuring Only the Easy Half

Before/After With No Control Group

Asking Developers If They Feel More Productive

Goodhart's Law Revisited

Treating Adoption Rate as a Success Metric

Comparing Volunteers to Non-Volunteers

Treating Suggestion Acceptance Rate as a Quality Signal

Comparing AI to Nothing

Check Understanding

Your company surveys developers three weeks after rolling out an AI coding assistant. Seventy-eight percent report feeling more productive. Your manager says this proves the tools are working. Identify two specific flaws in this conclusion.

First, the survey was conducted three weeks after adoption, which falls within the novelty period: developers are typically more engaged and positive about new tools during the initial weeks, and that enthusiasm inflates self-reported productivity regardless of actual performance. Second, feeling more productive is not the same as being more productive. Self-report surveys measure perception, which is also distorted by the Hawthorne effect and social desirability bias.

A study compares fifty developers who volunteered to use an AI assistant against fifty who did not. The AI group closes 25% more tickets per week. What is the most likely alternative explanation for this result, and what design change would address it?

The most likely alternative explanation is selection bias: developers who volunteered to use the tool differ from those who did not in ways that predict productivity independently of the tool. A design change that would address this is random assignment: randomly give the tool to half the developers rather than letting them self-select, so that the two groups are comparable on average.
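Random assignment can be sketched in a few lines. The example below is a minimal illustration, not a full experimental protocol: the developer IDs, the `assign_groups` helper, and the seed value are all hypothetical, and a real study would also need to handle team structure, consent, and attrition.

```python
import random


def assign_groups(developers, seed=0):
    """Randomly split developers into a tool group and a control group.

    Hypothetical helper for illustration. A fixed seed makes the
    assignment reproducible so it can be audited later.
    """
    rng = random.Random(seed)
    shuffled = list(developers)   # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster of 100 developers.
developers = [f"dev{i:03d}" for i in range(100)]
tool_group, control_group = assign_groups(developers, seed=42)

print(len(tool_group), len(control_group))  # prints "50 50"
```

Because chance, not enthusiasm, decides who gets the tool, traits such as motivation or seniority should be spread across both groups on average.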

The following measurement plan contains at least two flaws. Identify them and describe a better approach: "To evaluate our AI tools, we will measure lines of code committed per developer per week, comparing the two weeks before and two weeks after tool adoption."

Lines of code has low construct validity as a productivity measure, and a two-week window falls entirely within the novelty period. A better approach would use an outcome-linked metric such as deployment frequency paired with change failure rate, or cycle time from ticket creation to production deployment, and extend the measurement window to at least three to six months. Including a team that did not adopt the tools would provide a control group to separate the AI effect from other concurrent changes.
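One common way to use such a control group is a difference-in-differences comparison: subtract the change seen in the non-adopting team from the change seen in the adopting team. The sketch below names that technique explicitly because the answer above does not prescribe a specific analysis. The cycle-time numbers are invented for illustration, and the method assumes both teams would have followed similar trends without the tool.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the adopting team minus change in the control team."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical cycle times in days (ticket creation to production deployment).
adopters_before = [6.0, 7.0, 8.0]
adopters_after = [4.0, 5.0, 6.0]
control_before = [6.0, 7.0, 8.0]
control_after = [5.0, 6.0, 7.0]

effect = diff_in_diff(adopters_before, adopters_after,
                      control_before, control_after)
print(effect)  # prints -1.0
```

In this made-up data the adopting team improved by two days, but the control team improved by one day without the tool, so only one day of the improvement can be attributed to the tool. A naive before/after comparison would have claimed both.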

A vendor announces that their AI assistant's suggestion acceptance rate rose from 25% to 40% over the past quarter and calls this a "quality improvement." What is wrong with this interpretation?

Acceptance rate measures whether a suggestion looked plausible, not whether the accepted code was correct, secure, or maintainable. A rise in acceptance rate can reflect better suggestions, but it can equally reflect increased time pressure, a shift toward simpler tasks where AI does better, or habituation (developers becoming less critical over time). Without a separate measure of the quality of accepted suggestions, such as defect rates or code review rejection rates, acceptance rate tells you nothing about whether the tool is producing better code.
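The gap between the two measures can be made concrete with a small sketch. Everything here is hypothetical: the `Suggestion` record, the `caused_defect` label (which assumes accepted code can later be linked to a defect), and the quarterly counts, which are invented to match the 25% and 40% figures in the question.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    accepted: bool
    caused_defect: bool  # hypothetical label: accepted code later tied to a defect


def acceptance_rate(suggestions):
    """Share of all suggestions that developers accepted."""
    return sum(s.accepted for s in suggestions) / len(suggestions)


def defect_rate_of_accepted(suggestions):
    """Share of accepted suggestions later linked to a defect."""
    accepted = [s for s in suggestions if s.accepted]
    if not accepted:
        return None  # undefined when nothing was accepted
    return sum(s.caused_defect for s in accepted) / len(accepted)


def quarter(accepted_bad, accepted_ok, rejected):
    """Build a hypothetical quarter of suggestion records from counts."""
    return ([Suggestion(True, True)] * accepted_bad
            + [Suggestion(True, False)] * accepted_ok
            + [Suggestion(False, False)] * rejected)


last_quarter = quarter(accepted_bad=2, accepted_ok=23, rejected=75)
this_quarter = quarter(accepted_bad=8, accepted_ok=32, rejected=60)

for name, data in [("last", last_quarter), ("this", this_quarter)]:
    print(name, acceptance_rate(data), defect_rate_of_accepted(data))
```

In this invented data, acceptance rises from 25% to 40% while the defect rate among accepted suggestions rises from 8% to 20%. The headline metric improved while the code got worse, which is exactly what acceptance rate alone cannot reveal.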

Exercises

Audit Your Own Organization (20 minutes)

List three metrics your organization currently uses (or has used recently) to evaluate the impact of a tool or process change on developer productivity. For each metric, identify which of the failure modes from this lesson it falls into, and briefly explain why. If you believe a metric avoids all of the traps, write one sentence explaining why you think so.

Redesign a Flawed Study (15 minutes)

A company wants to know whether its AI coding tool improves developer productivity. The current plan is to survey developers about perceived productivity two weeks after adoption, compare the responses to a pre-adoption baseline survey, and report the percentage who say they feel more productive.

Identify at least three problems with this design, then describe a better approach in a few sentences, specifying what you would measure, over what time period, and how you would construct a comparison group.

Find the Baseline (10 minutes)

Find a recent vendor claim about an AI tool's impact on developer productivity, such as a blog post, marketing page, or case study. Identify the baseline the claim uses: what does the AI-assisted group get compared to? Then write two sentences describing a more informative baseline and why it would change the interpretation of the result.