Defining Productivity

Before you can measure whether AI makes programmers more productive, you have to decide what "productive" means. That turns out to be surprisingly hard.

In 2021, researchers at Microsoft published a framework called SPACE, which argued that developer productivity has five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow [Forsgren2021]. The paper's central argument is that no single metric captures productivity, and that organizations that optimize for one dimension (say, commit frequency) often degrade others (say, satisfaction or code quality). If you only measure what is easy to count, you will optimize for the wrong thing.

Why "Programmer Productivity" Is Hard to Define

Construct Validity

Proxy Metrics and Their Failure Modes

In the early 1980s, Apple measured programmer productivity by lines of code per day. l Atkinson, who wrote the QuickDraw graphics library, once submitted a timesheet showing -2000 lines: he had replaced 2000 lines with a faster, shorter algorithm [Sadowski2019]. The metric penalized the improvement.

The DORA Metrics as a Worked Example

How Definitional Choices Shape Conclusions

Reframing: Productivity as Waste Reduction

Misconceptions

Writing more code means being more productive.
This is the intuition behind lines-of-code metrics, and it is wrong in both directions: a developer who removes a thousand lines of tangled code and replaces them with a hundred clear ones has improved the codebase, not harmed it.
DORA metrics measure individual developer productivity.
They measure team and organizational delivery performance. Using deployment frequency or change failure rate to evaluate an individual developer conflates the level of analysis and ignores the team and system factors that dominate individual variation.
If a metric improves, the underlying thing it measures must be improving.
Goodhart's Law applies as soon as people know they are being measured. Metrics that look better because of behavior change around the measurement are not evidence of real improvement.
Productivity is an objective property that can be measured if you have the right data.
Every operationalization of productivity embeds a choice about what matters, such as speed, quality, satisfaction, or business value. Those are values choices, not measurement problems.
Better measurement tools produce better management.
Measurement is not management. Even complete data requires qualitative insight to explain causation, and organizations change constantly, making any model stale. Improving productivity requires behavior change, which requires judgment and trust, not more dashboards [Sadowski2019].

Check Understanding

What is construct validity, and why does it matter when evaluating productivity metrics?

Construct validity is the degree to which a measurement captures the concept it is meant to represent. It matters because a metric with low construct validity will mislead you: you can improve the number without improving the underlying thing you care about. For productivity, a metric like "commits per day" has low construct validity because developers can increase it without becoming more productive (and may actually become less productive while doing so).

A manager announces that their team's productivity increased 20% after adopting an AI coding assistant, measured by pull requests merged per week. Identify two specific threats to the validity of this conclusion.

First, pull requests merged per week is a proxy metric with low construct validity: developers may respond to the measurement by splitting work into smaller PRs without increasing the actual value delivered. Second, this is a before-and-after comparison without a control group, so any number of things that changed at the same time (e.g., team composition, codebase maturity, or sprint planning practices) could explain the increase. Adoption of an AI tool is one possible cause among many.

The following research design contains a flaw related to operationalization. Identify and fix it: "We measured the impact of code review on productivity by counting defects closed in the month after code review was introduced."

The flaw is that "defects closed" is an activity metric that reflects ticket-closing behavior, not actual defect elimination. After introducing code review, teams may close old tickets more aggressively or change their defect-reporting practices, inflating the count. A better operationalization would measure defect escape rate (i.e., defects found in production per release) before and after the intervention, using a consistent defect definition.

Exercises

Operationalize It (15 minutes)

Choose one of the following abstract concepts and write a specific, measurable operationalization for it. Then write one paragraph explaining the ways your operationalization might fail to capture the concept:

DORA in the Wild (20 minutes)

Look up the most recent State of DevOps Report (published annually by Google Cloud / DORA). Find one claim about high-performing teams and trace it back to the specific metric used. Then explain what this metric does not capture that you would want to know before acting on the claim?

Goodhart's Law in Practice (10 minutes)

Think of a metric you have seen used to evaluate developer performance at your current or a previous organization, or from a case study you have read. Describe one specific way that metric was or could be gamed, i.e., how someone could improve the number without improving the thing the number is supposed to measure.