Why Should You Care What Researchers Found?

Learning Goals

Lesson

"""Compute development-time percentiles from Prechelt data."""

import polars as pl

df = pl.read_csv("data/jccpprtTR.csv")
print("All languages:")
print(
    df.select(
        pl.col("whours").quantile(0.10).alias("p10"),
        pl.col("whours").quantile(0.50).alias("p50"),
        pl.col("whours").quantile(0.90).alias("p90"),
    )
)
p10 = df["whours"].quantile(0.10)
p90 = df["whours"].quantile(0.90)
print(f"90th/10th ratio: {p90 / p10:.1f}X")

java = df.filter(pl.col("lang") == "Java")
print("\nJava only:")
print(
    java.select(
        pl.col("whours").quantile(0.10).alias("p10"),
        pl.col("whours").quantile(0.50).alias("p50"),
        pl.col("whours").quantile(0.90).alias("p90"),
    )
)

Check Understanding

What is wrong with claiming 10X productivity by comparing the best programmer to the worst?

Comparing extremes guarantees an inflated ratio regardless of the underlying distribution. Every dataset has a maximum and a minimum; the ratio between them tells you about the tails, not about typical performance. Prechelt found a 105X ratio across all languages, but the 90th/10th percentile ratio in the same data is far smaller. Picking a comparison method after seeing the data is also a form of cherry-picking: you can get almost any number you want by choosing the right pair of values to compare.
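To make the contrast concrete, here is a minimal sketch that uses a small made-up list of completion times (not Prechelt's data) and Python's standard `statistics` module rather than Polars, so that it runs without the data file. The `inclusive` interpolation method is one reasonable choice among several; other methods give slightly different percentiles.

```python
"""Compare a max/min ratio with a 90th/10th percentile ratio."""

from statistics import quantiles

# Hypothetical completion times in hours, invented for illustration only.
hours = [2, 5, 6, 7, 8, 9, 10, 12, 15, 20, 40]

# Ratio of extremes: depends entirely on the two most unusual values.
extreme_ratio = max(hours) / min(hours)

# quantiles(..., n=10) returns the nine cut points between deciles,
# so the first is the 10th percentile and the last is the 90th.
deciles = quantiles(hours, n=10, method="inclusive")
p10, p90 = deciles[0], deciles[-1]
decile_ratio = p90 / p10

print(f"max/min ratio: {extreme_ratio:.1f}X")
print(f"90th/10th ratio: {decile_ratio:.1f}X")
```

With these invented numbers the max/min ratio comes out five times larger than the 90th/10th ratio, even though both are computed from the same list.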

The following code is supposed to compute the ratio of the 90th percentile to the 10th percentile, but it contains a bug. What is wrong and how do you fix it?

p10 = df["whours"].quantile(10)
p90 = df["whours"].quantile(90)
ratio = p90 / p10

The quantile method in Polars takes a fraction between 0.0 and 1.0, not a percentage. Passing 10 therefore asks for a quantile far outside the valid range, which will either raise an error or return None. The fix is:

p10 = df["whours"].quantile(0.10)
p90 = df["whours"].quantile(0.90)
ratio = p90 / p10

What does Goodhart's Law predict will happen if a company starts evaluating developers by lines of code written per week?

Developers will write more lines of code per week, because that is the measure they are being evaluated on. This can happen in several ways: padding code with extra blank lines, splitting statements across multiple lines, avoiding helper functions that reduce duplication, or simply writing verbose code where concise code would do. The number goes up; the underlying quality of the software does not necessarily follow. Eventually the metric stops reflecting what the company actually cares about, which is working software delivered efficiently.
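The effect is easy to demonstrate. The sketch below stores two hypothetical implementations of the same function as source text, counts their non-blank lines as a naive stand-in for a lines-of-code metric, and confirms that both return the same result. The names `COMPACT`, `VERBOSE`, `count_lines`, and `run_total` are invented for this illustration.

```python
"""Show that a lines-of-code count can rise while behavior stays the same."""

# Two hypothetical implementations of the same function, stored as source text.
COMPACT = """
def total(values):
    return sum(values)
"""

VERBOSE = """
def total(values):
    result = 0
    for value in values:
        result = result + value
    return result
"""


def count_lines(source):
    """Count non-blank lines, a naive stand-in for a lines-of-code metric."""
    return sum(1 for line in source.splitlines() if line.strip())


def run_total(source, values):
    """Define total() from source text and call it on values."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](values)


sample = [1, 2, 3]
for name, source in [("compact", COMPACT), ("verbose", VERBOSE)]:
    print(f"{name}: {count_lines(source)} lines, total = {run_total(source, sample)}")
```

The verbose version scores more than twice as high on the metric while doing exactly the same work.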

How does the Noda DevEx framework improve on simple productivity metrics like lines-per-day or tickets-closed?

The DevEx framework identifies three distinct dimensions of developer experience: feedback loops, flow state, and cognitive load. A single metric like lines-per-day collapses all three into one number, which means you cannot tell whether a low score reflects slow feedback from CI/CD pipelines, constant interruptions, or a genuinely hard problem. By measuring each dimension separately, teams can diagnose which specific aspect of their environment is limiting productivity and target improvements accordingly. For example, a developer might close few tickets while working effectively, because poorly documented legacy code imposes a heavy cognitive load that a ticket count cannot show.
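One way to picture the difference is to keep the three dimensions as separate fields instead of a single score. The sketch below is only an illustration: the `DevExRating` class, the 1 to 5 scale, and the convention that a higher score is better on every dimension are assumptions made here, not part of the framework itself.

```python
"""Record ratings on three DevEx dimensions and find the weakest one."""

from dataclasses import asdict, dataclass


@dataclass
class DevExRating:
    # Each score runs from 1 (poor) to 5 (good); this scale is an assumption
    # of the sketch. For cognitive_load, 5 means the load felt manageable.
    feedback_loops: int
    flow_state: int
    cognitive_load: int

    def weakest_dimension(self):
        """Return the name of the lowest-scoring dimension."""
        scores = asdict(self)
        return min(scores, key=scores.get)


# Hypothetical ratings for one task, invented for illustration.
legacy_fix = DevExRating(feedback_loops=4, flow_state=3, cognitive_load=1)
print(legacy_fix.weakest_dimension())
```

Keeping the scores apart is what makes the diagnosis possible: a single averaged number for this task would hide that cognitive load, not feedback speed, was the limiting factor.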

Devanbu et al. found that developers held beliefs about their own projects that contradicted evidence in the project's own data. What does this suggest about relying on team leads' opinions when evaluating a new tool?

It suggests that opinion alone is unreliable evidence, even when the person has direct experience with the project in question. A team lead's intuition is shaped by memorable incidents, recent events, and the subset of the codebase they interact with most frequently. Data collected systematically across the whole project often tells a different story. This does not mean team leads are wrong about everything, but it does mean their opinions should be treated as hypotheses to be checked against data rather than as conclusions.
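Treating an opinion as a hypothesis can be as simple as counting. The sketch below uses invented records (the module touched by each bug-fix commit in an imaginary project) and checks a stated belief about which module needs the most fixes. The module names and the belief are hypothetical; a real check would draw the records from the project's own history.

```python
"""Treat a belief about a project as a hypothesis and check it against counts."""

from collections import Counter

# Hypothetical records: the module touched by each bug-fix commit.
# In practice these would be extracted from the project's own history.
bug_fix_modules = ["parser", "ui", "parser", "network", "ui", "ui", "ui", "parser"]

# The belief under test, as a team lead might state it.
believed_worst = "parser"

counts = Counter(bug_fix_modules)
observed_worst, observed_count = counts.most_common(1)[0]

print(f"believed: {believed_worst} ({counts[believed_worst]} fixes)")
print(f"observed: {observed_worst} ({observed_count} fixes)")
print("belief supported" if believed_worst == observed_worst else "belief not supported")
```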

Exercises

Reproduce Prechelt's Java Claim

Load the Prechelt data file and filter it to Java submissions only. Find the minimum and maximum values of whours in that subset and compute their ratio. Then compute the same ratio for each other language in the dataset and present all ratios in a single Polars dataframe, sorted from largest to smallest. Write a one-sentence interpretation of what the table tells you about language choice and development time variability.

DevEx Self-Assessment

Pick two programming tasks you completed in the past two weeks: one that felt frustrating and one that felt productive. For each task, rate it from 1 to 5 on the three DevEx dimensions (feedback loops, flow state, cognitive load) and write a sentence explaining each rating. Then identify which single dimension most limited your productivity on the frustrating task and describe one concrete change to your environment or process that would improve that dimension.

Percentiles by Language

Using the Prechelt data, compute the 25th, 50th, and 75th percentile of whours for each programming language as a single Polars dataframe with one row per language. Sort the rows by median development time. Identify the language with the widest interquartile range (75th minus 25th percentile) and write a sentence explaining what a wide range implies about predictability for project managers estimating delivery time.

Gaming a Metric

Choose one of the questions from the Begel and Zimmermann top-ten list [Begel2014] that involves a measurable outcome. Describe a specific metric that a team might use to track progress on that question. Then describe in concrete terms how a developer or manager could game that metric: that is, how they could improve the number without improving the underlying outcome the metric was designed to measure. Explain which aspect of Goodhart's Law your example illustrates.

Belief Versus Evidence

The Devanbu et al. study [Devanbu2016] found that developers' beliefs often do not match the evidence in their own projects. Write a short paragraph proposing one hypothesis about why this gap exists: for example, why might a developer's experience with a particular module lead them to a false belief about the codebase as a whole? Then propose a study design that a team could run with their own project data to test whether a specific commonly-held belief is accurate. Specify what data you would collect, how you would collect it, and what result would count as evidence against the belief.