Lab: Does Test-Driven Development Work?

Learning Goals

Lesson

"""Replicate Mann-Whitney U and Cliff's delta from Fucci et al. (2016)."""

import polars as pl
from scipy import stats


def cliffs_delta(a, b):
    """Return Cliff's delta: P(x > y) - P(x < y) over all pairs (x in a, y in b)."""
    n = len(a) * len(b)
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / n


# Load the dataset and split it by development approach.
df = pl.read_csv("data/fucci_tdd.csv")
tdd = df.filter(pl.col("approach") == "TDD")
tld = df.filter(pl.col("approach") == "TLD")

# Published (p-value, Cliff's delta) pairs for each outcome, used for comparison.
published = {"TESTS": (0.052, 0.19), "QLTY": (0.38, 0.12), "PROD": (0.89, 0.02)}
print(f"{'Outcome':<8} {'U':>8} {'p':>10} {'delta':>8} {'pub_p':>8} {'pub_d':>8}")
for outcome in ["TESTS", "QLTY", "PROD"]:
    # Drop missing values and convert to NumPy arrays before testing.
    a = tdd[outcome].drop_nulls().to_numpy()
    b = tld[outcome].drop_nulls().to_numpy()
    u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    delta = cliffs_delta(a, b)
    pub_p, pub_d = published[outcome]
    print(
        f"{outcome:<8} {u:>8.1f} {p:>10.3e} {delta:>8.3f} {pub_p:>8.3f} {pub_d:>8.3f}"
    )
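
The two statistics in the script are closely related, which gives a quick sanity check on any reproduction. The sketch below uses made-up toy scores (not the Fucci data) and a hypothetical helper, u_statistic, to show that Cliff's delta equals 2U / (n1 * n2) - 1 when U counts the wins of the first sample plus half of the ties. Recent SciPy versions report U for the first sample, so the same identity should hold for the u and delta columns printed above, but treat that as an assumption to verify against your installed version.

```python
def cliffs_delta(a, b):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def u_statistic(a, b):
    """Mann-Whitney U for the first sample: wins plus half of the ties."""
    wins = sum(1 for x in a for y in b if x > y)
    ties = sum(1 for x in a for y in b if x == y)
    return wins + 0.5 * ties


# Made-up toy scores for illustration only; not taken from the Fucci data.
group_a = [3, 5, 7, 7]
group_b = [2, 5, 6]

delta = cliffs_delta(group_a, group_b)
u = u_statistic(group_a, group_b)

# Identity linking the two statistics: delta = 2U / (n1 * n2) - 1.
delta_from_u = 2 * u / (len(group_a) * len(group_b)) - 1
print(f"delta={delta:.4f}  U={u}  delta_from_U={delta_from_u:.4f}")
```

If the two delta values disagree in a reproduction, the likely culprits are null handling or a U statistic reported for the other sample.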

Check Understanding

Fucci et al. found no significant difference in QLTY or PROD between TDD and TLD. What are two possible explanations for this null result?

One explanation is that TDD and TLD genuinely produce equivalent outcomes for quality and productivity, at least in short controlled experiments with professional developers, and the null result reflects the true state of affairs. A second explanation is that the study was underpowered: with the number of participants in the study, the statistical tests could not reliably detect a small-to-medium effect even if one existed. Dyba et al. showed that most SE experiments lack sufficient power to detect realistic effects [Dyba2006], so a null result often means "we could not see it" rather than "it is not there."

The following code produces an error when run on the Fucci data. What is the bug and how do you fix it?
tdd = df.filter(pl.col("approach") == "TDD")
tld = df.filter(pl.col("approach") == "TLD")
u, p = stats.mannwhitneyu(tdd["PROD"], tld["PROD"])

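stats.mannwhitneyu expects array-like numeric input with no missing values. Depending on the Polars and SciPy versions, passing a Polars Series directly can raise a TypeError, and any nulls that survive conversion become NaN and invalidate the result. The fix is to drop null values and call .to_numpy() before passing the data to the test: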

a = tdd["PROD"].drop_nulls().to_numpy()
b = tld["PROD"].drop_nulls().to_numpy()
u, p = stats.mannwhitneyu(a, b, alternative="two-sided")

The alternative="two-sided" argument is also worth stating explicitly. SciPy 1.7 and later default to "two-sided", but older versions defaulted to a deprecated None setting that reported a one-sided p-value, so spelling the argument out matches the analysis in the paper and avoids confusion about which alternative hypothesis is being tested.

What does it mean for a study to be underpowered, and why might an underpowered study produce a null result even when a real effect exists?

Statistical power is the probability that a test will correctly reject the null hypothesis when a real effect exists. A study is underpowered when its sample size is too small to give the test a reasonable chance of detecting an effect of the expected size. If power is 0.30 for a medium effect, the test will miss that effect 70% of the time purely because of sampling noise. An underpowered study that finds p > 0.05 therefore cannot conclude that there is no effect; it can only conclude that its sample was too small to detect one reliably.
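
Power can be estimated by simulation: generate many datasets in which a known effect exists and count how often the test rejects. The sketch below is a minimal illustration under stated assumptions, not the analysis from the paper. It assumes normally distributed scores, uses a normal approximation to the Mann-Whitney U test with no tie or continuity correction, and takes Cliff's delta of 0.3 as the target effect. The helper names and the group sizes (10 and 60 per group) are hypothetical placeholders, not the sample sizes in the Fucci dataset.

```python
import random
from statistics import NormalDist


def mann_whitney_p(a, b):
    """Two-sided p-value from the normal approximation to Mann-Whitney U."""
    n1, n2 = len(a), len(b)
    # U for the first sample: wins plus half of the ties.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(n1, n2, shift, trials=1000, alpha=0.05, seed=0):
    """Fraction of simulated experiments that reject the null at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(shift, 1.0) for _ in range(n1)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        if mann_whitney_p(a, b) < alpha:
            rejections += 1
    return rejections / trials


# For two unit-variance normals, delta = 2 * Phi(shift / sqrt(2)) - 1,
# so this shift corresponds to the assumed target of Cliff's delta = 0.3.
target_delta = 0.3
shift = 2 ** 0.5 * NormalDist().inv_cdf((1 + target_delta) / 2)

# Placeholder group sizes; substitute the real ones from the dataset.
low_power = simulated_power(10, 10, shift)
high_power = simulated_power(60, 60, shift)
print(f"n=10 per group: power ~ {low_power:.2f}")
print(f"n=60 per group: power ~ {high_power:.2f}")
```

The same real effect is detected far more often with the larger groups, which is why a non-significant result from a small sample says little about whether the effect exists.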

Nagappan et al. found TDD reduced defects by 40-90% in industrial case studies, while Fucci et al. found little effect in a controlled experiment. How can both findings be correct?

The two studies measured different things in different contexts. Nagappan et al. studied real teams on real projects over extended periods, where TDD adoption likely came bundled with other practices: more careful design, better discipline, and organizational support for quality. Fucci et al. isolated the test-first vs. test-last distinction in a short controlled task, removing most of the surrounding context. It is plausible that TDD's benefits in industry come from the culture and habits it encourages rather than from the mechanical act of writing tests before code. The external validity of a controlled experiment is always limited, and the internal validity of an industrial case study is always limited. Together they are more informative than either alone.

Exercises

Reproduce Table 3

Run replicate_fucci.py and compare your output to the published values for all three outcomes. Report your reproduced p-values and Cliff's delta values alongside the published values in a table. For any discrepancy larger than 0.01 in either p or delta, write one sentence hypothesizing what might cause the discrepancy, considering differences in exclusion criteria, null value handling, or test parameterization.

Effect of exclusion criteria

Rerun the Mann-Whitney U tests for all three outcomes without applying any of the exclusion criteria used in the original analysis (that is, include every row in the dataset). Report how the p-values change compared to the analysis with exclusions. Write two sentences explaining whether the exclusion decisions strengthen or weaken the main finding, and whether you think the original authors were justified in applying them.

Operationalizing rhythm

Fucci et al. [Fucci2017] found that the rhythm of test-code interleaving matters more than whether tests are written first or last. Write one sentence defining an operationalization of "rhythm" that could be measured from a version control history (for example, using commit timestamps and commit messages or file names to identify test commits vs. implementation commits). Write a second sentence describing the dataset you would need to measure this operationalization across a sample of open-source projects.

Power analysis

Using simulation with scipy.stats, or an approximation from a power analysis library such as statsmodels.stats.power (which provides t-test power routines but no Mann-Whitney routine), estimate the statistical power of a Mann-Whitney U test with the actual sample sizes in the Fucci dataset to detect a medium effect (Cliff's delta ≈ 0.3). Report the power value and state which method you used. Write two sentences explaining what this power level implies for how much confidence you should place in the null results for QLTY and PROD, given that both outcomes showed p > 0.05.

Reconciling contradictory findings

Nagappan et al. found TDD reduced defects by 40-90% in industrial case studies [Nagappan2008a], while Fucci et al. found little effect in a controlled experiment. Write three sentences reconciling these findings, using the concepts of external validity, confounds, and effect size. Your answer should explain how both studies can be correct without either being dishonest or incompetent.