Quantitative Methods: Controlled Experiments

In 2005, Hakan Erdogmus and his colleagues published one of the better-controlled studies of test-driven development [Erdogmus2005]. They recruited university students, randomly assigned them to TDD or test-last conditions, and measured code quality and productivity. The result? TDD produced more tests and slightly better coverage, but the effect on external code quality was not significant. The study is remembered not for a dramatic finding but for being careful—and for the researchers' candid discussion of what their careful design still could not rule out.

Controlled experiments are the gold standard for causal claims. They are also expensive, difficult, and frequently misunderstood.

Experimental Design

The Null Hypothesis

p-Values

Effect Size

Statistical Power and Sample Size

Blocking Variables

Threats to Validity

Triangulating Methods

Misconceptions

p < 0.05 means the result is probably true.
It means that if there were no effect, data at least this extreme would arise by chance less than 5% of the time. Whether the result reflects a real phenomenon also depends on the prior plausibility of the hypothesis and on whether the analysis was pre-specified, and the p-value captures neither.
A study using students is worthless for understanding professional developers.
Student studies are not worthless; they have limited external validity. The appropriate response is to be specific about what the study can and cannot generalize to, not to dismiss it.
Statistical significance means practical significance.
An effect can be statistically significant (i.e., reliably detectable) while being too small to matter in practice. A 2% improvement in task completion time that requires retraining the whole organization is not a useful finding just because p < 0.001.
Not finding a significant effect means the treatment doesn't work.
Failing to reject the null hypothesis means the data are consistent with no effect; it does not mean no effect exists. An underpowered study will frequently miss real effects, and most software engineering experiments are underpowered.
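
The point about underpowered studies can be made concrete with a rough power calculation. The sketch below uses the normal approximation for a two-sided comparison of two group means with equal group sizes. The function name approx_power and the example numbers (a true effect of d = 0.5 with 20 participants per group) are illustrative assumptions, not figures from any particular study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of means.

    Normal approximation; ignores the very small chance of rejecting
    the null hypothesis in the wrong direction.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    expected_z = d * sqrt(n_per_group / 2)      # expected test statistic if the effect is real
    return std_normal.cdf(expected_z - z_crit)


# Hypothetical scenario: a real effect of d = 0.5, 20 participants per group.
power_small_study = approx_power(d=0.5, n_per_group=20)
print(f"Approximate power: {power_small_study:.2f}")
```

Under these assumptions a study of this size would detect the effect only about a third of the time, so a non-significant result from it says little about whether the effect exists.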

Check Understanding

A study reports p = 0.03. A colleague says "there's only a 3% chance this result is wrong." What is the correct interpretation?

The correct interpretation is: if the null hypothesis were true (no effect), there would be a 3% chance of observing a result at least as extreme as this one. It does not mean there is a 3% chance the result is a false positive. That probability depends on the prior probability that an effect exists, which the p-value does not capture. The colleague's statement is a common and consequential misinterpretation.
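
The dependence on the prior can be illustrated with Bayes' rule. The sketch below is a simplification: it works with the significance threshold α rather than the exact observed p-value, and the priors and the 80% power figure are hypothetical values chosen for illustration. The function name prob_effect_is_real is likewise an assumption.

```python
def prob_effect_is_real(prior: float, power: float, alpha: float = 0.05) -> float:
    """P(effect exists | result is significant), by Bayes' rule."""
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance anyway
    return true_positives / (true_positives + false_positives)


# Hypothetical priors: how plausible the hypothesis was before the study.
for prior in (0.5, 0.1):
    ppv = prob_effect_is_real(prior=prior, power=0.8)
    print(f"prior = {prior:.1f}: P(real | significant) = {ppv:.2f}")
```

With a prior of 0.1, roughly a third of significant results would be false positives under these assumptions, even though every one of them has p < 0.05.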

Why can't you conclude from a non-significant result (p > 0.05) that there is no effect?

A non-significant result means the data are consistent with the null hypothesis, not that the null hypothesis is true. The study may have been underpowered, with too few participants to detect an effect that exists. Absence of evidence is not evidence of absence, especially in software engineering research where sample sizes are typically small.

A study of 500 developers finds that those using AI tools commit code 8% more frequently (p = 0.001, Cohen's d = 0.15). What should you conclude?

The effect is statistically significant (data this extreme would be unlikely if there were no effect) but very small (d = 0.15 is below the conventional threshold of 0.2 for a "small" effect). With a sample this large, even very small effects can reach statistical significance. The practical significance is unclear: an 8% increase in commit frequency may or may not reflect any meaningful change in productivity, especially given the construct validity problems with using commits as a productivity proxy.
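
For reference, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below computes it with the standard library; the commit counts are made up for illustration and the function name cohens_d is an assumption.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference (group_b minus group_a), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


# Hypothetical weekly commit counts, invented for this example.
without_tool = [10, 12, 14]
with_tool = [11, 13, 15]
d = cohens_d(without_tool, with_tool)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in standard deviations, it does not grow with sample size the way a test statistic does, which is why it is reported alongside the p-value.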

The following study design has a threat to internal validity. Identify it: "We introduced AI coding tools to our team in January and measured productivity (PRs merged) in February. Productivity increased 15%."

Without a control group, you cannot attribute the change to the AI tools. Many things change between January and February: team composition, project phase, sprint planning, seasonal effects. This is a pre-post design (before and after the intervention) with no control, which cannot rule out alternative explanations. A stronger design would use a control group, such as another team that did not receive the tool, to compare against.
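
A small arithmetic sketch shows what the control group buys you. All PR counts below are hypothetical, invented only to illustrate the comparison, and percent_change is an assumed helper name.

```python
def percent_change(before: float, after: float) -> float:
    """Percentage change from the 'before' period to the 'after' period."""
    return 100.0 * (after - before) / before


# Hypothetical PRs merged per month, made up for illustration.
treated_change = percent_change(before=40, after=46)  # team given the tool
control_change = percent_change(before=50, after=55)  # comparable team without it

print(f"Treated team: {treated_change:+.1f}%")
print(f"Control team: {control_change:+.1f}%")
print(f"Difference:   {treated_change - control_change:+.1f} percentage points")
```

In this made-up case the control team also improved, so only part of the treated team's increase could plausibly be credited to the tool. The comparison still assumes the two teams are otherwise similar; random assignment is what justifies that assumption.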

Exercises

Power Calculation (15 minutes)

You want to study whether a new code review tool reduces defect escape rate. From prior data, you know the standard deviation of defect escape rate across teams is about 5 defects per release. You want to detect a difference of 3 defects per release (a 60% reduction from a typical baseline of 5). Use the formula n = 2(z_α + z_β)²σ²/δ², where σ is the standard deviation, δ is the difference you want to detect, and n is the number of teams per condition. With a two-sided α = 0.05 (z_α = 1.96) and 80% power (z_β = 0.84), calculate the minimum number of teams per condition. What does this tell you about the feasibility of running this study?
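
If you want to check your arithmetic, the formula can be coded directly. The sketch below is a minimal version under the same normal approximation; the function name teams_per_condition is an assumption, and the example values (σ = 10, δ = 5) are deliberately not the ones in the exercise.

```python
from math import ceil
from statistics import NormalDist


def teams_per_condition(sigma: float, delta: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 (z_alpha + z_beta)^2 sigma^2 / delta^2, rounded up."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided test: about 1.96
    z_beta = std_normal.inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n)                               # cannot recruit a fraction of a team


# Illustrative values only, not the exercise's numbers.
print(teams_per_condition(sigma=10, delta=5))
```

Note that n scales with (σ/δ)²: halving the difference you want to detect roughly quadruples the number of teams required.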

Design Critique (20 minutes)

Read the methods section of Erdogmus et al. 2005 or the GitHub Copilot study (Peng et al. 2023). List three design choices the researchers made and for each: (a) explain why they made that choice, and (b) describe one threat to validity it does not address.

Build Your Own (20 minutes)

Design a controlled experiment to test one of the following claims. Specify: the treatment and control conditions, how you will assign participants, what you will measure (and how), a realistic sample size, and two threats to external validity.