Quantitative Methods: Controlled Experiments

In 2005, Hakan Erdogmus and his colleagues published one of the better-controlled studies of test-driven development [Erdogmus2005]. They recruited university students, randomly assigned them to TDD or test-last conditions, and measured code quality and productivity. The result? TDD produced more tests and slightly better coverage, but the effect on external code quality was not significant. The study is remembered not for a dramatic finding but for being careful—and for the researchers' candid discussion of what their careful design still could not rule out.

Controlled experiments are the gold standard for causal claims. They are also expensive, difficult, and frequently misunderstood.

Experimental Design

A controlled experiment manipulates one or more independent variables and measures their effect on one or more dependent variables, while holding other factors constant
The treatment group receives the intervention being tested; the control group does not
Randomization assigns participants to conditions randomly, which distributes unknown confounding variables evenly across groups—the mechanism that makes causal claims defensible
Blinding: in a single-blind study, participants do not know which condition they are in; in a double-blind study, neither participants nor experimenters know until analysis
- Full blinding is rarely possible in software engineering studies (you cannot hide from a developer that they are using TDD)
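
Random assignment is simple enough to sketch in a few lines. The example below is a minimal illustration, not a prescribed procedure: the function name `randomize`, the participant IDs, and the seed are all hypothetical. It shuffles the participant list and deals people round-robin into conditions, so group sizes differ by at most one.

```python
import random

def randomize(participants, conditions=("treatment", "control"), seed=None):
    """Shuffle participants, then deal them round-robin into conditions."""
    rng = random.Random(seed)          # a seed makes the assignment reproducible
    shuffled = list(participants)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Hypothetical participant IDs P01..P20
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomize(participants, seed=42)
for condition, members in groups.items():
    print(condition, len(members), sorted(members))
```

Recording the seed alongside the assignment lets a reviewer confirm that group membership was not adjusted after the fact.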

The Null Hypothesis

The null hypothesis (H₀) is the assumption that there is no effect: the treatment makes no difference
The alternative hypothesis (H₁) is that there is an effect
You never prove H₁; you either reject H₀ (when your data is inconsistent with "no effect") or fail to reject it
Failing to reject H₀ does not mean the null is true—it may mean your study was not powerful enough to detect a real effect

p-Values

A p-value is the probability of observing data at least as extreme as what you observed, if the null hypothesis were true
It is not the probability that the null hypothesis is true
It is not the probability that your result is a false positive
It is not the probability that you will replicate
p < 0.05 means: if there were no effect, you would see a result this extreme or more extreme less than 5% of the time by chance
The 0.05 threshold is a convention from the 1920s, not a law of nature [Fisher1925]
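
The definition of a p-value can be made concrete with a permutation test, one of several ways to compute it. The sketch below assumes a two-group comparison of means; the function name and the defect counts are invented for illustration. If the null hypothesis is true, group labels are interchangeable, so the code shuffles the labels many times and counts how often the shuffled difference is at least as extreme as the observed one.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided p-value for a difference in means, estimated by label shuffling."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under H0 the group labels carry no information
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate from being exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical defect counts per participant (made-up numbers)
treatment = [3, 5, 4, 6, 2, 5, 4, 3]
control = [6, 7, 5, 8, 6, 4, 7, 6]
p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}")
```

The returned value is the fraction of "no effect" worlds that look at least as extreme as the data. It says nothing directly about the probability that the hypothesis itself is true.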

Effect Size

Statistical significance tells you whether the data are hard to reconcile with "no effect"; effect size tells you how large the effect is
Cohen's d measures the difference between two means in standard deviation units:
- d = 0.2: small effect
- d = 0.5: medium effect
- d = 0.8: large effect
- These benchmarks are rough guides, not thresholds
A study with thousands of participants can find statistically significant effects that are too small to matter in practice
Always report effect size alongside p-values; one without the other is incomplete
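
As a worked illustration, Cohen's d for two independent groups can be computed with the standard library alone. This sketch uses the pooled-standard-deviation form of d; the function name and the task times are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical task completion times in minutes (made-up numbers)
test_last = [52, 47, 60, 55, 49, 58]
test_first = [50, 45, 57, 54, 46, 55]
print(f"d = {cohens_d(test_last, test_first):.2f}")
```

The sign of d depends on the order of the arguments, so a report should state which group was subtracted from which.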

Statistical Power and Sample Size

Statistical power is the probability that your study will detect an effect if one actually exists
Power depends on: sample size, effect size, and the significance threshold (alpha)
A typical target is 80% power: you accept a 20% chance of a false negative
Most software engineering experiments are dramatically underpowered [Kampenes2007]: studies with 20-30 participants can only detect very large effects (d > 0.8)
The consequence of underpowering: either you miss real effects, or the effects you do detect are inflated (the "winner's curse")
Calculate required sample size before running the study, not after
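
A rough sample-size calculation takes only a few lines. The sketch below uses the normal approximation for a two-sided, two-group comparison of means, n = 2(z_α + z_β)²/d² per group, where d is the standardized effect size. The function names are hypothetical, and an exact t-based calculation gives slightly larger numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided threshold, about 1.96
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

def approx_power(d, n, alpha=0.05):
    """Approximate power with n participants per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(d * sqrt(n / 2) - z_alpha)

for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group")
print(f"power for d = 0.8 with 15 per group: {approx_power(0.8, 15):.2f}")
```

Under this approximation, a large effect needs roughly 25 participants per group, while a small one needs several hundred. That gap is why a 30-person study can only reliably detect large effects.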

Blocking Variables

Blocking controls a known nuisance variable by grouping experimental units before randomization, guaranteeing that its effect is distributed evenly across conditions
Randomization distributes unknown confounders; blocking handles known ones explicitly
Classic SE example: developer experience. Simple random assignment of 40 developers could, by chance, place most of the seniors in one group. Blocking first divides developers into experience strata, then randomizes within each stratum, so each condition receives the same share of every stratum
Benefit: removing the nuisance factor's variance from the error term increases statistical power without adding participants
In a within-subjects design, each participant applies both treatments to different tasks, acting as their own control
- Advantage: eliminates between-person variation entirely
- Disadvantage: vulnerable to order effects — doing a task second is different from doing it first, regardless of technique
Many SE experiments on inspection and testing techniques are implicitly within-subjects: each person uses multiple methods on multiple programs. Whether and how the blocks were analyzed matters for interpreting the results
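
Blocked randomization, as described in the developer-experience example above, can be sketched as follows. The function name, the developer records, and the 10-senior/30-junior split are assumptions made for illustration. The code groups participants by the blocking variable, then randomizes within each block.

```python
import random
from collections import defaultdict

def blocked_randomize(participants, block_of,
                      conditions=("treatment", "control"), seed=None):
    """Randomize within each block so every condition gets an even share of it."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for person in participants:
        blocks[block_of(person)].append(person)
    assignment = {}
    for block in sorted(blocks):           # sorted for a reproducible order
        members = blocks[block]
        rng.shuffle(members)
        # Random starting point so that odd-sized blocks do not always
        # give the extra person to the first condition.
        offset = rng.randrange(len(conditions))
        for i, person in enumerate(members):
            assignment[person] = conditions[(i + offset) % len(conditions)]
    return assignment

# Hypothetical sample: 10 senior and 30 junior developers
developers = [(f"dev{i:02d}", "senior" if i < 10 else "junior") for i in range(40)]
assignment = blocked_randomize(developers, block_of=lambda dev: dev[1], seed=7)

seniors_in_treatment = sum(
    1 for dev, cond in assignment.items()
    if dev[1] == "senior" and cond == "treatment"
)
print("seniors in treatment:", seniors_in_treatment)
```

If the design is blocked, the analysis should include the blocking factor as well. Otherwise the variance that blocking removed ends up back in the error term.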

Threats to Validity

Validity threats fall into four categories [Wohlin2000] [Campbell1963]:
Conclusion validity: did you use the right statistical procedure, and was the study powerful enough to detect an effect if one exists?
- Threats: underpowered design, violated test assumptions (e.g., applying a t-test to heavily skewed data), multiple comparisons without correction
Internal validity: can you conclude that the treatment caused the outcome?
- Threats: confounding variables, selection bias, maturation effects
- SE-specific behavioral threats, which arise because people are the experimental subjects:
  - Learning effect: participants improve on a later task through practice, making the later condition look better regardless of which technique it used
  - Novelty effect: any new tool or method gets a temporary boost simply because it is new; it is often discussed alongside the Hawthorne effect, in which behavior changes because participants know they are being studied
  - Boredom effect: performance declines as the experiment drags on; conditions applied later look worse
  - Unconscious formalization: participants apply methods more carefully than usual because they know they are being observed
Construct validity: does your measurement actually capture the concept you care about?
- Threats: using commit counts as a proxy for productivity; using a toy task to measure skills that only appear in long, complex work
External validity: do your findings generalize beyond your sample and setting?
- Threats: students are not professional developers; a 90-minute task is not a six-month project; one programming language is not all programming languages
Many SE experiments have strong internal validity but weak external validity: the causal claim holds for the participants and tasks studied, but may not transfer to professional settings
Pair programming research illustrates the problem precisely: most controlled experiments assigned students to pair or solo conditions on a small isolated task with a randomly assigned partner — none of which reflects how pair programming is used in industry. Practitioners have largely ignored these results. A grounded theory study of 60+ recorded sessions from a dozen companies found that what matters is the specific knowledge gap between partners — the variable the experiments controlled away [Sadowski2019]
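
One conclusion-validity threat listed above, multiple comparisons without correction, is easy to quantify. The sketch below assumes the tests are independent and that every null hypothesis is true. The function names are hypothetical, and Bonferroni is only one of several available corrections.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent true-null tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction."""
    threshold = alpha / len(p_values)    # split alpha evenly across the tests
    return [p <= threshold for p in p_values]

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests: {familywise_error_rate(k):.0%} chance of a false positive")

# Hypothetical p-values from three outcome measures in one study
print(bonferroni_significant([0.001, 0.02, 0.04]))
```

With ten uncorrected tests, the chance of at least one spurious "significant" result is about 40%. A study that measures many outcomes and reports only the significant ones is therefore hard to interpret.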

Triangulating Methods

Interruption research used three complementary approaches on the same phenomenon: controlled lab experiments (measured resumption lags precisely but under artificial conditions), cognitive models (predicted error rates without any participants), and observational studies (found that workers switch contexts every three minutes — far faster than any lab task) [Mark2008]
Each method answered questions the others could not: the lab studies established causal mechanisms; the cognitive models generated predictions without recruiting participants; the observational studies established that the phenomenon occurs at scale in real work
Good research programs use controlled experiments for causal claims and observational studies for ecological validity, not one or the other

Misconceptions

"p < 0.05 means the result is probably true." It means that if there were no effect, data this extreme would arise by chance less than 5% of the time. Whether the result reflects a real phenomenon also depends on the prior plausibility of the hypothesis and on whether the analysis was pre-specified, neither of which the p-value captures.
"A study using students is worthless for understanding professional developers." Student studies are not worthless; they have limited external validity. The appropriate response is to be specific about what the study can and cannot generalize to, not to dismiss it.
"Statistical significance means practical significance." An effect can be statistically significant (i.e., reliably detectable) while being too small to matter in practice. A 2% improvement in task completion time that requires retraining the whole organization is not a useful finding just because p < 0.001.
"Not finding a significant effect means the treatment doesn't work." Failing to reject the null hypothesis means the data are consistent with no effect; it does not mean no effect exists. An underpowered study will frequently miss real effects, and most software engineering experiments are underpowered.
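
The first misconception can be illustrated with a little arithmetic. The sketch below computes the share of significant results that reflect real effects, given an assumed prior probability that a tested hypothesis is true and an assumed power. The function name and the input values are illustrative, not estimates for any real field.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Fraction of significant results that correspond to real effects."""
    true_positives = power * prior          # real effects that get detected
    false_positives = alpha * (1 - prior)   # null effects that cross the threshold
    return true_positives / (true_positives + false_positives)

# Illustrative scenarios (assumed values)
for prior, power in [(0.5, 0.8), (0.1, 0.8), (0.1, 0.3)]:
    ppv = positive_predictive_value(prior, power)
    print(f"prior = {prior}, power = {power}: {ppv:.0%} of significant results are real")
```

When only one hypothesis in ten is true and power is low, fewer than half of the p < 0.05 results correspond to real effects, even though every one of them passed the same threshold.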

Check Understanding

A study reports p = 0.03. A colleague says "there's only a 3% chance this result is wrong." What is actually correct?

The correct interpretation is: if the null hypothesis were true (no effect), there would be a 3% chance of observing a result at least as extreme as this one by chance. It does not mean there is a 3% chance the result is a false positive—that probability depends on the prior probability that an effect exists, which the p-value does not capture. The colleague's statement is a common and consequential misinterpretation.

Why can't you conclude from a non-significant result (p > 0.05) that there is no effect?

A non-significant result means the data is consistent with the null hypothesis, not that the null hypothesis is true. The study may have been underpowered—too few participants to detect an effect that exists. Absence of evidence is not evidence of absence, especially in software engineering research where sample sizes are typically small.

A study of 500 developers finds that those using AI tools commit code 8% more frequently (p = 0.001, Cohen's d = 0.15). What should you conclude?

The effect is statistically significant but very small (d = 0.15 is below the conventional benchmark for a "small" effect). Large samples make small effects detectable, so significance here says little about importance. The practical significance is unclear: an 8% increase in commit frequency may or may not reflect any meaningful change in productivity, especially given the construct validity problems with using commits as a productivity proxy. If developers chose for themselves whether to use the tools, the difference may also reflect who adopts AI tools rather than what the tools do.

The following study design has a threat to internal validity. Identify it: "We introduced AI coding tools to our team in January and measured productivity (PRs merged) in February. Productivity increased 15%."

Without a control group, you cannot attribute the change to the AI tools. Many things change between January and February: team composition, project phase, sprint planning, seasonal effects. This is a pre-post design (before and after the intervention) with no control, which cannot rule out alternative explanations. A stronger design would use a control group (another team that did not receive the tool) to compare against.

Exercises

Power Calculation (15 minutes)

You want to study whether a new code review tool reduces defect escape rate. From prior data, you know the standard deviation of defect escape rate across teams is about 5 defects per release. You want to detect a difference of 3 defects per release (about a 60% reduction from a typical baseline of 5). Using the formula n = 2(z_α + z_β)²σ²/δ², where σ is the standard deviation and δ is the difference to detect, with a two-sided α = 0.05 (z_α = 1.96) and 80% power (z_β = 0.84), calculate the minimum number of teams per condition. What does this tell you about the feasibility of running this study?

Design Critique (20 minutes)

Read the methods section of Erdogmus et al. 2005 or the GitHub Copilot study (Peng et al. 2023). List three design choices the researchers made and for each: (a) explain why they made that choice, and (b) describe one threat to validity it does not address.

Build Your Own (20 minutes)

Design a controlled experiment to test one of the following claims. Specify: the treatment and control conditions, how you will assign participants, what you will measure (and how), a realistic sample size, and two threats to external validity.

Pair programming reduces defect rates
Code review with AI assistance is faster than code review without it
Developers who write tests first produce better-designed code