Study Design and Replication

Learning Goals

Types of Studies in SE Research

What Replication Means and Why It Matters

Pre-Registration

The Replication Crisis in Software Engineering

Observational Data and Causal Claims

Reading a Paper Critically

Code

"""Compare replication results to published Fucci et al. (2016) values."""

import polars as pl

published = pl.DataFrame(
    {
        "outcome": ["TESTS", "QLTY", "PROD"],
        "pub_p": [0.052, 0.380, 0.890],
        "pub_delta": [0.19, 0.12, 0.02],
    }
)

# Load your replication results (produced by tddlab/replicate_fucci.py).
# The CSV must contain the columns "outcome", "rep_p", and "rep_delta".
try:
    replicated = pl.read_csv("data/fucci_replication.csv")
    combined = published.join(replicated, on="outcome")
    combined = combined.with_columns(
        (pl.col("rep_p") - pl.col("pub_p")).abs().alias("p_diff"),
        (pl.col("rep_delta") - pl.col("pub_delta")).abs().alias("delta_diff"),
    )
    print(
        combined.select(
            [
                "outcome",
                "pub_p",
                "rep_p",
                "p_diff",
                "pub_delta",
                "rep_delta",
                "delta_diff",
            ]
        )
    )
except Exception as e:
    print(f"Could not load replication data: {e}")
    print("Published values:")
    print(published)

Check Understanding

What is the difference between a case study and a controlled experiment? Give one example of a research question each is well-suited to answer.

A controlled experiment manipulates an independent variable to measure its causal effect: participants are assigned to conditions, everything else is held constant (or randomized), and the outcome is compared across conditions. A case study examines one situation in depth without manipulating anything: it observes, interviews, and documents. A controlled experiment is well-suited to a question like "does pair programming reduce defect rates in a two-hour coding task?" because you can assign developers randomly to work alone or in pairs. A case study is well-suited to "how did Mozilla's code review culture evolve over the first decade of Firefox development?" because you cannot randomize history, and the goal is understanding context rather than isolating a single variable.

The following scenario describes a pre-registration problem. What is wrong and how would proper pre-registration have prevented it?
# Pre-registration exercise: after running the analysis, the researcher writes:
"We hypothesized that TDD would produce higher-quality code,
 which is confirmed by our finding of p = 0.04."
# What is wrong with this as a pre-registration?

Writing down the hypothesis after seeing the results is not pre-registration; it is HARKing (Hypothesizing After Results are Known). A finding of p = 0.04 is consistent with a genuine effect, but also with a false positive: at a 0.05 threshold, roughly one in twenty tests of a true null hypothesis comes out significant by chance. When the hypothesis is written after the data are examined, you cannot distinguish which situation you are in. Proper pre-registration would have required the researcher to file the hypothesis, the statistical test, and the significance threshold with a public registry before data collection began. That timestamp makes it impossible to claim a post-hoc observation as a predicted result.
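The "one in twenty" arithmetic gets worse when several outcomes are tested and only the significant one is reported. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name is illustrative and not part of the tutorial's code.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20):
    rate = familywise_error_rate(k)
    print(f"{k:>2} tests: P(at least one significant result) = {rate:.2f}")
# 1 test  -> 0.05
# 5 tests -> 0.23
# 20 tests -> 0.64
```

Under these assumptions, a researcher who tries twenty analyses and reports the one that "worked" has about a 64% chance of finding something significant even when no effect exists.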

Why do Furia et al. [Furia2023] argue that observational data in SE supports only correlation claims, not causal claims?

Observational data cannot establish causation because you cannot control for all the other variables that might explain the relationship. If you observe that projects with code review policies have fewer bugs, you cannot rule out that those projects also have senior developers, better test coverage, and more time for quality work — any of which could explain the lower bug count. Without randomization, you cannot separate the effect of code review from the effect of being the kind of organization that adopts code review. Furia et al. argue that SE researchers should be precise about this limit: observational findings support prediction (knowing X helps predict Y) but not intervention (changing X will change Y).
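The confounding argument can be made concrete with a small simulation. This is a sketch with invented numbers, not data from any study: team seniority drives both the adoption of code review and the bug count, while code review itself is given no effect at all.

```python
import random
from statistics import mean


def simulate_projects(n: int = 20000, seed: int = 0):
    """Generate (senior, has_review, bugs) tuples with seniority as a confounder."""
    rng = random.Random(seed)
    projects = []
    for _ in range(n):
        senior = rng.random() < 0.5
        # Senior teams are more likely to adopt code review (assumed rates).
        has_review = rng.random() < (0.8 if senior else 0.2)
        # Bug count depends only on seniority; review has NO causal effect here.
        bugs = rng.gauss(10.0 if senior else 20.0, 3.0)
        projects.append((senior, has_review, bugs))
    return projects


def mean_bugs(projects, *, has_review, senior=None):
    """Mean bug count for a review condition, optionally within one seniority stratum."""
    return mean(
        bugs
        for is_senior, reviewed, bugs in projects
        if reviewed == has_review and (senior is None or is_senior == senior)
    )


projects = simulate_projects()
overall_gap = mean_bugs(projects, has_review=True) - mean_bugs(projects, has_review=False)
print(f"Overall: review vs. no review = {overall_gap:+.1f} bugs")
for senior in (True, False):
    gap = mean_bugs(projects, has_review=True, senior=senior) - mean_bugs(
        projects, has_review=False, senior=senior
    )
    print(f"Senior={senior}: review vs. no review = {gap:+.1f} bugs")
```

The overall comparison shows reviewed projects with markedly fewer bugs, yet the gap nearly vanishes within each seniority stratum. Knowing that a project uses code review helps predict its bug count, but in this simulated world mandating code review would change nothing. In real data the confounders are rarely all known or measured, so stratifying cannot rescue a causal claim in general.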

What is HARKing, and why does pre-registration prevent it?

HARKing stands for Hypothesizing After Results are Known. It happens when a researcher runs an analysis, notices a significant result, and then writes the paper as if that result was the predicted outcome all along. HARKing is not always deliberate fraud; researchers often genuinely convince themselves they had predicted what they found. The problem is that a HARKed hypothesis has not been tested at all — the data were used to generate it, so they cannot independently confirm it. Pre-registration prevents HARKing by requiring the researcher to commit to hypotheses and analysis plans before seeing the data; any deviation from the plan must be disclosed as exploratory, not confirmatory.

Exercises

Paper Critique (Pairs Exercise)

Work with a partner. Each pair receives a different short SE empirical paper. Identify the research question, the independent variable, the dependent variable, the sample size, and the statistical method used. Find one methodological strength that the authors handle well and one validity threat they do not acknowledge. Pick one reported statistic and check whether it is internally consistent — for example, whether the reported mean and standard deviation are plausible for the reported sample size. Present your findings to the class in two minutes.
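One possible consistency check, offered here as an illustration rather than the required method, is the GRIM test: when the underlying data are integers (for example, defect counts or Likert responses), the reported mean must equal some integer total divided by the sample size. The sketch below assumes the mean was rounded to a known number of decimals; the function name and example values are hypothetical.

```python
import math


def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some integer total over n items rounds to the reported mean."""
    target = f"{reported_mean:.{decimals}f}"
    approx_total = reported_mean * n
    # Try the integer totals around mean * n; the extra margin absorbs float error.
    for total in range(math.floor(approx_total) - 1, math.ceil(approx_total) + 2):
        if f"{total / n:.{decimals}f}" == target:
            return True
    return False


print(grim_consistent(3.48, 25))  # True: 87 / 25 = 3.48
print(grim_consistent(3.47, 25))  # False: no integer total over 25 items gives 3.47
```

The check only has teeth when the sample is small relative to the reported precision (with two decimals, roughly fewer than 100 participants), and a failure may reflect a typo or an unreported exclusion rather than misconduct.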

Write a Pre-Registration

Write a two-paragraph pre-registration for the capstone study you will design in Lesson 18. The first paragraph must state your primary hypothesis precisely (naming independent and dependent variables), the statistical test you will use to evaluate it, and the minimum effect size you would consider practically meaningful. The second paragraph must describe your sample selection criteria, explain how you will handle missing data, and identify one analysis you will not run until after you have committed to these choices in writing.

Skills Needed to Replicate a Mining Study

Pizard et al. found that training students to replicate empirical SE studies made them better critics of new claims [Pizard2022]. Imagine you want to replicate a study that reports a Gini coefficient computed from mining software repositories. List three specific skills you would need to carry out that replication. For each skill, write one sentence explaining where in this tutorial you practiced it — cite the lesson number and the specific activity.
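As a reminder of what the statistic itself involves, here is a minimal sketch of one common Gini coefficient formula for non-negative values. The commit counts are invented, and a study being replicated may use a different variant (for example, a small-sample correction), so check the original paper's definition before comparing numbers.

```python
def gini(values: list[float]) -> float:
    """Gini coefficient of non-negative values: 0 = perfectly equal, near 1 = concentrated."""
    if not values or any(v < 0 for v in values):
        raise ValueError("values must be a non-empty list of non-negative numbers")
    total = sum(values)
    if total == 0:
        return 0.0  # nobody has anything, so treat the distribution as equal
    xs = sorted(values)
    n = len(xs)
    # Rank-weighted sum over the sorted values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical commits per contributor in two repositories.
print(gini([25, 25, 25, 25]))  # 0.0: everyone contributes equally
print(gini([0, 0, 0, 100]))    # 0.75: one person does all the work (maximum for n = 4)
```

With this formula the maximum for n values is (n - 1) / n, so Gini values from repositories with very different contributor counts are not directly comparable without a correction.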

Causal Language in Observational Studies

Furia et al. distinguish between predictive models (X predicts Y) and causal models (X causes Y). Find one claim from a paper covered in this tutorial — or from any paper you have read — that uses causal language but is based on observational data. Quote the sentence exactly as it appears in the paper. Then rewrite the sentence to accurately reflect what the observational data actually support, using language about association or prediction rather than causation.

Pre-Registered Replication Plan for Bug-Contributor Findings

A finding discussed in the threats lesson states that files with many contributors tend to have more bugs [Bird2011]. Write a four-sentence pre-registered study plan to test whether this finding replicates in a new dataset: the first sentence states your null hypothesis precisely; the second sentence states your sample selection criteria including what counts as a "contributor" and what counts as a "bug"; the third sentence names the statistical test you will use and explains why that test is appropriate for this kind of data; the fourth sentence states the effect size threshold below which you would consider the effect too small to be practically meaningful, and explains why you chose that threshold.
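If your plan names a rank-based test, it helps to know exactly what it computes. The sketch below implements Spearman's rank correlation with average ranks for ties, which matters because contributor and bug counts are small integers that tie often. It is one candidate test under the assumption that you expect a monotonic but not necessarily linear relationship; it is not the expected answer to the exercise, and the per-file counts are invented.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-file counts: number of contributors and number of bugs.
contributors = [1, 2, 2, 3, 5, 8]
bugs = [0, 1, 0, 2, 2, 6]
print(f"Spearman rho = {spearman_rho(contributors, bugs):.2f}")
```

A correlation coefficient doubles as an effect size, so the fourth sentence of your plan can state its threshold on the same scale as the test.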