Observational Studies and Natural Experiments

In 2014, Eirini Kalliamvakou and her colleagues published a paper with a title that tells you everything you need to know: "The Promises and Perils of Mining GitHub" [Kalliamvakou2014]. They systematically documented the ways in which GitHub repository data is not a representative sample of software development—inactive repositories, personal experiments, class assignments, and mirrored projects all appear in the data alongside real production software. The paper did not say "don't use GitHub data." It said: be specific about what you are measuring, because the data does not mean what you think it means.

What Observational Studies Are

Mining Software Repositories

Bias Types

The Control-Validity Trade-off

Quasi-Experimental Designs

Natural Experiments

Grounded Theory

Field Study Design

Correlation and Causation

Misconceptions

A large enough sample turns correlation into causation.
Sample size increases the precision of an estimate, not its causal interpretation. An observational study with a million data points is still observational: you have not manipulated anything, and confounders remain.
GitHub data represents software development.
Open-source projects on GitHub are a small, self-selected slice of the software that actually gets built. Commercial software, government systems, and embedded code are largely invisible to studies that mine public repositories.
Including a confounder as a control variable in a regression removes its effect.
You can only control for confounders you have measured and measured accurately. Unmeasured confounders, and those measured with error, continue to bias your estimates even after "controlling for" them.
Natural experiments are as reliable as randomized controlled trials.
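Natural experiments require assumptions, such as parallel trends in difference-in-differences, that cannot be fully verified from the data. Pre-intervention trends can be checked, but the trend the treated group would have followed without the intervention is never observed. Natural experiments support stronger causal claims than raw observational comparisons but weaker ones than true randomization.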
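The first and third misconceptions can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions, not an analysis of real data: the variable names (`resources`, `releases`, `closures`, `proxy`) are hypothetical, every variable is standard normal noise, and the true effect of release frequency on issue closures is set to zero by construction.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope from regressing y on a single predictor x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    b = slope(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return [(yi - mean_y) - b * (xi - mean_x) for xi, yi in zip(x, y)]


def adjusted_slope(x, y, control):
    """Slope of y on x after controlling for one variable.

    Both x and y are residualized on the control, then regressed on each other,
    which matches the coefficient on x in a regression of y on x and the control.
    """
    return slope(residuals(control, x), residuals(control, y))


def simulate(n, seed=0):
    """Hypothetical data: team resources drive both releases and issue closures."""
    rng = random.Random(seed)
    resources = [rng.gauss(0, 1) for _ in range(n)]       # the confounder
    releases = [r + rng.gauss(0, 1) for r in resources]   # the supposed cause
    # The outcome depends on resources only: releases have NO causal effect.
    closures = [r + rng.gauss(0, 1) for r in resources]
    # A noisy measurement of the confounder (assumed measurement error).
    proxy = [r + rng.gauss(0, 1) for r in resources]
    return {
        "naive": slope(releases, closures),
        "adjusted_true": adjusted_slope(releases, closures, resources),
        "adjusted_noisy": adjusted_slope(releases, closures, proxy),
    }


for n in (1_000, 20_000):
    estimates = simulate(n)
    print(n, {name: round(value, 2) for name, value in estimates.items()})
```

Under these assumed unit variances, the naive slope settles near 0.5 however large `n` gets, the slope adjusted for the true confounder settles near 0, and the slope adjusted for the noisy proxy settles near 1/3. More data narrows the estimate around the biased value; it does not move it toward the truth.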

Check Understanding

What is the difference between an observational study and a controlled experiment?

In a controlled experiment, the researcher manipulates the independent variable by assigning participants to conditions (treatment and control), ideally at random. Random assignment supports causal claims because it balances confounding variables, measured and unmeasured, across conditions on average. In an observational study, the researcher measures the world as it is without manipulation. Observational studies can detect correlations and, with care, support causal inference, but they require additional assumptions (like parallel trends in a difference-in-differences design) that randomized experiments do not.

A study mines GitHub data and finds that projects with more frequent releases have fewer open issues. A blog post says "releasing frequently reduces issue backlogs." What is wrong with this interpretation?

The study shows a correlation, not causation. Several alternative explanations are also consistent with the data: projects with better-resourced teams might both release more often and close issues faster (confounding); projects that close issues might release more often as a result (reverse causation); the projects might differ in type or maturity in ways that explain both patterns. Without randomization or a natural experiment, neither the existence nor the direction of a causal effect can be established from this data.

What is survivorship bias, and why is it a particular concern when studying software project success using GitHub data?

Survivorship bias occurs when you analyze only the cases that survived a selection process and miss the ones that failed. On GitHub, abandoned projects, failed startups, and discontinued tools are either deleted or go silent. A study of "successful open-source projects" based on currently active repositories excludes all the projects that tried the same practices and failed. This makes the practices of surviving projects look more predictive of success than they are.
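A short simulation makes the point concrete. This is a sketch with invented numbers: the adoption rate, the survival rate, and the idea of a single yes/no "practice" are all assumptions, and survival is drawn independently of the practice so that the practice has no effect at all.

```python
import random


def simulate_projects(n, adoption_rate=0.7, survival_rate=0.2, seed=0):
    """Return a list of (adopted_practice, survived) pairs for n hypothetical projects."""
    rng = random.Random(seed)
    projects = []
    for _ in range(n):
        adopted = rng.random() < adoption_rate
        # Survival is drawn independently: the practice does not help or hurt.
        survived = rng.random() < survival_rate
        projects.append((adopted, survived))
    return projects


def adoption_share(projects, survived):
    """Share of projects using the practice among survivors (or among failures)."""
    group = [adopted for adopted, alive in projects if alive == survived]
    return sum(group) / len(group)


projects = simulate_projects(50_000)
print("adoption among survivors:", round(adoption_share(projects, True), 2))
print("adoption among failures: ", round(adoption_share(projects, False), 2))
```

A study that sees only the survivors would report that roughly 70% of successful projects use the practice, which sounds like evidence for it. The failures, which such a study never observes, adopted it at about the same rate.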

A company introduces mandatory code review for Team A in January. Team B does not change its practices. Both teams' defect rates decline over the year. Explain how a difference-in-differences analysis would determine whether code review had an effect.

DiD compares the change in Team A's defect rate to the change in Team B's defect rate over the same period. If both teams declined by the same amount, that suggests the decline was due to factors affecting both teams (e.g., improved tooling, team learning). If Team A declined more than Team B, the excess decline is attributed to the code review intervention—under the assumption that without the intervention, both teams would have changed similarly (the parallel trends assumption).
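The arithmetic behind this answer is small enough to write out. The sketch below uses invented defect rates (defects per release, say) purely for illustration; the function name `did_estimate` and the numbers are assumptions, not results from any study.

```python
def did_estimate(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences: change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical defect rates: Team A adopts code review, Team B does not.
team_a_before, team_a_after = 10.0, 6.0   # Team A changes by -4
team_b_before, team_b_after = 10.0, 8.0   # Team B changes by -2

effect = did_estimate(team_a_before, team_a_after, team_b_before, team_b_after)
print(effect)  # -2.0: the part of Team A's decline not shared by Team B
```

Of Team A's decline of 4, a decline of 2 is matched by Team B and is attributed to shared factors; the remaining 2 is the estimated effect of code review, and it is only as credible as the parallel trends assumption.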

Exercises

Spot the Bias (15 minutes)

Read the abstract and methods section of a paper that uses GitHub data to study software development practices (any paper from the MSR conference proceedings will work). Identify one specific example of selection bias, survivorship bias, or confounding that the paper either acknowledges or does not address. Write two sentences explaining how it affects the conclusions.

Design a Natural Experiment (20 minutes)

A large technology company is rolling out an AI coding assistant to its engineering teams over six months—some teams will get it first, others later, based on the order in which their managers signed up. Design an observational study that treats this rollout as a natural experiment. Specify: what you would measure, how you would use the staggered rollout as a source of variation, one threat to the parallel trends assumption, and one data source you would need access to.

Causal Diagram (15 minutes)

Draw a directed acyclic graph (DAG) for the following claim: "Using AI coding tools increases developer satisfaction, which reduces turnover." Add at least two plausible confounding variables. For each confounder, explain what you would need to control for in an observational analysis to isolate the effect of AI tool use on turnover.