Confounds, Bias, and Threats to Validity

Learning Goals

Lesson

"""Apply Kalliamvakou criteria to classify GitHub repositories."""

import polars as pl
from scipy import stats
import datetime

df = pl.read_csv("data/github_repos.csv")
cutoff = (datetime.date.today() - datetime.timedelta(days=730)).isoformat()

not_real = df.filter(
    (pl.col("commits") < 5)
    | (pl.col("contributors") == 1)
    | (pl.col("last_commit") < cutoff)
)
print(
    f"Repositories failing Kalliamvakou criteria: {len(not_real)} / {len(df)} = {len(not_real) / len(df):.1%}"
)

r, p = stats.pearsonr(df["stars"].to_numpy(), df["commits"].to_numpy())
print(f"\nPearson r (stars vs. commits): {r:.3f}, p = {p:.2e}")

Check Understanding

What is the difference between a confound and selection bias? Give one example of each from SE research.

A confound is a variable that affects both the cause and the effect in a study, creating a spurious relationship between them. For example, in the Bird et al. study, file size affects both the number of contributors (big files attract more contributors) and the defect rate (big files have more defects), so any relationship between contributors and defects could be explained by size alone.

Selection bias occurs when the sample studied is not representative of the population of interest because of how participants or cases were selected. For example, studying only active GitHub repositories biases toward projects that already succeeded, omitting the failed projects that might reveal what causes failure.

The following code tries to filter repositories whose last commit was more than two years ago, but it does not work correctly. What is the bug and how do you fix it?
cutoff = datetime.datetime.today() - datetime.timedelta(730)
old = df.filter(pl.col("last_commit") < cutoff)
# last_commit column contains strings like "2023-04-15"

The bug is a type mismatch, and it has two parts. First, datetime.datetime.today() returns a datetime object, but last_commit contains ISO date strings. Comparing a string column to a datetime object will either raise an error or produce incorrect results, depending on how the library handles the comparison. Second, subtracting a timedelta from a datetime gives another datetime, so cutoff is never converted to text; you need an ISO-format string to compare against the string column. The fix is:

cutoff = (datetime.date.today() - datetime.timedelta(days=730)).isoformat()
old = df.filter(pl.col("last_commit") < cutoff)

This produces a string such as "2024-05-16". ISO dates are zero-padded and ordered year, month, day, so they sort lexicographically in the same order as chronologically, and the string comparison is correct.
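The claim about lexicographic order can be checked directly. The short sketch below uses three arbitrary example dates (not taken from the data set) and compares sorting ISO strings with sorting a month-first format, to show that the string comparison is only safe for the ISO layout.

```python
import datetime

# Arbitrary example dates, deliberately out of order.
dates = [
    datetime.date(2023, 4, 15),
    datetime.date(2022, 12, 1),
    datetime.date(2024, 1, 9),
]
chronological = sorted(dates)

# ISO format: "YYYY-MM-DD".
iso = [d.isoformat() for d in dates]
iso_matches = sorted(iso) == [d.isoformat() for d in chronological]

# Month-first format: "MM/DD/YYYY".
us_style = [d.strftime("%m/%d/%Y") for d in dates]
us_matches = sorted(us_style) == [d.strftime("%m/%d/%Y") for d in chronological]

print(f"ISO strings sort chronologically: {iso_matches}")
print(f"MM/DD/YYYY strings sort chronologically: {us_matches}")
```

If the column were stored in a format other than ISO, the safer route would be to parse it into real dates before comparing.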

Why does studying only active GitHub repositories introduce survivorship bias?

Survivorship bias occurs because the repositories you can observe are the ones that survived long enough to still exist and still be active. Projects that were abandoned early, deleted by their owners, or never gained any traction are absent from the sample. If you want to understand what makes projects succeed, you are missing the very data that would contrast success with failure. The effect is that your conclusions describe what active projects look like, not what causes a project to become active.
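A small simulation can make the missing contrast concrete. The sketch below assumes an invented population in which a hypothetical trait, has_ci, is present in 70% of projects and is completely independent of whether a project survives; the probabilities and field names are illustrative, not measurements from GitHub.

```python
import random

random.seed(1)
n = 20_000

# Assumed population: the trait and survival are drawn independently,
# so the trait has no effect on survival by construction.
projects = [
    {"has_ci": random.random() < 0.7, "survived": random.random() < 0.2}
    for _ in range(n)
]


def ci_share(rows):
    """Fraction of the given projects that have the trait."""
    return sum(r["has_ci"] for r in rows) / len(rows)


survivors = [p for p in projects if p["survived"]]
abandoned = [p for p in projects if not p["survived"]]

share_survivors = ci_share(survivors)
share_abandoned = ci_share(abandoned)

print(f"survivors with the trait: {share_survivors:.1%}")
print(f"abandoned with the trait: {share_abandoned:.1%}")
```

A researcher who sees only the survivors would observe that most of them have the trait and might credit it for their success. The abandoned group, which that researcher never sees, has the trait at about the same rate.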

A paper finds that repositories with more GitHub stars have higher code quality as measured by static analysis. Why can't you conclude that gaining stars causes quality to improve?

A correlation between stars and quality is consistent with at least three explanations. Stars might cause quality if developers clean up their code after it becomes popular. Quality might cause stars if well-written code attracts users. Or a third variable, such as the reputation and skill of the original author, might independently cause both high quality and high star counts. Without an experiment or a careful causal design, you cannot distinguish between these explanations. The correlation is real; the direction and mechanism are unknown.

Exercises

Threats to validity in a published claim

In a small group, review the following hypothetical finding: "Teams that conduct daily standups ship features 30% faster, based on a voluntary survey of 200 developers at a single company." Identify three specific threats to validity in this study design, name each threat using the vocabulary from this lesson, and describe one concrete change to the study design that would address each threat. For each change, estimate in one sentence how difficult that change would be to implement in practice.

Stars as a quality signal

He et al. found that millions of GitHub stars appear to be fake. Using the 200-repository sample, compute Pearson r between star counts and commit counts. Write one sentence interpreting the correlation you find. Write a second sentence explaining why a high correlation between stars and commits still does not make star count a reliable signal of project quality, given what He et al. found.

Controlling for a confound

Bird et al. controlled for file size when studying the relationship between ownership and defect rates. In the 200-repository sample, compute Pearson r between the number of commits and the number of contributors. Write two sentences explaining why this correlation matters when any study claims that either variable affects code quality. Then describe in one sentence the statistical approach you would use to study the effect of commits while controlling for contributors.

From correlation to causal claim

Furia et al. argue that observational data supports only correlational claims, not causal ones. Pick any finding from the first week of the course (for example, Python files tend to be longer than JavaScript files). Write one sentence stating the correlation. Write a second sentence proposing a plausible confound that could explain the correlation without any causal relationship between language choice and file length. Write a third sentence describing the study design you would need to rule out that confound.

Activity and star counts

Ait et al. found that many GitHub projects go inactive shortly after creation. In the 200-repository sample, classify each repository as active (last commit within 12 months) or inactive (last commit more than 12 months ago). Compute the mean star count for each group. Report whether active repositories have more stars on average. Write two sentences interpreting the result and explaining whether the difference, if any, implies that star count is a useful predictor of project activity.