The Logic of Hypothesis Testing

Learning Goals

Lesson

"""Two-sample t-test comparing Python and JavaScript line lengths."""

import polars as pl
from scipy import stats

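# Load the line-length data; each CSV has file_id and line_length columns.
py = pl.read_csv("data/py_line_lengths.csv")
js = pl.read_csv("data/js_line_lengths.csv")

# Summarize each file by its median line length, so each file contributes one observation.
py_medians = py.group_by("file_id").agg(pl.col("line_length").median())["line_length"]
js_medians = js.group_by("file_id").agg(pl.col("line_length").median())["line_length"]

# Two-sample t-test on the per-file medians (ttest_ind assumes equal variances by default).
result = stats.ttest_ind(py_medians.to_numpy(), js_medians.to_numpy())
print(f"t-statistic: {result.statistic:.2f}")
print(f"p-value: {result.pvalue:.2e}")
print(f"95% CI: {result.confidence_interval(0.95)}")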

Check Understanding

What two common misinterpretations of the p-value should you always avoid?

The p-value is not the probability that the null hypothesis is true, and it is not the probability that your result is a fluke. It is the probability of observing data at least as extreme as yours, assuming the null hypothesis were true. Confusing these leads to conclusions like "there is only a 4% chance we are wrong," which is not what p = 0.04 means. The null hypothesis is either true or false; probability applies to your data given the null, not to the null given your data.
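
One way to make "data given the null" concrete is to simulate many experiments in which the null hypothesis is true by construction and look at how often p falls below 0.05. The sketch below is illustrative only: the group size, mean, standard deviation, and seed are arbitrary assumptions and do not come from the line-length dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only
n_experiments = 5000                  # assumed number of simulated studies
n_per_group = 50                      # assumed sample size per group

# Both groups are drawn from the same distribution, so the null is true.
a = rng.normal(loc=40.0, scale=10.0, size=(n_experiments, n_per_group))
b = rng.normal(loc=40.0, scale=10.0, size=(n_experiments, n_per_group))

# One t-test per row (axis=1) gives one p-value per simulated experiment.
p_values = stats.ttest_ind(a, b, axis=1).pvalue

fraction_significant = float(np.mean(p_values < 0.05))
print(f"Fraction with p < 0.05 under a true null: {fraction_significant:.3f}")
```

The fraction should land close to 0.05. That is all the threshold promises: how often data this extreme turns up when the null is true. It says nothing about how likely the null is in any one study.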

The following code runs 20 comparisons and reports the significant ones. What is wrong and how do you fix it?

results = []
for lang_pair in language_pairs:  # 20 pairs
    t, p = stats.ttest_ind(data[lang_pair[0]], data[lang_pair[1]])
    if p < 0.05:
        results.append((lang_pair, p))
print(f"Found {len(results)} significant differences!")

With 20 independent tests at a threshold of 0.05, you expect about one false positive (20 × 0.05 = 1) by chance alone, even when no real differences exist. The code reports every result that crosses 0.05 as a finding, so the chance of at least one false positive across the whole batch is far higher than 5%. Apply the Bonferroni correction by dividing the threshold by the number of tests:

threshold = 0.05 / len(language_pairs)
results = []
for lang_pair in language_pairs:
    t, p = stats.ttest_ind(data[lang_pair[0]], data[lang_pair[1]])
    if p < threshold:
        results.append((lang_pair, p))
print(f"Found {len(results)} significant differences (Bonferroni-corrected threshold: {threshold:.4f})")
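
The arithmetic behind this correction can be checked directly. The short sketch below assumes the 20 tests are independent, which is a simplification (pairwise comparisons that share a language are not), and the variable names are chosen here for illustration.

```python
alpha = 0.05
n_tests = 20

# Expected number of false positives when every null hypothesis is true.
expected_false_positives = alpha * n_tests

# Probability of at least one false positive across all tests (independence assumed).
fwer_uncorrected = 1 - (1 - alpha) ** n_tests
fwer_bonferroni = 1 - (1 - alpha / n_tests) ** n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive), uncorrected: {fwer_uncorrected:.3f}")
print(f"P(at least one false positive), Bonferroni:  {fwer_bonferroni:.3f}")
```

Without correction, the chance of at least one spurious "finding" is roughly 64%. With the Bonferroni threshold it drops back to just under 5%.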

Why does a very large sample size not automatically make a study more trustworthy?

With a large enough sample, a t-test will detect arbitrarily small differences as statistically significant. A study comparing two million lines of code might find that Python lines average 0.3 characters longer than JavaScript lines, with p < 0.0001, but a 0.3-character difference has no practical consequence for anyone. Statistical significance and practical importance are separate questions. Large samples also amplify small biases in data collection: if the Python files were sampled from different projects than the JavaScript files, that confound becomes a highly significant result.
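
The gap between statistical and practical significance can be seen in a small simulation. The sketch below uses synthetic data, not the real line-length files: the means (40.3 and 40.0), the standard deviation of 20, and the normal distribution are all assumptions chosen to mimic a 0.3-character difference across two million lines.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary seed
n = 1_000_000                        # lines per language, two million in total

# Synthetic line lengths with a true difference of 0.3 characters (assumed values).
py_lengths = rng.normal(loc=40.3, scale=20.0, size=n)
js_lengths = rng.normal(loc=40.0, scale=20.0, size=n)

result = stats.ttest_ind(py_lengths, js_lengths)

# Cohen's d: the difference in means expressed in pooled standard deviations.
pooled_sd = np.sqrt((py_lengths.var(ddof=1) + js_lengths.var(ddof=1)) / 2)
cohens_d = (py_lengths.mean() - js_lengths.mean()) / pooled_sd

print(f"p-value:   {result.pvalue:.2e}")
print(f"Cohen's d: {cohens_d:.3f}")
```

The p-value is tiny, yet the standardized effect size stays far below the 0.2 that is conventionally called "small". Reporting an effect size next to the p-value keeps the two questions separate.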

A study finds p = 0.04 with N = 8. Should you trust this finding? What would make you more confident?

A p-value of 0.04 with only 8 observations should be treated with skepticism. With such a small sample, statistical power is low, meaning the study would often miss real effects. Paradoxically, the effects that do clear the significance bar in small studies tend to be large overestimates of the true effect, because only unusually large sample differences can reach significance when N is this small. You should want to see a pre-registered replication with a larger sample (a conventional power analysis calls for about 64 per group to detect a medium effect, d = 0.5, with 80% power at a threshold of 0.05), a confidence interval for the effect size, and a plausible causal mechanism. A single small study with p just below 0.05 is weak evidence.
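
The overestimation effect can be illustrated with a simulation. The sketch below rests on stated assumptions: the 8 observations are split into two groups of 4, the data are normal with unit variance, and the true difference is 0.5 standard deviations. None of these values come from a real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)  # arbitrary seed
n_experiments = 20_000               # assumed number of simulated small studies
n_per_group = 4                      # N = 8 split evenly (assumption)
true_effect = 0.5                    # assumed true difference, in SD units

a = rng.normal(loc=true_effect, scale=1.0, size=(n_experiments, n_per_group))
b = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_group))

p_values = stats.ttest_ind(a, b, axis=1).pvalue
observed_diff = a.mean(axis=1) - b.mean(axis=1)

significant = p_values < 0.05
power = float(significant.mean())
mean_significant_effect = float(np.abs(observed_diff[significant]).mean())

print(f"Power (share of studies with p < 0.05): {power:.2f}")
print(f"True effect: {true_effect}")
print(f"Mean |observed effect| among significant studies: {mean_significant_effect:.2f}")
```

Most simulated studies miss the real effect, and the ones that reach significance report a difference well above the true 0.5. This is why a lone small study with p just under 0.05 deserves a replication before anyone acts on it.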

Exercises

Reproduce the t-statistic

Run the code in ttest.py and verify that you get t = -269.67 and p ≈ 0.0. Then filter both datasets to keep only lines between 2 and 200 characters (excluding blank lines and unusually long auto-generated lines) and re-run the test. Report whether the finding is still statistically significant and whether the t-statistic changes substantially. If you ran both the full-data test and the filtered test, should you apply a Bonferroni correction? Explain your reasoning.

Bootstrapped t-statistics

Write a loop that draws a random sample of 500 rows from each of the Python and JavaScript datasets, runs a two-sample t-test, and records the t-statistic. Repeat this 20 times. Plot the distribution of the 20 t-statistics as a histogram using Altair. Describe in one sentence whether the t-statistic is stable across subsamples and what that stability (or instability) implies about the robustness of the original finding.

Confidence interval interpretation

Using the full Python and JavaScript datasets, compute the 95% confidence interval for the difference in mean line lengths between the two languages. Write a two-sentence interpretation that a software team lead (not a statistician) could act on. Then explain in one sentence why the confidence interval conveys more information than a bare p-value for this kind of decision.

Bonferroni in practice

A colleague tested all pairwise combinations of 6 programming languages (15 pairs total) and found 5 results significant at p < 0.05. Apply the Bonferroni correction and report how many of the 5 survive at the corrected threshold. Write one sentence explaining to the colleague why applying the correction is not optional when testing multiple hypotheses on the same dataset.

Pairwise test function

Write a function called pairwise_ttests that accepts a Polars dataframe with columns language and line_length and returns a Polars dataframe with one row per pair of languages, containing columns lang_a, lang_b, t_statistic, and p_value. Apply it to at least three language pairs from the line-length dataset and display the result sorted by p-value. Verify that the t-statistics match what you get when you call stats.ttest_ind directly on each pair.