Non-Parametric Methods and Rank-Based Tests

Learning Goals

When to Distrust Parametric Tests

What Brown and Altadmri Found

Spearman Rank Correlation

Kruskal-Wallis Test

Bootstrap Resampling

When to Bootstrap vs. When to Use a Parametric Test

Code

"""Bootstrap confidence interval for Spearman correlation between educator rankings and Blackbox data."""

import numpy as np
import polars as pl
from scipy import stats

df = pl.read_csv("data/educator_rankings.csv")
blackbox = df["blackbox_rank"].to_numpy()

# Compute Spearman r for each educator
educator_cols = [c for c in df.columns if c.startswith("educator_")]
spearman_rs = [
    stats.spearmanr(df[col].to_numpy(), blackbox).statistic for col in educator_cols
]
print(f"Median Spearman r across educators: {np.median(spearman_rs):.3f}")

# Bootstrap 95% CI for median Spearman r
rng = np.random.default_rng(42)
n = len(spearman_rs)
boot_medians = [
    np.median(rng.choice(spearman_rs, size=n, replace=True)) for _ in range(1000)
]
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")

Check Understanding

What does Spearman rank correlation measure that Pearson correlation does not?

Spearman rank correlation measures whether the relationship between two variables is monotone: whether one tends to increase as the other increases, regardless of whether that relationship is linear. Pearson correlation measures the strength of a linear relationship specifically. A dataset where one variable grows as the square of another over positive values would show a Spearman r of 1 (because the trend is perfectly monotone) but a Pearson r below 1 (because the relationship is not linear). For ranked lists like the educator and Blackbox rankings, Spearman is the natural choice because ranks carry only order information, so the concept of "twice as far apart in rank" is not well-defined.
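The contrast between the two coefficients can be checked on a small synthetic example. This is a minimal sketch, not data from the study: the arrays x and y are made up for illustration, with y equal to the square of x over positive values.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): y grows as the square of x
# over positive x, so the relationship is monotone but not linear.
x = np.arange(1, 21, dtype=float)
y = x**2

# Both functions return a result whose first element is the correlation.
spearman_r = stats.spearmanr(x, y)[0]
pearson_r = stats.pearsonr(x, y)[0]

print(f"Spearman r: {spearman_r:.3f}")
print(f"Pearson r:  {pearson_r:.3f}")
```

Because the ranks of x and y are identical, Spearman r is 1, while Pearson r falls below 1 because a straight line cannot fit the curve exactly.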

The following bootstrap loop has a bug. What is wrong and how do you fix it?
boot_medians = []
for _ in range(1000):
    sample = np.random.choice(data, size=len(data), replace=False)  # resample for this iteration
    boot_medians.append(np.median(sample))

The call to np.random.choice passes replace=False. Without replacement, drawing len(data) values just returns a permutation of the original data, so every median is identical to the original median and the interval collapses to a single point. The whole point of bootstrap resampling is to sample with replacement so that different values appear different numbers of times, which simulates the variability you would see across repeated studies. NumPy's choice samples with replacement by default, but passing replace=True makes the intent explicit. The fix:

boot_medians = []
rng = np.random.default_rng(42)
for _ in range(1000):
    sample = rng.choice(data, size=len(data), replace=True)
    boot_medians.append(np.median(sample))

Using a generator created with np.random.default_rng rather than the legacy np.random.choice is also better practice: the seed is explicit and local to the generator, so results are reproducible without relying on global state.
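The effect of replacement can be seen directly by counting how many distinct medians each approach produces. This is a sketch on made-up data: the array data below is a hypothetical skewed sample, not the educator rankings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed sample (an assumption for illustration).
data = rng.exponential(scale=1.0, size=25)

# Without replacement, each draw of len(data) values is a permutation of data.
no_replace = [
    np.median(rng.choice(data, size=len(data), replace=False)) for _ in range(1000)
]

# With replacement, some values repeat and others drop out of each resample.
with_replace = [
    np.median(rng.choice(data, size=len(data), replace=True)) for _ in range(1000)
]

print(f"Distinct medians without replacement: {len(set(no_replace))}")
print(f"Distinct medians with replacement:    {len(set(with_replace))}")
```

The first list contains a single repeated value, the median of data itself, so it carries no information about sampling variability.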

Why is Kruskal-Wallis preferable to a one-way ANOVA for comparing error counts across three or more categories of programming mistakes?

One-way ANOVA assumes that observations within each group follow a normal distribution with equal variance across groups. Error counts are non-negative integers that are often heavily right-skewed: a few error types occur very frequently while most occur rarely. That distribution is not normal, and the variance across error categories is unlikely to be equal. Kruskal-Wallis does not assume normality. It ranks all observations together and tests whether the average rank differs by group, so it remains usable for skewed count data as long as the observations are independent.
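A minimal sketch of the test using scipy.stats.kruskal follows. The three arrays are simulated right-skewed counts, and their names, sizes, and distribution parameters are assumptions chosen for illustration rather than values from the Blackbox data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical error counts for three categories (assumed, not real data).
# Negative binomial draws are non-negative integers with a long right tail.
syntax_counts = rng.negative_binomial(n=2, p=0.10, size=40)
type_counts = rng.negative_binomial(n=2, p=0.20, size=40)
logic_counts = rng.negative_binomial(n=2, p=0.40, size=40)

# Kruskal-Wallis pools and ranks all observations, then compares
# the rank sums of the groups.
h_stat, p_value = stats.kruskal(syntax_counts, type_counts, logic_counts)

print(f"H statistic: {h_stat:.3f}")
print(f"p-value:     {p_value:.4g}")
```

A small p-value indicates that at least one group tends to have higher or lower ranks than the others. It does not say which one, so pairwise follow-up comparisons would still be needed.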

If the bootstrap 95% CI for median Spearman r across educators is [-0.1, 0.3], what does this tell you about whether educators predict student errors reliably?

A confidence interval that spans from negative to positive includes zero, which means the data are consistent with no relationship at all between educator rankings and the Blackbox data. The upper bound of 0.3 is a weak positive correlation even in the most optimistic reading. Taken together, this interval tells you that there is no reliable evidence that educators predict student error frequencies better than chance. Any individual educator who happens to have a positive r may simply be lucky; the uncertainty in the estimate is too large to conclude otherwise.

Exercises

Educators vs. Reality vs. Each Other

Test whether educators agree with each other more than they agree with reality. Compute the average Spearman r between all pairs of educators, and separately compute the average Spearman r between each educator and the Blackbox ranking. Report both averages. Write three sentences interpreting the comparison: what the relative magnitudes tell you about where educators' beliefs about novice mistakes come from, what this implies for the design of programming courses that aim to address the errors students most frequently make, and what additional data you would need to determine whether changing course content would actually reduce those errors.
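One possible starting point is sketched below on randomly generated rankings. The matrix layout (one column per educator, one row per error type) and the sizes n_errors and n_educators are assumptions. With the real dataset you would substitute the educator_ columns and blackbox_rank.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical sizes and random rankings (assumed for illustration only).
n_errors, n_educators = 18, 5
blackbox = rng.permutation(n_errors) + 1
educators = np.column_stack(
    [rng.permutation(n_errors) + 1 for _ in range(n_educators)]
)

# Spearman r for every unordered pair of educators.
pair_rs = [
    stats.spearmanr(educators[:, i], educators[:, j])[0]
    for i, j in combinations(range(n_educators), 2)
]

# Spearman r between each educator and the Blackbox ranking.
reality_rs = [
    stats.spearmanr(educators[:, i], blackbox)[0] for i in range(n_educators)
]

print(f"Mean educator-educator r: {np.mean(pair_rs):.3f}")
print(f"Mean educator-Blackbox r: {np.mean(reality_rs):.3f}")
```

With k educators there are k(k-1)/2 pairs, so the first average is taken over more values than the second.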

Kruskal-Wallis on Error Categories

Run a Kruskal-Wallis test on error counts grouped by error category, using at least three distinct categories (syntax errors, type errors, and logic errors) from the educator rankings dataset. Report the test statistic and p-value. Write two sentences explaining your choice of Kruskal-Wallis over one-way ANOVA for this particular data — be specific about the properties of error count data that make the ANOVA assumptions implausible.

Bootstrap CI for the Best Educator

Identify the educator whose ranking has the highest Spearman r with the Blackbox data. Compute a bootstrap 95% confidence interval for that correlation using 1,000 bootstrap samples. Report the interval. Write one sentence interpreting whether even the best-performing educator's rankings are reliably better than chance, given the width and location of the interval.
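Bootstrapping a single correlation differs from bootstrapping a median of correlations: the resampling unit is the error type, and each resample must keep the educator rank and the Blackbox rank for that error type paired. The sketch below illustrates this on simulated rankings. The names educator and blackbox, the number of error types, and the noise level are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical rankings (assumed): an educator ranking built as a noisy
# version of the Blackbox ranking, then converted back to ranks.
n_errors = 18
blackbox = np.arange(1, n_errors + 1)
educator = stats.rankdata(blackbox + rng.normal(scale=6.0, size=n_errors))

observed_r = stats.spearmanr(educator, blackbox)[0]

boot_rs = []
for _ in range(1000):
    # Draw row indices with replacement so each pair stays together.
    idx = rng.integers(0, n_errors, size=n_errors)
    boot_rs.append(stats.spearmanr(educator[idx], blackbox[idx])[0])

lo, hi = np.percentile(boot_rs, [2.5, 97.5])
print(f"Observed Spearman r: {observed_r:.3f}")
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

Resampling rows introduces tied ranks, which spearmanr handles by assigning average ranks. Note also that picking the best educator first and then computing the interval tends to flatter that educator, which is worth keeping in mind when interpreting the result.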

Visualizing Bootstrap Uncertainty

Plot the distribution of the 1,000 bootstrap median Spearman r values as a histogram using Altair. Add two vertical rules marking the 2.5th and 97.5th percentiles, using a distinct color or stroke pattern for each. Write one sentence explaining what the width of the distribution tells you about uncertainty in your estimate of the median Spearman r — specifically, whether you would reach the same conclusion if you had surveyed a different set of educators.

Systematic Over-Prediction

Identify the three error types that educators most consistently over-predict relative to the Blackbox data — that is, the three types where educator rankings systematically place the error as more common than the Blackbox frequency data show. Write two sentences suggesting why educators might systematically over-estimate the frequency of those particular errors: consider what gets emphasized in introductory teaching materials, what kinds of errors instructors see and remember, and what kinds of errors students fix quickly and therefore stop making.