Comparing Two Groups

Learning Goals

Lesson

"""Compare weekday vs. weekend programmer working hours."""

import polars as pl
from scipy import stats

# Load the data and split the hours column by day type.
df = pl.read_csv("data/programmer_hours.csv")
weekday = df.filter(pl.col("day_type") == "weekday")["hours"].to_numpy()
weekend = df.filter(pl.col("day_type") == "weekend")["hours"].to_numpy()

print(f"Weekday mean: {weekday.mean():.1f} hours")
print(f"Weekend mean: {weekend.mean():.1f} hours")

# Student's t-test (ttest_ind assumes equal variances by default).
t_result = stats.ttest_ind(weekday, weekend)
print(f"\nt-test: t = {t_result.statistic:.1f}, p = {t_result.pvalue:.2e}")

# Mann-Whitney U: a rank-based alternative that does not assume normality.
mw_result = stats.mannwhitneyu(weekday, weekend, alternative="two-sided")
print(f"Mann-Whitney U: U = {mw_result.statistic:.0f}, p = {mw_result.pvalue:.2e}")

Check Understanding

When should you use Mann-Whitney U instead of Student's t-test?

Use Mann-Whitney U when the data in one or both groups is not approximately normally distributed. In practice, this means skewed distributions, heavy tails, ordinal data, or any bounded measurement like hours worked or bug counts. SE data rarely follows a normal distribution, so Mann-Whitney is a reasonable default. You should also use it when sample sizes are small and you cannot verify normality, since the t-test is not robust to non-normality with small N.

The following code is supposed to filter out weekend days, but it contains a bug. What is wrong and how do you fix it?
weekday = df.filter(pl.col("day") != "Sat" and pl.col("day") != "Sun")

The and keyword in Python reduces each operand to a single True or False using Python's truthiness rules; it does not work element-wise on Polars column expressions. With Polars, boolean conditions on columns must be combined with & (the bitwise and operator), not and. Using and here will raise an error or produce unexpected behavior. The fix is:

weekday = df.filter((pl.col("day") != "Sat") & (pl.col("day") != "Sun"))

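Each condition must also be wrapped in parentheses because & has higher precedence than !=. Without the parentheses, Python would group "Sat" & pl.col("day") first and then compare the result, which is not the intended filter.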
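To see how Python groups the two spellings, the standard-library ast module can parse them without running anything. This is only an illustrative sketch: the names a and b are hypothetical stand-ins for pl.col("day"), and describe is a helper invented for this example.

```python
import ast

# Hypothetical stand-ins: `a` and `b` play the role of pl.col("day").
unparenthesized = 'a != "Sat" & b != "Sun"'
parenthesized = '(a != "Sat") & (b != "Sun")'


def describe(source: str) -> str:
    """Summarize the top-level node Python builds for an expression."""
    node = ast.parse(source, mode="eval").body
    if isinstance(node, ast.Compare):
        # A chained comparison: left != c1 != c2 ...
        parts = [type(c).__name__ for c in node.comparators]
        return f"chained comparison, comparators: {parts}"
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitAnd):
        sides = [type(node.left).__name__, type(node.right).__name__]
        return f"bitwise and, operands: {sides}"
    return type(node).__name__


print(describe(unparenthesized))
print(describe(parenthesized))
```

In the unparenthesized form, the first comparator is a BinOp, meaning "Sat" & b was grouped before either comparison. In the parenthesized form, the top-level operation is the & itself, applied to two complete comparisons.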

A Shapiro-Wilk test on 10,000 observations gives p = 0.001. Does this mean you must use Mann-Whitney? Explain.

Not necessarily. With 10,000 observations, the Shapiro-Wilk test is extremely sensitive and will reject normality for departures so small that they have no practical effect on the validity of a t-test. The t-test is robust to mild non-normality when sample sizes are large, because the central limit theorem ensures that sample means are approximately normally distributed regardless of the underlying distribution. The right response is to examine a QQ plot: if the points fall close to the diagonal with only minor deviations in the tails, the t-test is likely fine. If the QQ plot shows severe skewness or heavy tails, switch to Mann-Whitney.
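The central limit theorem argument can be checked with a small simulation. The sketch below uses only the standard library and made-up exponential data (not the programmer-hours file); the sample size of 100, the 2,000 repetitions, and the skewness helper are all choices made for this illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Strongly right-skewed raw data (exponential; theoretical skewness is 2).
raw = [rng.expovariate(1.0) for _ in range(10_000)]

# Means of many samples of size 100 drawn from the same distribution.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(100))
    for _ in range(2_000)
]

raw_skew = skewness(raw)
means_skew = skewness(sample_means)
print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The raw values are far from symmetric, while the sample means are much closer to it. The t-test depends on the behavior of the means, which is why it tolerates mild non-normality at large N.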

What does a QQ plot with points curving away from the diagonal line in the tails indicate?

It usually indicates that the tails of the distribution are heavier than a normal distribution would produce: with theoretical quantiles on the horizontal axis, the points rise above the line at the right end and fall below it at the left end. (Points bending the opposite way indicate tails lighter than normal.) With heavy tails, the extreme high and low values occur more frequently than normality predicts. This is common in SE data: a few files have thousands of lines while most have dozens. A t-test on such data puts substantial weight on those extreme values, which can distort the result. Mann-Whitney is more appropriate because it works on ranks rather than raw values, so it is not influenced by how extreme the extremes are.
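The point about ranks can be illustrated with a hand-rolled U statistic on toy numbers. This is a sketch under stated assumptions: the file sizes are hypothetical, and mann_whitney_u is a plain pair-counting helper written for this example, not the SciPy implementation.

```python
def mann_whitney_u(x, y):
    """U for sample x: count pairs where x beats y; ties count one half."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )


def mean(values):
    return sum(values) / len(values)


# Hypothetical file sizes in lines; not taken from any real data set.
group_a = [40, 55, 60, 72, 90]
group_b = [35, 50, 58, 65, 3000]
group_b_extreme = [35, 50, 58, 65, 300_000]  # same ranks, bigger outlier

u_before = mann_whitney_u(group_a, group_b)
u_after = mann_whitney_u(group_a, group_b_extreme)

print(f"U with outlier 3000:    {u_before}")
print(f"U with outlier 300000:  {u_after}")
print(f"mean of group_b:         {mean(group_b):.1f}")
print(f"mean of group_b_extreme: {mean(group_b_extreme):.1f}")
```

Making the outlier a hundred times larger leaves U unchanged because the ordering of the values is the same, while the group mean (which the t-test compares) changes dramatically.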

Exercises

Normality Check for Hours Data

Run the code in weekday_weekend.py to reproduce the t-statistic (t ≈ 12.8, p ≈ 6.9 × 10⁻³¹). Then run the Shapiro-Wilk test on each group separately using scipy.stats.shapiro. Report the W statistic and p-value for each group. Given those results, was the t-test appropriate? Does switching to Mann-Whitney U change the conclusion, and if so, in which direction?

Choosing a Test Without Peeking

Suppose you are analyzing the Fucci et al. sleep deprivation data [Fucci2018], which has two groups of roughly 22 students each. Describe in plain language the exact sequence of steps you would take to decide whether to use a t-test or Mann-Whitney U test, without looking at p-values from either test first. Then write the Python code for that decision process: load the data, plot a QQ plot for each group, run Shapiro-Wilk, and print a recommendation.

Common-Language Effect Size

The common-language effect size (CLES) is the probability that a randomly chosen weekday observation is larger than a randomly chosen weekend observation, with ties counted as one half. Compute it as the Mann-Whitney U statistic for the weekday group divided by the product of the two group sizes (n1 × n2). In recent SciPy versions the statistic reported by mannwhitneyu refers to the first sample, so pass the weekday sample first. Write one sentence interpreting the result in plain language that a manager who has never taken a statistics course would understand.
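As a sanity check on the formula, the sketch below computes the CLES two ways on toy numbers: from U via the rank-sum formula, and by counting winning pairs directly. The hours are hypothetical and deliberately contain no ties, so plain ranks are enough; real data with ties would need average ranks.

```python
# Hypothetical hours; chosen without ties so plain ranks suffice.
weekday = [7.5, 8.0, 9.0, 6.5, 8.5]
weekend = [2.0, 7.0, 3.5, 8.2]

n1, n2 = len(weekday), len(weekend)

# Rank-sum route: U1 = R1 - n1 * (n1 + 1) / 2, where R1 is the sum of the
# weekday ranks in the pooled, sorted sample.
pooled = sorted(weekday + weekend)
rank = {value: i + 1 for i, value in enumerate(pooled)}
r1 = sum(rank[v] for v in weekday)
u1 = r1 - n1 * (n1 + 1) / 2

# Direct route: fraction of (weekday, weekend) pairs the weekday value wins.
wins = sum(1 for a in weekday for b in weekend if a > b)

cles_from_u = u1 / (n1 * n2)
cles_from_pairs = wins / (n1 * n2)

print(f"U for weekday group: {u1}")
print(f"CLES from U:         {cles_from_u}")
print(f"CLES from pairs:     {cles_from_pairs}")
```

Both routes give the same number, which is why U divided by n1 × n2 can be read directly as a probability.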

QQ Plots for Both Groups

Generate QQ plots for the weekday and weekend groups using scipy.stats.probplot and display them side by side. For each plot, describe in one sentence what the shape of the points tells you about the distribution. Based on those descriptions, write a one-sentence recommendation about which test is more appropriate for this data.

Four-Group Comparison

Split the programmer-hours data into four groups: Monday through Wednesday, Thursday and Friday, Saturday, and Sunday. Run the Kruskal-Wallis test using scipy.stats.kruskal and report the H statistic and p-value. Write one sentence explaining why you should use Kruskal-Wallis rather than running six separate pairwise Mann-Whitney U tests between the groups, and what mistake the separate-tests approach would make.