Comparing Two Groups
Learning Goals
- Choose between Student's t-test and Mann-Whitney U based on distribution shape
- Check normality using QQ plots and the Shapiro-Wilk test
- Interpret and report both test statistics appropriately
- Recognize when a QQ plot reveals departure from normality
Lesson
- The previous lesson introduced hypothesis testing; this one asks which test to use
- The answer depends on the shape of your data, not on what answer you want to get
- Student's t-test compares the means of two groups and assumes each group is approximately normally distributed
- Use scipy.stats.ttest_ind for independent groups
- When p rounds to 0.0, report it in scientific notation (e.g., p = 6.9 × 10⁻³¹); never write "p = 0"
- The t-statistic measures how many standard errors separate the two means
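To make "how many standard errors separate the two means" concrete, the sketch below computes the pooled-variance t-statistic by hand and compares it with scipy.stats.ttest_ind. The two samples are synthetic and exist only for illustration; they are not the lesson's data.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative samples (an assumption, not the lesson's dataset).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=7.0, scale=2.0, size=40)
group_b = rng.normal(loc=5.0, scale=2.0, size=35)

n_a, n_b = len(group_a), len(group_b)

# Student's t-test pools the two sample variances (ddof=1 is the unbiased estimate).
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# Standard error of the difference between the two means.
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# The t-statistic is the difference in means measured in standard errors.
t_manual = (group_a.mean() - group_b.mean()) / std_err

result = stats.ttest_ind(group_a, group_b)  # equal_var=True by default
print(f"manual t = {t_manual:.3f}, scipy t = {result.statistic:.3f}")

# The .2e format keeps a very small p-value readable instead of printing 0.00.
print(f"p = {result.pvalue:.2e}")
```

The hand-computed value and SciPy's statistic should agree up to floating-point error, because ttest_ind assumes equal variances unless told otherwise.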
- Mann-Whitney U (also called the Wilcoxon rank-sum test) compares two groups without assuming normality
- It tests whether values from one group tend to be larger than values from the other
- Use scipy.stats.mannwhitneyu with alternative="two-sided" unless you have a strong directional prediction
- SE data is rarely normally distributed, so Mann-Whitney is usually the safer default
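One way to see what "values from one group tend to be larger" means is to count pairs directly: U for the first group is the number of (x, y) pairs in which x beats y, with ties counting one half. The sketch below uses small made-up samples and assumes SciPy 1.7 or later, where the reported statistic corresponds to the first sample.

```python
import numpy as np
from scipy import stats

# Small made-up samples (hypothetical values for illustration only).
x = np.array([9.0, 7.5, 6.0, 8.0, 3.0])
y = np.array([2.0, 4.0, 3.0, 5.5])

# Compare every x with every y: a win counts 1, a tie counts 0.5.
wins = (x[:, None] > y[None, :]).sum()
ties = (x[:, None] == y[None, :]).sum()
u_by_counting = wins + 0.5 * ties

result = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"U by counting pairs = {u_by_counting}, scipy U = {result.statistic}")
```

Because the calculation only asks which value in each pair is larger, the sizes of the gaps between values never enter the statistic.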
- How to choose between the two: look at your data first
- If the distribution is roughly bell-shaped in each group, either test works
- If the distribution is skewed, has heavy tails, or is bounded (like hours worked or bug counts), use Mann-Whitney
- Never choose the test after looking at the p-values from both
- Checking normality with a QQ plot
- A QQ plot graphs each quantile of your data against the corresponding quantile of a normal distribution
- Points that fall on a straight diagonal line indicate the data is approximately normal
- Points that bend away from the diagonal at both ends in opposite directions (an S shape) indicate heavier or lighter tails than normal
- Points that bow to one side of the diagonal in a single arc indicate skewness
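The QQ points can be computed without drawing anything, which is handy for a quick check. The sketch below calls scipy.stats.probplot on two synthetic samples (assumptions chosen for illustration) and prints the correlation between the QQ points and their best-fit line. That correlation is only a rough summary of straightness, not a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "bell-shaped": rng.normal(loc=5.0, scale=1.0, size=500),  # synthetic, roughly normal
    "right-skewed": rng.exponential(scale=2.0, size=500),     # synthetic, strongly skewed
}

fit_r = {}
for name, sample in samples.items():
    # probplot returns the QQ points plus a least-squares line through them.
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        sample, dist="norm"
    )
    fit_r[name] = r
    print(f"{name}: correlation of QQ points with a straight line, r = {r:.4f}")
```

To see the plot itself, probplot also accepts a plot argument such as a Matplotlib Axes object; the printed correlation should not replace looking at the shape.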
- The Shapiro-Wilk test is a formal normality test, but use it with caution
- For small samples (N < 50), it has low power and may miss non-normality
- For large samples (N > 5,000), it almost always rejects normality because it detects tiny departures
- At large N, a "failed" Shapiro-Wilk test does not mean you must use Mann-Whitney; check the QQ plot too
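The sample-size sensitivity described above can be explored with a small simulation. The sketch below draws from one mildly skewed distribution (a gamma distribution with a large shape parameter, chosen here as an assumption) at two sample sizes and runs scipy.stats.shapiro on each draw. Exact p-values depend on the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)


def draw(n):
    """Return n synthetic values from a mildly right-skewed gamma distribution."""
    return rng.gamma(shape=20.0, scale=1.0, size=n)


p_values = {}
for n in (30, 4000):
    w_stat, p_value = stats.shapiro(draw(n))
    p_values[n] = p_value
    print(f"N = {n:>4}: W = {w_stat:.4f}, p = {p_value:.2e}")
```

The departure from normality is identical in both draws; only the amount of evidence changes, which is why the p-value alone should not decide which test to use.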
- Fucci et al. ran a quasi-experiment with 45 undergraduate students [Fucci2018]
- 23 students stayed awake all night before a programming task; 22 slept normally
- The sleep-deprived group produced lower-quality implementations
- This design varies exactly one factor (sleep) between the groups, which makes causal interpretation much more defensible
- Small N makes the choice of test consequential: with 22 or 23 per group, normality is hard to verify
- The code below loads programmer-hours data, splits it by day type, and runs both tests
"""Compare weekday vs. weekend programmer working hours."""
import polars as pl
from scipy import stats
df = pl.read_csv("data/programmer_hours.csv")
weekday = df.filter(pl.col("day_type") == "weekday")["hours"].to_numpy()
weekend = df.filter(pl.col("day_type") == "weekend")["hours"].to_numpy()
print(f"Weekday mean: {weekday.mean():.1f} hours")
print(f"Weekend mean: {weekend.mean():.1f} hours")
t_result = stats.ttest_ind(weekday, weekend)
print(f"\nt-test: t = {t_result.statistic:.1f}, p = {t_result.pvalue:.2e}")
mw_result = stats.mannwhitneyu(weekday, weekend, alternative="two-sided")
print(f"Mann-Whitney U: U = {mw_result.statistic:.0f}, p = {mw_result.pvalue:.2e}")
- Weekday mean ≈ 6.8 hours, weekend mean ≈ 3.2 hours; t ≈ 12.8, p ≈ 6.9 × 10⁻³¹
- The two tests should give similar conclusions when the sample is large and the difference is real
- When they disagree, that disagreement is itself informative: it suggests the normality assumption matters
Check Understanding
When should you use Mann-Whitney U instead of Student's t-test?
Use Mann-Whitney U when the data in one or both groups is not approximately normally distributed. In practice, this means skewed distributions, heavy tails, ordinal data, or any bounded measurement like hours worked or bug counts. SE data rarely follows a normal distribution, so Mann-Whitney is a reasonable default. You should also use it when sample sizes are small and you cannot verify normality, since the t-test is not robust to non-normality with small N.
The following code is supposed to filter out weekend days, but it contains a bug. What is wrong and how do you fix it?
weekday = df.filter(pl.col("day") != "Sat" and pl.col("day") != "Sun")
The and keyword in Python applies truthiness rules to each operand as a whole;
it does not work element-wise on Polars expressions. Boolean conditions on Polars
columns must be combined with & (the bitwise and operator), not and. Using and here
will raise an error or produce unexpected behavior. The fix is:
weekday = df.filter((pl.col("day") != "Sat") & (pl.col("day") != "Sun"))
Each condition must also be wrapped in parentheses because & has higher precedence
than !=, so without them Python would try to evaluate "Sat" & pl.col("day") first.
A Shapiro-Wilk test on 10,000 observations gives p = 0.001. Does this mean you must use Mann-Whitney? Explain.
Not necessarily. With 10,000 observations, the Shapiro-Wilk test is extremely sensitive and will reject normality for departures so small that they have no practical effect on the validity of a t-test. The t-test is robust to mild non-normality when sample sizes are large, because the central limit theorem ensures that sample means are approximately normally distributed regardless of the underlying distribution. The right response is to examine a QQ plot: if the points fall close to the diagonal with only minor deviations in the tails, the t-test is likely fine. If the QQ plot shows severe skewness or heavy tails, switch to Mann-Whitney.
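The central limit theorem argument can be checked with a short simulation. The sketch below uses a synthetic exponential population (an assumption chosen because it is strongly skewed) and compares the skewness of raw values with the skewness of means of samples of size 200.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Strongly right-skewed synthetic population (an exponential has skewness 2).
raw = rng.exponential(scale=3.0, size=10_000)

# Means of 2,000 independent samples of size 200 from the same distribution.
sample_means = rng.exponential(scale=3.0, size=(2_000, 200)).mean(axis=1)

raw_skew = stats.skew(raw)
means_skew = stats.skew(sample_means)
print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The sample means should be far closer to symmetric than the raw values, and the t-test depends on the distribution of the means. With much smaller samples the means keep more of the skew, which is why the same reasoning does not rescue a t-test at small N.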
What does a QQ plot with points curving away from the diagonal line in the tails indicate?
It indicates that the tails of the distribution differ from those of a normal distribution. When the points bend away from the line toward more extreme values, the tails are heavier than normality predicts, so extreme high and low values occur more often. This is common in SE data: a few files have thousands of lines while most have dozens. A t-test on such data puts substantial weight on those extreme values, which can distort the result. Mann-Whitney is more appropriate because it works on ranks rather than raw values, so it is not influenced by how extreme the extremes are.
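The claim about ranks can be tested directly: make the largest value far more extreme and see which statistic moves. The sketch below uses hypothetical file sizes invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical file sizes in lines of code for two groups of files.
group_a = np.array([12, 18, 25, 31, 40, 47, 55, 60, 300])
group_b = np.array([10, 15, 20, 22, 28, 35, 38, 42, 50])

# Same data, but the largest file in group A is far more extreme.
group_a_extreme = group_a.copy()
group_a_extreme[-1] = 30_000

results = {}
for label, a in [("original", group_a), ("extreme", group_a_extreme)]:
    t_res = stats.ttest_ind(a, group_b)
    u_res = stats.mannwhitneyu(a, group_b, alternative="two-sided")
    results[label] = (t_res.statistic, u_res.statistic, u_res.pvalue)
    print(f"{label}: t = {t_res.statistic:.2f}, U = {u_res.statistic:.1f}, "
          f"Mann-Whitney p = {u_res.pvalue:.3f}")
```

The largest value keeps the same rank in both versions, so U and its p-value are unchanged, while the t-statistic shifts because both the mean and the variance of group A change.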
Exercises
Normality Check for Hours Data
Run the code in weekday_weekend.py to reproduce the t-statistic (t ≈ 12.8, p ≈ 6.9 × 10⁻³¹).
Then run the Shapiro-Wilk test on each group separately using scipy.stats.shapiro.
Report the W statistic and p-value for each group. Given those results, was the t-test
appropriate? Does switching to Mann-Whitney U change the conclusion, and if so, in which
direction?
Choosing a Test Without Peeking
Suppose you are analyzing the Fucci et al. sleep deprivation data [Fucci2018], which has two groups of roughly 22 students each. Describe in plain language the exact sequence of steps you would take to decide whether to use a t-test or Mann-Whitney U test, without looking at p-values from either test first. Then write the Python code for that decision process: load the data, plot a QQ plot for each group, run Shapiro-Wilk, and print a recommendation.
Common-Language Effect Size
The common-language effect size (CLES) is the probability that a randomly chosen weekday observation is larger than a randomly chosen weekend observation. Compute it as the Mann-Whitney U statistic divided by the product of the two group sizes (n1 × n2). Write one sentence interpreting the result in plain language that a manager who has never taken a statistics course would understand.
QQ Plots for Both Groups
Generate QQ plots for the weekday and weekend groups using scipy.stats.probplot and
display them side by side. For each plot, describe in one sentence what the shape of the
points tells you about the distribution. Based on those descriptions, write a one-sentence
recommendation about which test is more appropriate for this data.
Four-Group Comparison
Split the programmer-hours data into four groups: Monday through Wednesday, Thursday
and Friday, Saturday, and Sunday. Run the Kruskal-Wallis test using scipy.stats.kruskal
and report the H statistic and p-value. Write one sentence explaining why you should use
Kruskal-Wallis rather than running a separate Mann-Whitney U test for each of the six pairs of groups,
and what mistake the separate-tests approach would make.