Effect Size and Practical Significance

Learning Goals

Lesson

"""Cohen's d and Cliff's delta for Fucci TDD study outcomes."""

import numpy as np
import polars as pl


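def cohens_d(a, b):
    """Compute Cohen's d for two arrays using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    # Weight each sample variance (ddof=1) by its degrees of freedom,
    # so unequal group sizes are handled correctly.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)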


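def cliffs_delta(a, b):
    """Compute Cliff's delta for two arrays."""
    # Total number of (x, y) pairs with x from a and y from b.
    n = len(a) * len(b)
    # Count the pairs where a's value wins and where it loses; ties count for neither.
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / n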


# Load the study data and split the rows by development approach.
df = pl.read_csv("data/fucci_tdd.csv")
tdd = df.filter(pl.col("approach") == "TDD")
tld = df.filter(pl.col("approach") == "TLD")

# Report both effect sizes for each outcome, ignoring missing values.
for outcome in ["PROD", "QLTY", "TESTS"]:
    a = tdd[outcome].drop_nulls().to_numpy()
    b = tld[outcome].drop_nulls().to_numpy()
    d = cohens_d(a, b)
    delta = cliffs_delta(a, b)
    print(f"{outcome}: Cohen's d = {d:.3f}, Cliff's delta = {delta:.3f}")
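
Before pointing functions like these at real data, it can help to check them on arrays where the answer is known in advance. The sketch below is an illustration only: the arrays are made up, and both functions are declared again so that the block runs on its own. It assumes a sample-size-weighted pooled standard deviation for Cohen's d and a vectorized form of Cliff's delta.

```python
import numpy as np


def cohens_d(a, b):
    # Pooled sample variance, weighted by each group's degrees of freedom.
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def cliffs_delta(a, b):
    # Sign of every pairwise difference: +1 if x > y, -1 if x < y, 0 if tied.
    return np.sign(a[:, None] - b[None, :]).mean()


# Made-up values, chosen so the expected results are easy to reason about.
low = np.array([1.0, 2.0, 3.0, 4.0])
high = low + 10.0  # same spread, but every value exceeds every value in `low`

same_d = cohens_d(low, low)
same_delta = cliffs_delta(low, low)
shift_d = cohens_d(high, low)
shift_delta = cliffs_delta(high, low)

print(f"identical groups: d = {same_d:.3f}, delta = {same_delta:.3f}")
print(f"shifted groups:   d = {shift_d:.3f}, delta = {shift_delta:.3f}")
```

Identical groups should give zero for both measures, and two groups that do not overlap at all should give a Cliff's delta of exactly 1, its maximum.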

Check Understanding

A study finds p = 0.0001 with N = 500,000. Does this mean the effect is large? Explain.

No. With half a million observations, a t-test will detect effects so small they are meaningless in practice. A difference of 0.1 lines of code per function, or 0.002 extra defects per month, can produce p = 0.0001 at that sample size. The p-value tells you that a difference this large would be unlikely to arise by chance if there were no real difference, not that the difference is large enough to care about. You need an effect size (Cohen's d or Cliff's delta) or a confidence interval for the difference to judge practical significance.
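
A small simulation makes the point concrete. This is a minimal sketch under stated assumptions: the data are synthetic, the true shift of 0.02 standard deviations is a hypothetical value chosen for illustration, and the p-value uses the normal approximation to the t distribution, which is reasonable at this sample size.

```python
import math

import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

n_per_group = 250_000  # 500,000 observations in total
true_shift = 0.02      # hypothetical true difference, in standard deviations

a = rng.normal(loc=true_shift, scale=1.0, size=n_per_group)
b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)

# Cohen's d with the pooled sample standard deviation (equal group sizes).
pooled_std = math.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_std

# Two-sample t statistic for equal group sizes. With samples this large the
# t distribution is practically normal, so the two-sided p-value is taken
# from the normal tail.
t = d * math.sqrt(n_per_group / 2)
p = math.erfc(abs(t) / math.sqrt(2))

print(f"Cohen's d = {d:.4f}, t = {t:.2f}, p = {p:.2e}")
```

The printed d should be far below the 0.2 that Cohen's guidelines treat as small, even though the p-value is tiny.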

The following function computes something, but it is not Cohen's d. What is wrong and how do you fix it?

def cohens_d(group_a, group_b):
    return (group_a.mean() - group_b.mean()) / group_a.std()

The function divides by the standard deviation of group A alone, rather than the pooled standard deviation of both groups. If group A has a much smaller spread than group B, this inflates d; if it has a larger spread, it deflates d. The pooled standard deviation accounts for variability in both groups symmetrically, so swapping the group labels flips only the sign of the result, not its magnitude. The fix is:

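def cohens_d(group_a, group_b):
    # Use sample standard deviations (ddof=1) from both groups; averaging
    # the two variances assumes the groups are of similar size.
    pooled_std = np.sqrt((group_a.std(ddof=1)**2 + group_b.std(ddof=1)**2) / 2)
    return (group_a.mean() - group_b.mean()) / pooled_std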
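
The asymmetry is easy to see on a toy example. The sketch below uses made-up arrays and hypothetical function names (`d_single_sd`, `d_pooled`) to compare the flawed and pooled versions when the group labels are swapped.

```python
import numpy as np


def d_single_sd(group_a, group_b):
    # The flawed version: scales by group A's spread only.
    return (group_a.mean() - group_b.mean()) / group_a.std(ddof=1)


def d_pooled(group_a, group_b):
    # Averages the two sample variances (appropriate for equal group sizes).
    pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    return (group_a.mean() - group_b.mean()) / pooled_std


# Made-up data: `narrow` has a small spread, `wide` a large one.
narrow = np.array([9.0, 10.0, 11.0, 10.0, 10.0])
wide = np.array([2.0, 8.0, 14.0, 5.0, 11.0])

single_ab = d_single_sd(narrow, wide)
single_ba = d_single_sd(wide, narrow)
pooled_ab = d_pooled(narrow, wide)
pooled_ba = d_pooled(wide, narrow)

print(f"single-SD: {single_ab:.3f} vs {single_ba:.3f}")
print(f"pooled:    {pooled_ab:.3f} vs {pooled_ba:.3f}")
```

With the single-group version the magnitude changes when the labels are swapped; with the pooled version only the sign changes.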

What does a CLES of 0.55 mean in plain language?

If you picked one person at random from group A and one person at random from group B, the person from group A would have the higher value 55% of the time. That is only slightly better than a coin flip. While the difference between the groups may be statistically significant, a CLES of 0.55 means the groups overlap so much that knowing which group someone is in barely helps you predict their outcome. A CLES near 0.5 is one good way to communicate a small practical effect to a non-technical audience.
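
The "pick one from each group" description translates directly into code. The following is a minimal sketch of the common language effect size (CLES), not a reference implementation: the arrays are made up, and counting ties as half a win is an assumption (other conventions exist).

```python
import numpy as np


def cles(a, b):
    # Compare every value in a with every value in b via broadcasting.
    diff = a[:, None] - b[None, :]
    # Assumption: a tie counts as half a "win" for group A.
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def cliffs_delta(a, b):
    # Proportion of pairs where a wins minus proportion where b wins.
    diff = a[:, None] - b[None, :]
    return ((diff > 0).sum() - (diff < 0).sum()) / diff.size


# Made-up values for illustration.
a = np.array([3, 4, 5, 6])
b = np.array([1, 2, 5, 7])

cles_ab = cles(a, b)
delta_ab = cliffs_delta(a, b)
print(f"CLES = {cles_ab:.3f}, Cliff's delta = {delta_ab:.3f}")
```

Under this tie convention the two measures carry the same information: Cliff's delta equals 2 × CLES − 1, so a CLES of 0.55 corresponds to a delta of 0.10.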

Why is Cliff's delta preferable to Cohen's d for ordinal outcome variables?

Cohen's d is computed from means and standard deviations, which assume the data are at least interval-level (i.e., that equal differences on the scale represent equal real-world differences). Ordinal data do not satisfy this assumption: the difference between "strongly agree" and "agree" is not necessarily the same as the difference between "agree" and "neutral." Cliff's delta only asks, for each pair of values, which one is larger, so it requires that you can order the values, not that the distances between them are meaningful. It is also more robust to skewed distributions and outliers, which are common in software engineering survey data.
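
One way to see this is to recode an ordinal scale without changing its order. In the sketch below the survey responses and the alternative coding are hypothetical, chosen only for illustration: Cliff's delta is identical under both codings, while Cohen's d changes.

```python
import numpy as np


def cohens_d(a, b):
    # Averages the two sample variances (equal group sizes here).
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_std


def cliffs_delta(a, b):
    diff = a[:, None] - b[None, :]
    return ((diff > 0).sum() - (diff < 0).sum()) / diff.size


# Hypothetical Likert responses coded 1 (strongly disagree) to 5 (strongly agree).
group_a = np.array([4, 5, 3, 4, 5, 4])
group_b = np.array([3, 2, 4, 3, 3, 1])

# An alternative coding that keeps the order but stretches the top of the scale.
# Index = original code; index 0 is unused.
recode = np.array([0, 1, 2, 3, 10, 30])
a_recoded = recode[group_a]
b_recoded = recode[group_b]

d_original = cohens_d(group_a, group_b)
d_recoded = cohens_d(a_recoded, b_recoded)
delta_original = cliffs_delta(group_a, group_b)
delta_recoded = cliffs_delta(a_recoded, b_recoded)

print(f"Cohen's d:     {d_original:.3f} (original) vs {d_recoded:.3f} (recoded)")
print(f"Cliff's delta: {delta_original:.3f} (original) vs {delta_recoded:.3f} (recoded)")
```

Because the numeric codes attached to ordinal categories are arbitrary, a statistic that changes with the coding is measuring the coding as much as the data.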

Exercises

Cohen's d for Working Hours

Using the weekday and weekend programmer-hours data from the previous lesson, compute Cohen's d for the difference in hours worked. The t-test found p ≈ 10⁻³¹, which is about as small as p-values get. What is d? Based on Cohen's rough guidelines, how would you characterize the effect? Write two sentences directed at a manager considering a policy of discouraging weekend work: one based on the p-value and one based on Cohen's d, and explain which sentence is more useful for making that decision.

CLES for Working Hours

Compute the CLES for the weekday versus weekend programmer-hours comparison. Write exactly one sentence interpreting the result in plain English, using language that a project manager who has never taken a statistics course would understand. Then write one sentence explaining why CLES is easier to communicate to that audience than Cohen's d.

Large Effect, No Significance

Construct a synthetic dataset in Python where Cohen's d is approximately 0.8 (a large effect by Cohen's guidelines) but the t-test produces p > 0.05. Report the sample size you needed to make this happen. Explain in one sentence what this demonstrates about the relationship between effect size and statistical significance.

Effect Size for Prechelt Data

From the Prechelt data analyzed in the first lesson, pick one statistically significant p-value you found earlier (e.g., a comparison between two languages on work hours). Compute Cohen's d for that comparison. Write one sentence describing whether the effect size strengthens or weakens the claim that the difference is practically important for a team deciding which language to use.

Cliff's Delta Table for TDD Outcomes

Using the Fucci TDD dataset, compute Cliff's delta for all three outcome variables (TESTS, QLTY, and PROD). Present the results as a Polars dataframe with columns outcome, cliffs_delta, and label, where label is one of "negligible" (|delta| < 0.147), "small" (< 0.33), "medium" (< 0.474), or "large" (≥ 0.474). Write one sentence summarizing what the table implies for a development team considering whether to adopt TDD.