Statistical Reference

This appendix summarizes the statistical concepts used in the lessons. It is not a statistics course. It is a reference for practitioners who need to decode a methods section without getting a degree first. Read it when a lesson references something you have forgotten, or use it to check whether a study you are evaluating applied the right tool.

Descriptive Statistics

The mean is the sum of values divided by the count; it is sensitive to outliers
The median is the middle value when sorted; use it when the distribution is skewed or when outliers are meaningful data points (e.g., response times, salaries)
The mode is the most frequent value; useful for categorical data
Variance is the average squared deviation from the mean; hard to interpret because it is in squared units
Standard deviation (SD) is the square root of variance; it is in the same units as the data
The interquartile range (IQR) is the distance between the 25th and 75th percentiles; more robust than SD when outliers are present
A box plot shows median, IQR, and outliers; it is more informative than a bar chart with error bars for most research purposes

Probability and Distributions

A probability distribution describes how likely each possible value is; discrete distributions apply to counts, continuous distributions to measurements
The normal distribution is symmetric and bell-shaped; it appears often because the central limit theorem says that averages of many independent random variables tend toward normal, regardless of the original distribution
Many things in software engineering are not normally distributed: task completion times, file sizes, defect counts, and developer productivity are typically right-skewed (long tail of large values)
When normality is violated, use non-parametric tests (Mann-Whitney U instead of t-test; Spearman's ρ instead of Pearson's r)
The distribution of a sample mean (the sampling distribution) is distinct from the distribution of individual observations; it has smaller spread (SD / √n) and is more nearly normal even when individual observations are not

Hypothesis Testing

The null hypothesis (H₀) is the claim that there is no effect; the alternative hypothesis (H₁) is that there is one
You test H₀ by asking: how likely is my data if H₀ were true?
A p-value is this probability; small p-values are evidence against H₀
Type I error (false positive, α): rejecting H₀ when it is true; conventional threshold α = 0.05
Type II error (false negative, β): failing to reject H₀ when H₁ is true; power = 1 − β
A confidence interval (CI) gives a range of values consistent with the data; a 95% CI means: if the study were repeated many times, 95% of intervals would contain the true value
- A CI that does not include zero is equivalent to p < 0.05 for a two-sided test at α = 0.05
- CIs are usually more informative than p-values alone because they show direction and magnitude, not just significance

Effect Sizes

Effect size measures the magnitude of an effect, independent of sample size
Cohen's d for comparing two means: d = (mean₁ − mean₂) / pooled SD
- d = 0.2: small; d = 0.5: medium; d = 0.8: large
- These benchmarks are rough guides from psychology; interpret in context
Odds ratio (OR): for binary outcomes, the ratio of odds in two groups; OR = 1 means no effect; OR > 1 means higher odds in the treatment group
Relative risk (RR): for binary outcomes, the ratio of proportions; more intuitive than OR but only valid for prospective studies
Pearson's r: correlation between two continuous variables; ranges from −1 to 1; r² is the proportion of variance in one variable explained by the other
Spearman's ρ (rho): rank-order correlation; use when the relationship is monotone but not linear, or when variables are ordinal

Statistical Power and Sample Size

Power is the probability of detecting a real effect; usually targeted at 0.80
Power increases with: larger sample size, larger true effect, higher α
The required sample size for a two-sample t-test is approximately: n = 2(z_α + z_β)²σ²/δ² where δ is the minimum detectable difference, σ is the population SD, z_α = 1.96 for α = 0.05, and z_β = 0.84 for 80% power
Most software engineering experiments are underpowered [Kampenes2007]: typical sample sizes of 20–50 participants can only detect large effects (d > 0.8) reliably
Post-hoc power analysis (calculating power after a null result) is generally uninformative and should be treated with skepticism

Multiple Comparisons

If you run k independent tests each at α = 0.05, the probability of at least one false positive is 1 − (0.95)^k
- k = 10: 40% chance of a false positive; k = 20: 64%
Bonferroni correction: use α/k as the threshold for each test; conservative (reduces power), but simple
False discovery rate (FDR, Benjamini-Hochberg): controls the expected proportion of false positives among rejected hypotheses; less conservative than Bonferroni for large numbers of tests
Exploratory analyses that test many hypotheses should be clearly labeled as exploratory; confirmatory claims require pre-registration or replication

Common Tests and When to Use Them

Test	When to use	Assumptions
Two-sample t-test	Compare means of two groups	Normality, approximately equal variance
Mann-Whitney U	Compare distributions of two groups	None (non-parametric)
Paired t-test	Compare means within pairs (before/after)	Paired differences are normal
Wilcoxon signed-rank	Paired comparison, non-parametric	Symmetric differences
Chi-square	Compare frequencies or proportions	Expected count ≥ 5 per cell
One-way ANOVA	Compare means across 3+ groups	Normality, equal variance
Kruskal-Wallis	Non-parametric ANOVA	None
Pearson correlation	Linear relationship between two continuous vars	Bivariate normality
Spearman correlation	Monotone relationship; ordinal or skewed data	None
Linear regression	Model continuous outcome from predictors	Linearity, normality of residuals, homoscedasticity
Logistic regression	Model binary outcome	Logistic relationship

Correlation vs. Causation

Correlation measures association; causation is a claim about mechanism
Three alternative explanations for a correlation between A and B:
- A causes B
- B causes A (reverse causation)
- C causes both A and B (confounding)
Randomization is the mechanism that makes causal claims defensible: random assignment distributes confounders evenly across conditions
Observational studies require additional assumptions to support causal claims (parallel trends for DiD, exclusion restriction for instrumental variables, no unmeasured confounders for regression)
Directed acyclic graphs (DAGs) help identify which variables need to be controlled and which should not be (controlling for a mediator blocks the causal path you want to estimate)