Statistical Reference
This appendix summarizes the statistical concepts used in the lessons. It is not a statistics course. It is a reference for practitioners who need to decode a methods section without getting a degree first. Read it when a lesson references something you have forgotten, or use it to check whether a study you are evaluating applied the right tool.
Descriptive Statistics
- The mean is the sum of values divided by the count; it is sensitive to outliers
- The median is the middle value when sorted; use it when the distribution is skewed or when outliers are meaningful data points (e.g., response times, salaries)
- The mode is the most frequent value; useful for categorical data
- Variance is the average squared deviation from the mean; hard to interpret because it is in squared units
- Standard deviation (SD) is the square root of variance; it is in the same units as the data
- The interquartile range (IQR) is the distance between the 25th and 75th percentiles; more robust than SD when outliers are present
- A box plot shows median, IQR, and outliers; it is more informative than a bar chart with error bars for most research purposes
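
As a rough illustration of how these summaries behave on skewed data, the sketch below uses Python's standard `statistics` module. The `response_ms` values are invented for the example and do not come from any study.

```python
import statistics

# Hypothetical response times in milliseconds; the last value is an outlier.
response_ms = [120, 130, 125, 140, 135, 128, 132, 2500]

mean = statistics.mean(response_ms)
median = statistics.median(response_ms)

# pvariance/pstdev divide by n, matching "average squared deviation" above;
# statistics.variance/stdev divide by n - 1 (the usual sample estimates).
variance = statistics.pvariance(response_ms)
sd = statistics.pstdev(response_ms)

# quantiles(n=4) returns the three quartile cut points.
q1, _, q3 = statistics.quantiles(response_ms, n=4)
iqr = q3 - q1

print(f"mean={mean:.2f}  median={median:.1f}")
print(f"variance={variance:.1f}  sd={sd:.1f}  iqr={iqr:.2f}")
```

The single 2500 ms value pulls the mean to 426.25 while the median stays at 131, and the SD is far larger than the IQR for the same reason.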
Probability and Distributions
- A probability distribution describes how likely each possible value is; discrete distributions apply to counts, continuous distributions to measurements
- The normal distribution is symmetric and bell-shaped; it appears often because the central limit theorem says that averages of many independent random variables with finite variance tend toward normal, regardless of the shape of the original distribution
- Many things in software engineering are not normally distributed: task completion times, file sizes, defect counts, and developer productivity are typically right-skewed (long tail of large values)
- When normality is violated, use non-parametric tests (Mann-Whitney U instead of t-test; Spearman's ρ instead of Pearson's r)
- The distribution of a sample mean (the sampling distribution) is distinct from the distribution of individual observations; it has smaller spread (SD / √n) and is more nearly normal even when individual observations are not
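
The SD / √n relationship can be checked by simulation. The sketch below is only an illustration: the exponential distribution, the sample size `n`, the number of repetitions, and the seed are all arbitrary choices made for the example.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable

n = 30                    # observations per sample
num_samples = 5000        # how many samples to draw

# Exponential with rate 1 is right-skewed with mean 1 and SD 1.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

observed_spread = statistics.stdev(sample_means)
predicted_spread = 1.0 / n ** 0.5   # SD / sqrt(n) with SD = 1

print(f"observed SD of sample means:  {observed_spread:.3f}")
print(f"predicted SD / sqrt(n):       {predicted_spread:.3f}")
```

Although the individual observations are heavily skewed, the spread of the sample means should land close to the predicted value, and a histogram of `sample_means` would look roughly bell-shaped.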
Hypothesis Testing
- The null hypothesis (H₀) is the claim that there is no effect; the alternative hypothesis (H₁) is that there is one
- You test H₀ by asking: how likely is data at least as extreme as mine if H₀ were true?
- A p-value is this probability; small p-values are evidence against H₀. It is not the probability that H₀ is true
- Type I error (false positive, α): rejecting H₀ when it is true; conventional threshold α = 0.05
- Type II error (false negative, β): failing to reject H₀ when H₁ is true; power = 1 − β
- A confidence interval (CI) gives a range of values consistent with the data; a 95% CI means: if the study were repeated many times, 95% of intervals would contain the true value
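- For a difference between groups, a 95% CI that does not include zero is equivalent to p < 0.05 for a two-sided test at α = 0.05 (for ratio measures such as an odds ratio, the no-effect value is 1, not zero)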
- CIs are usually more informative than p-values alone because they show direction and magnitude, not just significance
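
The link between a CI and a p-value can be made concrete with a small calculation. The sketch below uses a normal approximation so that it needs only the standard library; for samples this small a t-distribution would normally be used instead, so read the numbers as illustrative. The function name `diff_in_means` and both data sets are hypothetical.

```python
from statistics import NormalDist, fmean, variance


def diff_in_means(a, b, alpha=0.05):
    """Normal-approximation CI and two-sided p-value for mean(a) - mean(b)."""
    diff = fmean(a) - fmean(b)
    # Standard error of the difference between two independent means.
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Hypothetical task completion times in minutes.
with_tool = [31, 28, 35, 30, 27, 33, 29, 32, 26, 34]
without_tool = [36, 39, 33, 41, 35, 38, 40, 34, 37, 42]

diff, (ci_low, ci_high), p = diff_in_means(with_tool, without_tool)
print(f"difference = {diff:.1f} min, 95% CI = ({ci_low:.2f}, {ci_high:.2f}), p = {p:.2g}")
```

The interval reports both the direction (negative, so faster with the tool) and the plausible size of the difference, which the p-value alone does not.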
Effect Sizes
- Effect size measures the magnitude of an effect, independent of sample size
- Cohen's d for comparing two means: d = (mean₁ − mean₂) / pooled SD
  - d = 0.2: small; d = 0.5: medium; d = 0.8: large
  - These benchmarks are rough guides from psychology; interpret in context
- Odds ratio (OR): for binary outcomes, the ratio of odds in two groups; OR = 1 means no effect; OR > 1 means higher odds in the treatment group
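- Relative risk (RR): for binary outcomes, the ratio of proportions; more intuitive than OR, but it cannot be estimated from case-control studies, where the share of cases is fixed by the sampling design; it requires a cohort study or an experiment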
- Pearson's r: correlation between two continuous variables; ranges from −1 to 1; r² is the proportion of variance in one variable explained by the other
- Spearman's ρ (rho): rank-order correlation; use when the relationship is monotone but not linear, or when variables are ordinal
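
The effect sizes above reduce to a few lines of arithmetic. The following is a minimal sketch using only the standard library; the function names and the example counts are assumptions made for illustration.

```python
from statistics import fmean, variance


def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / pooled_var ** 0.5


def odds_ratio(events_t, n_t, events_c, n_c):
    """Odds of the event in the treatment group divided by odds in control."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c


def relative_risk(events_t, n_t, events_c, n_c):
    """Proportion with the event in treatment divided by proportion in control."""
    return (events_t / n_t) / (events_c / n_c)


# Hypothetical counts: 30 of 100 treated and 15 of 100 controls had the outcome.
rr = relative_risk(30, 100, 15, 100)
odds = odds_ratio(30, 100, 15, 100)
print(f"RR = {rr:.2f}, OR = {odds:.2f}")
```

With these counts the relative risk is 2.0 while the odds ratio is larger, which is the usual pattern when the outcome is not rare and one reason the two should not be read interchangeably.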
Statistical Power and Sample Size
- Power is the probability of detecting a real effect; usually targeted at 0.80
- Power increases with: larger sample size, larger true effect, higher α
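- The required sample size per group for a two-sided, two-sample t-test is approximately: n = 2(z_α + z_β)²σ²/δ² where δ is the minimum detectable difference, σ is the population SD (assumed equal in both groups), z_α = 1.96 for two-sided α = 0.05 (the 97.5th percentile of the standard normal), and z_β = 0.84 for 80% power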
- Most software engineering experiments are underpowered [Kampenes2007]: typical sample sizes of 20–50 participants can only detect large effects (d > 0.8) reliably
- Post-hoc power analysis (calculating power after a null result) is generally uninformative and should be treated with skepticism
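
The sample size approximation can be evaluated directly. The sketch below implements the formula as stated, reading n as the size of each group; `n_per_group` is a hypothetical helper name, and because this is a normal approximation, dedicated power software may return slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)                             # round up to whole participants


# With sigma = 1, delta is the effect size expressed in SD units (Cohen's d).
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d, 1.0)} participants per group")
```

Under these assumptions a large effect (d = 0.8) needs about 25 participants per group, a medium effect about 63, and a small effect about 393, which is why studies with 20 to 50 participants in total can only detect large effects.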
Multiple Comparisons
- If you run k independent tests each at α = 0.05, the probability of at least one false positive is 1 − (0.95)^k
- k = 10: 40% chance of a false positive; k = 20: 64%
- Bonferroni correction: use α/k as the threshold for each test; conservative (reduces power), but simple
- False discovery rate (FDR, Benjamini-Hochberg): controls the expected proportion of false positives among rejected hypotheses; less conservative than Bonferroni for large numbers of tests
- Exploratory analyses that test many hypotheses should be clearly labeled as exploratory; confirmatory claims require pre-registration or replication
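
The family-wise error rate and both corrections can be sketched in a few lines. This is an illustrative implementation in plain Python, not a replacement for a vetted statistics library; the function names and the list of p-values are made up for the example.

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni(p_values, alpha=0.05):
    """Reject only the hypotheses whose p-value is at most alpha / k."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at most (rank / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that point.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight tests.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
print("family-wise error, k = 10:", round(familywise_error(10), 2))
print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

On this example Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects three, which shows the difference in conservatism described above.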
Common Tests and When to Use Them
| Test | When to use | Assumptions |
|---|---|---|
| Two-sample t-test | Compare means of two groups | Normality, approximately equal variance |
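| Mann-Whitney U | Compare distributions of two groups | Independent observations; no distributional form assumed (similar shapes needed to read the result as a difference in medians) |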
| Paired t-test | Compare means within pairs (before/after) | Paired differences are normal |
| Wilcoxon signed-rank | Paired comparison, non-parametric | Symmetric differences |
| Chi-square | Compare frequencies or proportions | Expected count ≥ 5 per cell |
| One-way ANOVA | Compare means across 3+ groups | Normality, equal variance |
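| Kruskal-Wallis | Compare distributions across 3+ groups (non-parametric alternative to one-way ANOVA) | Independent observations; no distributional form assumed |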
| Pearson correlation | Linear relationship between two continuous vars | Bivariate normality |
| Spearman correlation | Monotone relationship; ordinal or skewed data | None |
| Linear regression | Model continuous outcome from predictors | Linearity, normality of residuals, homoscedasticity |
| Logistic regression | Model binary outcome | Linearity in the log-odds, independent observations |
Correlation vs. Causation
- Correlation measures association; causation is a claim about mechanism
- Three possible explanations for a correlation between A and B (besides chance):
  - A causes B
  - B causes A (reverse causation)
  - C causes both A and B (confounding)
- Randomization is the mechanism that makes causal claims defensible: random assignment balances confounders, measured and unmeasured, across conditions in expectation
- Observational studies require additional assumptions to support causal claims (parallel trends for difference-in-differences (DiD), exclusion restriction for instrumental variables, no unmeasured confounders for regression)
- Directed acyclic graphs (DAGs) help identify which variables need to be controlled and which should not be (controlling for a mediator blocks the causal path you want to estimate)
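
Confounding is easy to reproduce in simulation. In the sketch below, a variable C drives both A and B while neither affects the other. All names, the seed, and the data-generating model are assumptions for illustration, and adjusting by taking residuals from a straight-line fit is only one simple way to control for a variable.

```python
import random
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def residuals(y, x):
    """What is left of y after removing a least-squares straight-line fit on x."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]     # confounder C
a = [ci + rng.gauss(0, 1) for ci in c]      # A depends on C, not on B
b = [ci + rng.gauss(0, 1) for ci in c]      # B depends on C, not on A

raw_r = pearson_r(a, b)
adjusted_r = pearson_r(residuals(a, c), residuals(b, c))
print(f"correlation of A and B:               {raw_r:.2f}")
print(f"correlation after adjusting for C:    {adjusted_r:.2f}")
```

A and B are clearly correlated even though neither causes the other, and the association largely disappears once C is accounted for. The same adjustment applied to a mediator would instead remove part of the effect you are trying to estimate.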