Reading and Critiquing the Literature

Learning Goals

Finding Papers

Reading Order

Identifying Statistical Red Flags

The Gap Between Research and Practice

Check Understanding

What is wrong with a paper that reports N < 30 with a parametric test but no normality check? What should the authors have done instead?

Parametric tests such as t-tests and ANOVA assume that the data come from an approximately normally distributed population. With small samples, you cannot reliably verify this assumption from the data itself, and the central limit theorem offers little protection. Running the test anyway produces p-values and confidence intervals that may be badly wrong if the distribution is skewed or has heavy tails. The authors should have done one of two things. They could have reported a normality test (Shapiro-Wilk is standard for small samples) and justified the parametric test if the result was non-significant, bearing in mind that a normality test on a small sample has little power to detect departures from normality. Alternatively, they could have used a non-parametric test that does not require the normality assumption, such as the Mann-Whitney U test for comparing two independent groups.
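To make the non-parametric option concrete, the following is a minimal sketch of an exact, permutation-based Mann-Whitney U test using only the Python standard library. The function names and the two small samples are hypothetical, chosen for illustration. In practice an established statistics library would normally be used instead; the enumeration here is only practical for small groups.

```python
from itertools import combinations


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_exact(a, b):
    """Return (U for sample a, exact two-sided p-value).

    The p-value is the share of all possible splits of the pooled ranks
    whose U is at least as far from its null expectation as the observed U.
    """
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))

    def u_from(rank_sum):
        return rank_sum - n1 * (n1 + 1) / 2

    u_observed = u_from(sum(ranks[:n1]))
    u_expected = n1 * n2 / 2
    extreme = 0
    total = 0
    for chosen in combinations(range(n1 + n2), n1):
        u = u_from(sum(ranks[i] for i in chosen))
        total += 1
        if abs(u - u_expected) >= abs(u_observed - u_expected) - 1e-9:
            extreme += 1
    return u_observed, extreme / total


# Hypothetical review times in minutes for two small groups of developers.
with_tool = [12, 15, 11, 19, 14]
without_tool = [22, 17, 25, 30, 21]
u, p = mann_whitney_exact(with_tool, without_tool)
print(f"U = {u}, exact two-sided p = {p:.4f}")
```

The test works on ranks rather than raw values, which is why it does not depend on the shape of the underlying distribution.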

A paper reports mean = 15.2, standard deviation = 4.3, N = 50, and a 95% confidence interval of [14.0, 16.4]. Is this confidence interval correct? If not, what is the correct value?

The reported interval is correct. The standard large-sample formula for a 95% confidence interval is mean ± 1.96 × (standard deviation / sqrt(N)). Plugging in the numbers: 1.96 × (4.3 / sqrt(50)) = 1.96 × (4.3 / 7.07) = 1.96 × 0.608 = 1.19. The interval is therefore approximately [15.2 − 1.19, 15.2 + 1.19] = [14.01, 16.39], which rounds to the reported [14.0, 16.4]. Using the t critical value for 49 degrees of freedom (about 2.01) instead of 1.96 widens the interval only slightly, and it still rounds to the same values. The question is checking whether you can reproduce a reported interval from the other reported numbers. If a paper reported [14.5, 15.9] instead, that would be inconsistent with the stated standard deviation and sample size, and worth flagging.
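The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the normal approximation (z = 1.96); the function names and the rounding tolerance of 0.05 are assumptions made for illustration, not a standard.

```python
import math


def ci95_normal(mean, sd, n, z=1.96):
    """Approximate 95% confidence interval for a mean (normal approximation)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


def matches_reported(reported, mean, sd, n, tol=0.05):
    """True if both reported bounds are within tol of the recomputed bounds.

    tol = 0.05 allows for a paper rounding its bounds to one decimal place.
    """
    low, high = ci95_normal(mean, sd, n)
    return abs(reported[0] - low) <= tol and abs(reported[1] - high) <= tol


low, high = ci95_normal(15.2, 4.3, 50)
print(f"[{low:.2f}, {high:.2f}]")                         # [14.01, 16.39]
print(matches_reported((14.0, 16.4), 15.2, 4.3, 50))      # True
print(matches_reported((14.5, 15.9), 15.2, 4.3, 50))      # False
```

For small samples, a t critical value in place of 1.96 would give a wider and more appropriate interval.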

What does it mean for a sample to be "self-selected"? Give an example from SE research where self-selection would distort results.

A self-selected sample is one where the people or projects in the study chose to participate rather than being randomly assigned or randomly drawn from a population. In SE research, a common example is a study of open-source project quality that uses only projects that opted into a code quality tool's analysis service. Projects that choose to use a quality tool are probably already more quality-conscious than projects that do not; the study therefore overestimates quality in the population of open-source projects as a whole. Any result from that study — "median technical debt is X" — does not generalize to projects that ignored the tool.
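A small simulation can illustrate the direction of this bias. Everything in the sketch below is a made-up assumption: the "quality score" scale, its distribution, and the rule that higher-quality projects are more likely to opt in. It shows the mechanism only and says nothing about real projects.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: one quality score per open-source project.
population = [random.gauss(50, 10) for _ in range(10_000)]


def opts_in(score):
    """Assumed opt-in rule: the chance of opting in rises with the score."""
    probability = min(1.0, max(0.0, (score - 20) / 60))
    return random.random() < probability


# Self-selected sample: only the projects that chose to use the tool.
volunteers = [score for score in population if opts_in(score)]

# Random sample of the same size, drawn without regard to score.
random_sample = random.sample(population, len(volunteers))

print(f"population mean:    {statistics.mean(population):.1f}")
print(f"self-selected mean: {statistics.mean(volunteers):.1f}")
print(f"random-sample mean: {statistics.mean(random_sample):.1f}")
```

Under these assumptions the self-selected mean sits above the population mean, while the random sample's mean stays close to it.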

Why should you read a paper's conclusion before its methods section?

The conclusion tells you what the authors claim their study shows. Knowing the claim in advance lets you read the methods with a specific question in mind: does this method actually support that conclusion? If the conclusion makes a causal claim ("adopting linters reduces bugs"), you know to look for whether the methods involved randomization or only observation. If you read methods first, you absorb a lot of detail without knowing which parts are load-bearing for the final argument.

Lo et al. found that many practitioners do not read SE research. What would need to change for research to be more accessible and relevant to practitioners?

At minimum, venues would need to require plain-language summaries written for a non-specialist audience, and researchers would need incentives to produce them. The research questions themselves would need to connect more directly to decisions practitioners actually face, rather than being driven primarily by what is tractable with available data. Practitioners would need channels through which to communicate what they find confusing or irrelevant. None of this is technically difficult; it is a coordination and incentive problem. The most promising existing interventions are embedded researchers in industry and practitioner-facing publication tracks at major conferences.

Exercises

Paired Paper Review

Work in pairs; each pair receives a different short SE paper. Identify the research question, the independent variable, the dependent variable, the sample size, and the statistical method used. Find one thing the authors did well methodologically. Find one validity threat the authors did not discuss in their threats-to-validity section. Pick one reported statistic (a percentage, a mean, a Cohen's d) and check whether it is consistent with other numbers in the same paper. Present your findings in two minutes, ending with the sentence: "We reproduced / could not reproduce the claim that ___."

Threats Not Acknowledged

Pick any paper discussed in this tutorial that you did not use for the paired review exercise. Read its threats-to-validity section carefully. List every threat the authors acknowledge. Then list at least one threat they do not acknowledge. Write two sentences explaining why the unacknowledged threat matters for interpreting the paper's conclusions, and rate how serious you think it is on a scale from minor to fatal.

Plain-Language Summary

Write a one-paragraph plain-language summary of any one study from this tutorial, aimed at a software developer who has never read an academic paper. Your summary must state the main finding in one sentence that avoids statistical jargon, give the sample size in plain terms ("the researchers studied forty developers over six months"), and include one sentence explaining a reason to be cautious about generalizing the result to other contexts.

Verifying a Table Entry

Take any table of results from a paper covered in this tutorial. Identify one number in the table that you can compute from other numbers in the same paper — for example, a percentage derived from a count and a total, or a Cohen's d derived from two group means and standard deviations. Compute it yourself using the reported values. Report whether your computed value matches the published value. Write one sentence explaining what a mismatch would imply about the paper's reliability.
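The two example checks mentioned above can be sketched as follows. The pooled-standard-deviation form of Cohen's d is one common definition; a paper may use a different variant, so check its methods section before calling a difference a mismatch. All input numbers below are hypothetical.

```python
import math


def percentage(count, total):
    """Percentage derived from a count and a total."""
    return 100.0 * count / total


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_variance)


# Hypothetical table values: 37 of 148 respondents, and two groups of 20.
print(f"{percentage(37, 148):.1f}%")                      # 25.0%
print(f"d = {cohens_d(12.0, 2.0, 20, 10.0, 2.0, 20):.2f}")  # d = 1.00
```

When comparing against a published value, allow for rounding: a recomputed 24.96% agrees with a reported 25.0%, while a recomputed 31% does not.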

Evaluating a Preprint Claim

A preprint on arXiv claims that a new code review tool reduces review time by 40% based on a study of eight developers over two weeks. List four specific red flags in that claim. For each one, write one sentence explaining what additional information would let you decide whether the flag represents a serious problem or a minor limitation of an otherwise sound study.