Reading Studies Critically

In 2023, McKinsey published "Yes, You Can Measure Software Developer Productivity" [McKinsey2023]. It proposed a framework involving dozens of metrics across multiple dimensions and was widely cited by managers and executives as justification for measuring individual developer output. Kent Beck, one of the authors of the Agile Manifesto, wrote a pointed response: the paper treated productivity as an objective fact to be measured when it is actually a judgment about what matters, made by people with interests [Beck2023]. Researchers in the field found the paper's methodology thin and its conclusions overstated. The paper was not obviously fraudulent. It was something more insidious: confident, plausible, and wrong in ways that non-specialists could not easily detect.

This session gives you the tools to be that specialist.

How to Read a Research Paper

HARKing

p-Hacking

Publication Bias

Conflicts of Interest

Operationalization as a Site of Flexibility

A Checklist for Evaluating an Empirical SE Study

When the Outcome Measure Is a Dashboard

Misconceptions

Peer review detects p-hacking and HARKing.
Reviewers see the final manuscript, not the analysis history. Without access to pre-registered protocols or raw data, they cannot distinguish a cleanly hypothesized analysis from one that was tidied up after the fact.
Pre-registration guarantees that a study's results are valid.
It reduces one class of problem—undisclosed flexibility in analysis—by making deviations visible. It does not fix flawed designs, bad measurements, unrepresentative samples, or errors in execution.
Industry-funded research is always biased toward favorable results.
"Always" is too strong: some industry-funded studies are rigorous and honest. The concern is systematic, not universal: on average, funding source predicts the reported outcome, and independent replications often fail to support the favorable results.
A well-cited paper is probably right.
Citation counts reflect influence, not accuracy. Papers get cited because they tell a useful or interesting story, whether or not the story holds up to scrutiny. Some of the most-cited results in software engineering have failed to replicate.

Check Understanding

What is the difference between p-hacking and HARKing? Can a study engage in both?

p-hacking is running many analyses and selectively reporting the one that gives p < 0.05. HARKing (hypothesizing after the results are known) is presenting exploratory findings as if they were predicted hypotheses. A study can engage in both: the researcher tries analyses until one reaches significance (p-hacking), then writes the paper as if that pattern had been hypothesized in advance (HARKing). Both inflate false positive rates and make the literature look more certain than it is.

A paper reports 12 comparisons across different subgroups and finds that one is significant at p = 0.04. Should you conclude there is a real effect? Why or why not?

No. With 12 independent comparisons at α = 0.05, you expect 12 × 0.05 = 0.6 false positives by chance, and the probability of at least one false positive is 1 − 0.95^12, or about 46%. A single significant result from 12 comparisons is entirely consistent with chance, and p = 0.04 sits just under the threshold, which makes it especially suspicious. The appropriate response is to apply a multiple comparisons correction (such as Bonferroni, which would require p < 0.05/12 ≈ 0.004 for any individual test) and treat the result as preliminary until replicated.
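The arithmetic behind this answer can be checked with a few lines of Python. This is a minimal sketch, assuming the 12 tests are independent (as the answer does); the function names are illustrative and not taken from any library.

```python
# Multiple-comparisons arithmetic for the 12-subgroup example.
# Assumption: the tests are independent and every null hypothesis is true.

def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of significant results arising by chance alone."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


n_tests = 12
reported_p = 0.04
threshold = bonferroni_threshold(n_tests)
survives_correction = reported_p < threshold

print(f"expected false positives: {expected_false_positives(n_tests):.2f}")
print(f"P(at least one false positive): {familywise_error_rate(n_tests):.2f}")
print(f"Bonferroni threshold: {threshold:.4f}")
print(f"p = {reported_p} survives correction: {survives_correction}")
```

The reported p = 0.04 does not clear the corrected threshold, which is why the result should be treated as preliminary.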

What is the file drawer problem and how does it affect the credibility of published research?

The file drawer problem refers to the tendency for null results (studies that found no significant effect) to go unpublished: they stay "in the file drawer". Because journals prefer positive results, researchers are less likely to submit null findings. The consequence is that the published literature overrepresents positive findings: if you read every published study on a topic, you will see more support for an effect than the full body of evidence, including the unpublished studies, would justify. Meta-analyses that include only published studies inherit this bias.
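A small simulation can make the size of this distortion concrete. The sketch below is illustrative only: the true effect, the standard error, and the probability that a non-significant study gets published are all made-up assumptions, and the publication rule (significant positive results are always published) is a deliberate simplification.

```python
import random
import statistics

# Toy model of the file drawer problem.
# All parameter values are hypothetical, chosen only for illustration.

def simulate_literature(true_effect=0.1, std_error=0.2, n_studies=2000,
                        null_publish_prob=0.1, seed=0):
    """Return (all_estimates, published_estimates) for simulated studies."""
    rng = random.Random(seed)
    all_estimates = []
    published = []
    for _ in range(n_studies):
        # Each study estimates the true effect with sampling noise.
        estimate = rng.gauss(true_effect, std_error)
        all_estimates.append(estimate)
        # Simplified rule: a positive result with z > 1.96 is always published;
        # anything else is published only occasionally.
        significant = estimate / std_error > 1.96
        if significant or rng.random() < null_publish_prob:
            published.append(estimate)
    return all_estimates, published


all_estimates, published = simulate_literature()
mean_all = statistics.mean(all_estimates)
mean_published = statistics.mean(published)

print(f"studies run: {len(all_estimates)}, published: {len(published)}")
print(f"mean effect, all studies: {mean_all:.3f}")
print(f"mean effect, published only: {mean_published:.3f}")
```

Under these assumptions the published mean sits well above the mean of all studies, which is the bias a meta-analysis of published work would inherit.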

The following abstract contains a claim that is not supported by typical RCT design. Identify and explain the problem: "Our randomized controlled trial shows that developers using our tool are more productive. We therefore recommend that all engineering teams adopt it."

An RCT establishes that the treatment caused an outcome in the studied population under the studied conditions. "More productive" in an RCT typically means "faster at the specific task used in the study." The recommendation to "all engineering teams" is an external validity claim: that the effect generalizes beyond the study's sample (often students or volunteers at one organization), task (often artificial or narrow), and context. Most SE experiments cannot support broad adoption recommendations because their external validity is limited.

Exercises

Apply the Checklist (20 minutes)

Apply the nine-item checklist from this lesson to Peng et al. 2023 (the GitHub Copilot study) or to the McKinsey productivity paper. For each item, give a one-sentence verdict and one piece of evidence from the paper that supports your verdict.

Funnel Plot Interpretation (15 minutes)

Search for a meta-analysis of any software engineering practice (code review, pair programming, TDD, etc.) that includes a funnel plot. Describe what the funnel plot shows. Is there evidence of publication bias? If the funnel is asymmetric, what does that suggest about the true effect size compared to what the published literature reports?

Find the Spin (15 minutes)

Find a research paper or industry report about AI tools and developer productivity published in the last two years. Identify one place where the authors' conclusions go beyond what their methods can support—what a statistician would call "spin." Write two sentences: one quoting the overclaimed statement, one describing what the evidence actually shows.