Reading Studies Critically

In 2023, McKinsey published "Yes, You Can Measure Software Developer Productivity" [McKinsey2023]. It proposed a framework involving dozens of metrics across multiple dimensions and was widely cited by managers and executives as justification for measuring individual developer output. Researchers found the paper's methodology thin and its claims overstated. The paper was not obviously fraudulent. It was something more insidious: confident, plausible, and wrong in ways that non-specialists could not easily detect. Kent Beck, one of the authors of the Agile Manifesto, was among those who publicly criticized it.

The gap between belief and evidence runs deep. [Devanbu2016] studied developers at Microsoft and found that their beliefs about software engineering are strongly held but formed from personal experience rather than research, and do not reliably correspond to actual data from the projects those developers work on. This session gives you the critical reading skills needed to do better.

How to Read a Research Paper

HARKing

p-Hacking

Publication Bias

From [Wicherts2011]:

We related the reluctance to share research data for reanalysis to 1148 statistically significant results reported in 49 papers published in two major psychology journals. We found the reluctance to share data to be associated with weaker evidence…and a higher prevalence of apparent errors in the reporting of statistical results. The unwillingness to share data was particularly clear when reporting errors had a bearing on statistical significance.

In other words, when authors are unwilling to share their data for reanalysis, treat that as a warning sign: it is more likely that they have overstated their findings or made a mistake in their analysis.

Conflicts of Interest

Operationalization as a Site of Flexibility

A Checklist for Evaluating an Empirical SE Study

When the Outcome Measure Is a Dashboard

Misconceptions

Peer review detects p-hacking and HARKing.
Reviewers see the final manuscript, not the analysis history. Without access to pre-registered protocols or raw data, they cannot distinguish a cleanly hypothesized analysis from one that was tidied up after the fact.
Pre-registration guarantees that a study's results are valid.
It reduces one class of problem—undisclosed flexibility in analysis—by making deviations visible. It does not fix flawed designs, bad measurements, unrepresentative samples, or errors in execution.
Industry-funded research is always biased toward favorable results.
"Always" is too strong: some industry-funded studies are rigorous and honest. The concern is systematic, not universal: on average, funding source predicts outcome in ways that independent replication often does not support.
A well-cited paper is probably right.
Citation counts reflect influence, not accuracy. Papers get cited because they tell a useful or interesting story, whether or not the story holds up to scrutiny. Many highly cited results in software engineering have failed to replicate.

Check Understanding

What is the difference between p-hacking and HARKing? Can a study engage in both?

p-hacking means running many analyses and selectively reporting the ones that give p < 0.05. HARKing (hypothesizing after the results are known) means presenting exploratory findings as if they were predicted hypotheses. Yes, a study can engage in both: the researcher tries many analyses until one reaches significance (p-hacking), then writes the paper as if that pattern had been hypothesized in advance (HARKing). Both inflate false positive rates and make the literature look more certain than it is.

A paper reports 12 comparisons across different subgroups and finds that one is significant at p = 0.04. Should you conclude there is a real effect? Why or why not?

No. With 12 independent comparisons at α = 0.05, you expect 0.6 false positives by chance even if no real effect exists, and the probability of at least one is about 46% (1 − 0.95^12). A single significant result from 12 comparisons is therefore entirely consistent with chance, and because p = 0.04 is only just below the threshold, it is especially suspicious. The appropriate response is to apply a multiple comparisons correction (such as Bonferroni, which would require p < 0.05/12 ≈ 0.004 for each individual test) and treat the result as preliminary until replicated.
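The arithmetic in this answer can be checked in a few lines. The sketch below assumes the 12 tests are independent and that every null hypothesis is true; the variable names are chosen for illustration only.

```python
# Arithmetic for 12 independent tests at alpha = 0.05,
# assuming every null hypothesis is true (no real effect anywhere).
alpha = 0.05
n_tests = 12

# Expected number of false positives across all tests.
expected_false_positives = alpha * n_tests

# Probability that at least one test is significant by chance alone.
familywise_error_rate = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide alpha by the number of tests.
bonferroni_threshold = alpha / n_tests

# The single reported result from the question.
reported_p = 0.04
survives_correction = reported_p < bonferroni_threshold

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"chance of at least one:   {familywise_error_rate:.2f}")
print(f"Bonferroni threshold:     {bonferroni_threshold:.4f}")
print(f"p = {reported_p} survives:  {survives_correction}")
```

The reported p = 0.04 does not survive the correction. Bonferroni is conservative, so less strict corrections exist, but none of them would rescue a result this close to the uncorrected threshold.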

What is the file drawer problem and how does it affect the credibility of published research?

The file drawer problem refers to the tendency for studies that found no significant effect to go unpublished. Because journals prefer positive results, researchers are less likely to submit null findings. The consequence is that the published literature overrepresents positive findings: if you read all published studies on a topic, you will see more support for effects than actually exists in the world. Meta-analyses that include only published studies inherit this bias.
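A small simulation can make the size of this distortion concrete. The sketch below is illustrative only: the true effect, sample size, number of studies, and the rule that only significant positive results get published are all assumptions, and each study's estimate is drawn directly from an approximate normal sampling distribution instead of being computed from simulated raw data.

```python
import math
import random
import statistics

# Hypothetical setup: every value here is an illustrative assumption.
TRUE_EFFECT = 0.2   # true standardized effect size
N_PER_GROUP = 20    # participants per arm in each study
N_STUDIES = 2000    # studies run on the same question
Z_CRITICAL = 1.96   # cutoff for significance at the 5% level


def simulate_literature(seed=0):
    """Return (all_estimates, published_estimates) for one simulated field."""
    rng = random.Random(seed)
    # Approximate standard error of a standardized mean difference.
    standard_error = math.sqrt(2 / N_PER_GROUP)
    all_estimates = [
        rng.gauss(TRUE_EFFECT, standard_error) for _ in range(N_STUDIES)
    ]
    # File drawer: only significant positive results are written up.
    published = [e for e in all_estimates if e / standard_error > Z_CRITICAL]
    return all_estimates, published


all_estimates, published = simulate_literature()
all_mean = statistics.mean(all_estimates)
published_mean = statistics.mean(published)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"mean of all studies:    {all_mean:.2f}")
print(f"mean of published only: {published_mean:.2f}")
print(f"share published:        {len(published) / N_STUDIES:.0%}")
```

Under these assumptions the mean of all studies lands close to the true effect, while the mean of the published subset is several times larger, because a small study can only reach significance by overestimating the effect.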

The following abstract contains a claim that is not supported by typical RCT design. Identify and explain the problem: "Our randomized controlled trial shows that developers using our tool are more productive. We therefore recommend that all engineering teams adopt it."

An RCT establishes that the treatment caused an outcome in the studied population under the studied conditions. "More productive" in an RCT typically means "faster at the specific task used in the study". The recommendation to "all engineering teams" is an external validity claim: that the effect generalizes beyond the study's sample (often students or volunteers at one organization), task (often artificial or narrow), and context. Most SE experiments cannot support broad adoption recommendations because their external validity is limited.

Exercises

Apply the Checklist (20 minutes)

Apply the checklist from this lesson to [Peng2023] or [McKinsey2023]. For each item, give a one-sentence verdict and one piece of evidence from the paper that supports your verdict.

Funnel Plot Interpretation (15 minutes)

Search for a meta-analysis of any software engineering practice (code review, pair programming, TDD, etc.) that includes a funnel plot. Describe what the funnel plot shows. Is there evidence of publication bias? If the funnel is asymmetric, what does that suggest about the true effect size compared to what the published literature reports?

Find the Spin (15 minutes)

Find a research paper or industry report about AI tools and developer productivity published in the last two years. Identify one place where the authors' conclusions go beyond what their methods can support (what a statistician would call "spin"). Write two sentences: one quoting the overclaimed statement, one describing what the evidence actually shows.