Introduction: What Do We Actually Know?

In 2023, a team at GitHub published a paper claiming that developers using Copilot completed a programming task 55% faster than those who did not [Peng2023]. That number spread through tech media like a cold through an open-plan office. Executives cited it in all-hands meetings. Blog posts called it a "landmark study." A few people asked whether 55% was even plausible.

They were right to ask. The study used 95 participants on a single artificial task (implementing an HTTP server in JavaScript), lasted ninety minutes, and was funded by GitHub. None of that makes it wrong, but it does mean "55% faster" applies to a very narrow slice of software development. This workshop is about learning to ask that kind of question systematically.

What Empirical Software Engineering Is

Why Media Coverage Is Unreliable

Claims, Studies, and Evidence

Qualitative vs. Quantitative

Workshop Roadmap

Misconceptions

Peer review means a study is correct.
Peer review is a filter for obvious errors and implausible claims, not a certificate of validity. Reviewers rarely re-run analyses, check raw data, or verify that reported results follow from the methods described.
A larger sample always makes a study better.
Sample size matters, but it cannot fix a flawed design. A study with ten thousand participants on an irrelevant task still tells you nothing about the question you care about.
Qualitative methods are less rigorous than quantitative methods.
They use different standards of rigor (saturation instead of power, audit trails instead of pre-registration), not lower ones. The rigor failure is choosing the wrong method for the question, not the use of qualitative methods as such.
One well-designed study settles a question.
Single studies, however well designed, report what happened in one sample under one set of conditions. A finding becomes reliable through replication across different samples, contexts, and research teams.
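
The sample-size and single-study misconceptions can be illustrated with a small simulation. This is a minimal sketch, not a model of any real study: the true effect of 0.2, the design bias of 0.5, and the group sizes are all hypothetical numbers chosen for illustration, and `run_study` is an invented helper.

```python
import random
import statistics


def run_study(n, true_effect, design_bias=0.0, rng=random):
    """Estimate a treatment effect from n participants per group.

    design_bias is a hypothetical systematic error (for example, a task that
    happens to favour the treated group) that shifts the treated group's
    scores by a fixed amount regardless of sample size.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_effect + design_bias, 1.0) for _ in range(n)]
    return statistics.fmean(treated) - statistics.fmean(control)


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_EFFECT = 0.2        # hypothetical effect, in standard deviations

# A very large study with a flawed design: precise, but precisely wrong.
big_biased = run_study(10_000, TRUE_EFFECT, design_bias=0.5, rng=rng)

# Twenty unbiased replications with only 20 participants per group.
replications = [run_study(20, TRUE_EFFECT, rng=rng) for _ in range(20)]

print(f"true effect:             {TRUE_EFFECT:.2f}")
print(f"biased study, n=10000:   {big_biased:.2f}")
print(f"small studies, range:    {min(replications):.2f} to {max(replications):.2f}")
print(f"small studies, average:  {statistics.fmean(replications):.2f}")
```

Under these assumptions, the large study settles close to 0.7 (the true effect plus the bias) no matter how many participants are added, while the individual small studies scatter widely around 0.2 and only their average comes near it.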

Check Understanding

What is the difference between a claim and evidence?

A claim is an assertion—"AI tools make programmers more productive." Evidence is what a systematic study produces when it tests that claim. Evidence varies in quality depending on the study design, sample size, context, and potential for bias. A single study produces evidence for or against a specific claim under specific conditions; it does not prove or disprove the claim in general.

The GitHub Copilot study found a 55% speed increase. A colleague says this means "AI doubles productivity." Identify two specific problems with that interpretation.

First, the study used an artificial task (implementing an HTTP server in JavaScript in 90 minutes), which may not represent the full range of software development work. Second, "55% faster on one task" is not the same as "doubles overall productivity": productivity involves much more than task completion time, and the effect on a full working day is unknown. A third problem is that the study was funded by GitHub, which has a financial interest in Copilot appearing effective.
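
The second problem can be made concrete with some arithmetic. The sketch below applies an Amdahl's-law style calculation: if only part of the working day consists of tasks like the one studied, the overall gain is smaller than the gain on the task. The 30% share is a hypothetical figure, not a measurement, and the 1.55 factor assumes the colleague's "55% speed increase" is read literally as "1.55 times as fast." The function name `overall_speedup` is invented for this example.

```python
def overall_speedup(task_fraction, task_speedup):
    """Amdahl-style speedup of the whole working day.

    task_fraction: share of working time spent on the accelerated activity.
    task_speedup:  how many times faster that activity becomes.
    """
    if not 0.0 <= task_fraction <= 1.0:
        raise ValueError("task_fraction must be between 0 and 1")
    if task_speedup <= 0:
        raise ValueError("task_speedup must be positive")
    # Time left after the speedup, as a fraction of the original day.
    remaining = (1.0 - task_fraction) + task_fraction / task_speedup
    return 1.0 / remaining


CODING_SHARE = 0.30   # hypothetical: 30% of the day resembles the studied task
TASK_SPEEDUP = 1.55   # assumption: "55% speed increase" read as 1.55x

for share in (0.10, CODING_SHARE, 0.50, 1.00):
    gain = overall_speedup(share, TASK_SPEEDUP)
    print(f"{share:.0%} of the day affected -> {gain:.2f}x overall")
```

With these assumed numbers, a 1.55x gain on 30% of the day works out to roughly 1.12x overall, and even if the whole day were spent on such tasks the result would be 1.55x, not 2x.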

A researcher says: "We used qualitative methods because quantitative methods can't capture the nuances of developer experience." A second researcher says: "We used quantitative methods because qualitative findings can't be generalized." Which researcher is right?

Neither is straightforwardly right. Qualitative methods are better suited to understanding what is happening and why, while quantitative methods are better suited to measuring how much and whether there is a difference. The choice of method should depend on the research question, not on a general preference. Dismissing the other approach without considering the question is a sign of methodological tribalism, not sound reasoning.

The following sentence contains an error in reasoning about research methods. Identify and fix it: "The study was pre-registered, so its results must be valid."

Pre-registration means the researchers committed to their hypotheses and analysis plan before collecting data, which reduces the risk of HARKing (Hypothesizing After Results are Known) and p-hacking. However, it does not guarantee valid results. A pre-registered study can still have design flaws, measurement problems, small samples, biased participants, or errors in analysis. Pre-registration is a quality indicator for transparency, not a certificate of correctness.
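
To see why committing to an analysis plan matters, the simulation below sketches one form of p-hacking: measuring several outcomes when there is no real effect and reporting whichever one reaches p < 0.05. Everything here is an illustrative assumption: the 30 participants per group, the ten outcomes, the known-variance z-test, and the helper names `p_value` and `false_positive_rate`.

```python
import random
import statistics

NORMAL = statistics.NormalDist()  # standard normal distribution


def p_value(group_a, group_b):
    """Two-sided z-test p-value for equal-sized groups with known variance 1."""
    n = len(group_a)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    z = diff / (2 / n) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))


def false_positive_rate(outcomes_tried, studies=1000, n=30, seed=0):
    """Share of no-effect studies that report at least one p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        for _ in range(outcomes_tried):
            # Both groups come from the same distribution: the true effect is zero.
            a = [rng.gauss(0, 1) for _ in range(n)]
            b = [rng.gauss(0, 1) for _ in range(n)]
            if p_value(a, b) < 0.05:
                hits += 1
                break  # the analyst stops at the first "significant" outcome
    return hits / studies


planned = false_positive_rate(outcomes_tried=1)    # one outcome fixed in advance
flexible = false_positive_rate(outcomes_tried=10)  # pick the best of ten

print(f"one planned outcome:  {planned:.1%} false positives")
print(f"best of ten outcomes: {flexible:.1%} false positives")
```

With one planned outcome the false-positive rate should sit near the nominal 5%. With ten independent outcomes the expected rate is 1 - 0.95^10, or about 40%. Pre-registration guards against this kind of flexibility, but as the answer above notes, it does nothing about a flawed design or a poor measure.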

Exercises

Find the Funding (15 minutes)

Find a recent news article about an AI tool's impact on software development. Locate the original study it references (or the closest study you can find). Identify who funded the study, what the actual sample was, what task was studied, and what the abstract claims versus what the methods section actually supports. Write two or three sentences summarizing the gap between the headline and the evidence.

Qualitative or Quantitative? (10 minutes)

For each of the following research questions, decide whether you would primarily use qualitative or quantitative methods, and give one sentence explaining why:

Replication Reality Check (20 minutes)

Pick one claim you have heard or read about AI and software development (e.g., "AI tools reduce onboarding time," "AI-generated code has more security vulnerabilities"). Search for at least two studies that address this claim. Do they agree? If not, what differences in method might explain the disagreement?