Introduction: What Do We Actually Know?
In 2023, a team at GitHub published a paper claiming that developers using Copilot completed tasks 55% faster than those who did not [Peng2023]. That number spread through tech media like a cold through an open-plan office. Executives cited it in all-hands meetings. Blog posts called it a "landmark study." A few people asked whether 55% was even plausible.
They were right to ask. The study used 95 participants on a single artificial task (implementing an HTTP server in JavaScript), lasted ninety minutes, and was funded by GitHub. None of that makes it wrong, but it does mean "55% faster" applies to a very narrow slice of software development. This workshop is about learning to ask that kind of question systematically.
What Empirical Software Engineering Is
- Empirical software engineering (ESE) uses observation and experiment to study how software is built, who builds it, and what practices actually work
- It draws on methods from psychology, sociology, economics, and statistics
- The field dates back to at least the late 1960s (Sackman et al.'s controversial 1968 study on programmer productivity is an early example) [Sackman1968]
- Even simple-sounding questions resist easy answers: Prechelt measured 73 professional developers solving the same programming task and found completion times ranged from 0.6 to 63 hours, a 105-fold difference [Prechelt2000]. After controlling for programming language, the ratio shrank to 17-fold; with a careful definition of "more productive," it shrank further to between 5-fold and 11-fold. The answer changed with every definitional choice
- ESE is distinct from "building better tools"—its goal is knowledge, not artifacts
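The point that the ratio depends on the definition can be sketched in a few lines. The first calculation uses only the two extremes quoted above. The second uses invented completion times and one possible alternative definition (median of the slower half over median of the faster half). That definition is an assumption chosen for illustration, not necessarily the one Prechelt used.

```python
from statistics import median


def max_min_ratio(times):
    """Slowest completion time divided by fastest completion time."""
    return max(times) / min(times)


def half_median_ratio(times):
    """Median of the slower half divided by median of the faster half.

    Illustrative definition only; other definitions are equally defensible.
    """
    ordered = sorted(times)
    half = len(ordered) // 2
    faster, slower = ordered[:half], ordered[-half:]
    return median(slower) / median(faster)


# The two extremes quoted in the text: 0.6 and 63 hours.
reported_ratio = max_min_ratio([0.6, 63])

# Hypothetical completion times in hours, invented for this sketch.
hypothetical_hours = [0.6, 2, 3, 4, 5, 6, 8, 10, 15, 63]
extreme_ratio = max_min_ratio(hypothetical_hours)
typical_ratio = half_median_ratio(hypothetical_hours)

print(f"reported extremes: {reported_ratio:.0f}-fold")
print(f"hypothetical data, max/min: {extreme_ratio:.0f}-fold")
print(f"hypothetical data, half medians: {typical_ratio:.1f}-fold")
```

The same list of times yields very different "productivity ratios" depending on whether the definition is driven by the two most extreme people or by typical members of each half.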
Why Media Coverage Is Unreliable
- Journalists rarely read past the abstract, and abstracts rarely report limitations
- Press releases are written by communications teams whose job is to generate coverage, not to ensure accuracy
- Preprints (papers not yet peer-reviewed) circulate as freely as peer-reviewed papers, with no visible distinction to non-specialists
- Conflict of interest is common: a large fraction of AI/productivity studies are funded by the companies selling the tools being studied
- Replication is rare in software engineering research: most results are reported once and never checked
Claims, Studies, and Evidence
- A claim is an assertion: "AI tools make programmers more productive"
- A study is a systematic attempt to test a claim
- Evidence is what a study produces, and evidence varies in quality
- The chain from claim to policy is often much shorter than it should be: one study with a nice headline can change hiring practices, procurement decisions, and university curricula
- Good scientific practice distinguishes between "this study found X" and "X is true"
Qualitative vs. Quantitative
- Quantitative methods measure things: task completion time, defect rate, commit frequency
- Good for "how much" and "is there a difference"
- Requires that you can define and measure what you care about
- Qualitative methods describe things: what developers think about a tool, how teams decide what to build, why a particular practice was adopted
- Good for "why" and "what is happening here"
- Requires that you can access and interpret people's experiences
- Neither is inherently superior; the question determines the method
- Mixed-methods studies use both, which is more expensive but often more informative
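As a concrete instance of the quantitative question "is there a difference," here is a minimal sketch of a permutation test on task completion times. The data are invented, the function name `permutation_p_value` is hypothetical, and a permutation test is only one of several reasonable choices. It is shown because it needs nothing beyond the standard library.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean completion time.

    Repeatedly shuffles the group labels and counts how often the shuffled
    difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical completion times in minutes, invented for this sketch.
with_tool = [52, 61, 48, 70, 55, 66, 59, 63]
without_tool = [75, 82, 69, 91, 78, 85, 73, 88]

p_value = permutation_p_value(with_tool, without_tool)
print(f"difference in means: {mean(without_tool) - mean(with_tool):.1f} minutes")
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value answers "is there a difference" for these sixteen numbers only. It says nothing about why the difference exists, which is where qualitative methods come in.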
Workshop Roadmap
- Session 2: What does "productivity" mean, and can we measure it?
- Session 3: Qualitative methods—interviews and surveys
- Session 4: Quantitative methods—controlled experiments
- Session 5: Observational studies and natural experiments
- Session 6: Reading studies critically
- Session 7: Running studies in your own organization
- Session 8: Synthesis and next steps
Misconceptions
- Peer review means a study is correct.
- Peer review is a filter for obvious errors and implausible claims, not a certificate of validity. Reviewers rarely re-run analyses, check raw data, or verify that reported results follow from the methods described.
- A larger sample always makes a study better.
- Sample size matters, but it cannot fix a flawed design. A study with ten thousand participants on an irrelevant task still tells you nothing about the question you care about.
- Qualitative methods are less rigorous than quantitative methods.
- They use different standards of rigor (saturation instead of power, audit trails instead of pre-registration), not lower ones. The rigor failure is choosing the wrong method for the question, not the use of qualitative methods as such.
- One well-designed study settles a question.
- Single studies, however well designed, report what happened in one sample under one set of conditions. A finding becomes reliable through replication across different samples, contexts, and research teams.
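The claim that sample size cannot fix a flawed design can be illustrated with a small simulation. Everything here is hypothetical: the true effect, the size of the design bias, and the noise level are invented numbers, and `simulate_estimate` is a name made up for this sketch.

```python
import random
from statistics import mean


def simulate_estimate(n, true_effect, bias, noise_sd=10.0, seed=0):
    """Average of n noisy measurements from a design that adds a fixed bias."""
    rng = random.Random(seed)
    measurements = [true_effect + bias + rng.gauss(0, noise_sd) for _ in range(n)]
    return mean(measurements)


TRUE_EFFECT = 5.0   # hypothetical true effect, arbitrary units
DESIGN_BIAS = 20.0  # hypothetical systematic error from a flawed design

small_study = simulate_estimate(100, TRUE_EFFECT, DESIGN_BIAS)
large_study = simulate_estimate(10_000, TRUE_EFFECT, DESIGN_BIAS)

print(f"true effect:           {TRUE_EFFECT:.1f}")
print(f"estimate, n = 100:     {small_study:.1f}")
print(f"estimate, n = 10,000:  {large_study:.1f}")
```

More participants reduce the random scatter of the estimate, but both estimates settle near the true effect plus the bias, not near the true effect itself.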
Check Understanding
What is the difference between a claim and evidence?
A claim is an assertion—"AI tools make programmers more productive." Evidence is what a systematic study produces when it tests that claim. Evidence varies in quality depending on the study design, sample size, context, and potential for bias. A single study produces evidence for or against a specific claim under specific conditions; it does not prove or disprove the claim in general.
The GitHub Copilot study reported that developers using Copilot completed the task 55% faster. A colleague says this means "AI doubles productivity." Identify two specific problems with that interpretation.
First, the study used an artificial task (implementing an HTTP server in JavaScript in 90 minutes), which may not represent the full range of software development work. Second, "55% faster on one task" is not the same as "doubles overall productivity"—productivity involves much more than task completion time, and the effect on the full working day is unknown. A third problem, if needed: the study was funded by GitHub, which has a financial interest in Copilot appearing effective.
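A further wrinkle is that "55% faster" can be read in two ways, and the arithmetic differs. The sketch below works through both readings. Which reading a given article or study intends is an assumption the reader has to check, and the function names are hypothetical.

```python
def throughput_multiplier_from_time_saved(fraction_saved):
    """Tasks-per-hour multiplier if each task takes (1 - fraction_saved) of the old time."""
    if not 0 <= fraction_saved < 1:
        raise ValueError("fraction_saved must be in [0, 1)")
    return 1 / (1 - fraction_saved)


def time_saved_from_throughput_gain(fraction_gain):
    """Fraction of time saved per task if throughput rises by fraction_gain."""
    if fraction_gain < 0:
        raise ValueError("fraction_gain must be non-negative")
    return 1 - 1 / (1 + fraction_gain)


# Reading A: the task took 55% less time.
multiplier_a = throughput_multiplier_from_time_saved(0.55)

# Reading B: 55% more work was done per unit of time.
multiplier_b = 1 + 0.55
time_saved_b = time_saved_from_throughput_gain(0.55)

print(f"Reading A: {multiplier_a:.2f}x throughput on that task")
print(f"Reading B: {multiplier_b:.2f}x throughput, {time_saved_b:.0%} less time per task")
```

Under either reading the number describes one task under one set of conditions, so the objections above about generalizing to overall productivity still apply.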
A researcher says: "We used qualitative methods because quantitative methods can't capture the nuances of developer experience." A second researcher says: "We used quantitative methods because qualitative findings can't be generalized." Which researcher is right?
Neither is straightforwardly right. Qualitative methods are better suited to understanding why and what is happening, while quantitative methods are better suited to measuring how much and whether there is a difference. The choice of method should depend on the research question, not on a general preference. Dismissing the other approach without considering the question is a sign of methodological tribalism, not sound reasoning.
The following sentence contains an error in reasoning about study quality. Identify and fix it: "The study was pre-registered, so its results must be valid."
Pre-registration means the researchers committed to their hypotheses and analysis plan before collecting data, which reduces the risk of HARKing (Hypothesizing After Results are Known) and p-hacking. However, it does not guarantee valid results. A pre-registered study can still have design flaws, measurement problems, small samples, biased participants, or errors in analysis. Pre-registration is a quality indicator for transparency, not a certificate of correctness. A corrected version might read: "The study was pre-registered, so its hypotheses and analysis plan were fixed before the data were collected."
Exercises
Find the Funding (15 minutes)
Find a recent news article about an AI tool's impact on software development. Locate the original study it references (or the closest study you can find). Identify who funded the study, what the actual sample was, what task was studied, and what the abstract claims versus what the methods section actually supports. Write two or three sentences summarizing the gap between the headline and the evidence.
Qualitative or Quantitative? (10 minutes)
For each of the following research questions, decide whether you would primarily use qualitative or quantitative methods, and give one sentence explaining why:
- Do developers who use pair programming introduce fewer defects?
- Why do some teams adopt test-driven development while others reject it?
- How do developers decide when to ask an AI assistant for help versus working through a problem themselves?
- Does code review time decrease when teams adopt a particular tool?
Replication Reality Check (20 minutes)
Pick one claim you have heard or read about AI and software development (e.g., "AI tools reduce onboarding time," "AI-generated code has more security vulnerabilities"). Search for at least two studies that address this claim. Do they agree? If not, what differences in method might explain the disagreement?