AI coding tools are everywhere, and so are claims about what they do
In 2023, GitHub announced their tools made developers 55% faster
That number appeared in every executive presentation for a year
This workshop teaches you how to decide whether to believe it
Why Media Coverage Fails You
Journalists rarely read past the abstract of research reports
Abstracts rarely report limitations
People rarely publish negative results
Press releases are written to generate coverage, not to be accurate
Conflicts of interest are common
Most AI productivity studies are funded by the companies selling the tools
Empirical Software Engineering
Empirical software engineering (ESE) uses observation and experiment to study how software is built
It draws on psychology, sociology, economics, and statistics
as well as computer science
The field has existed since the 1960s
Bad news: most studies don't address things practitioners actually care about [Begel2014]
Good news: "most" isn't "all"
The Question You Actually Need to Answer
"Is AI helping my team?" sounds simple
But it requires answering:
Helping with what?
Measured how?
Compared to what?
These are not pedantic questions
The answer you get depends entirely on how you operationalize them
For Example
Prechelt measured 73 professional developers solving the same programming task
and found completion times ranged from 0.6 to 63 hours (i.e., 105X) [Prechelt2000]
After controlling for programming language the ratio shrank to 17X
With a careful definition of "more productive" it shrank further to 5X
The answer depends a lot on exactly what question is asked
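To see how much a headline ratio depends on its definition, the sketch below computes two different "spread" ratios from one list of completion times. The times are invented for illustration (only the extremes echo the 0.6 and 63 hours quoted above), and `slow_fast_ratio` is a hypothetical helper, not the procedure used in [Prechelt2000].

```python
from statistics import median

# Hypothetical completion times in hours; NOT data from [Prechelt2000].
times = [0.6, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 20.0, 63.0]


def max_min_ratio(values):
    """Slowest time divided by fastest time: dominated by the two outliers."""
    return max(values) / min(values)


def slow_fast_ratio(values):
    """Median of the slower half divided by median of the faster half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[half:]) / median(ordered[:half])


print(f"max/min ratio:             {max_min_ratio(times):.1f}")
print(f"slow-half/fast-half ratio: {slow_fast_ratio(times):.1f}")
```

Same people, same task, same data: the two definitions differ by more than an order of magnitude.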
Claims, Studies, and Evidence
A claim is an assertion: "AI tools make programmers more productive"
A study is a systematic attempt to test a claim
Evidence is what a study produces, and it varies in quality
One study with a nice headline can change hiring practices and university curricula
before anyone checks whether it replicates
Why "Productivity" Is Hard to Define
Manufacturing productivity means widgets per hour: both terms are measurable
Software output is not homogeneous: a ten-line bugfix may be worth more than a thousand-line feature
Much of software work is invisible: reading, reviewing, helping colleagues
A field study found developers spend only 25% of their day actually writing code [Meyer2017]
[Sadowski2019] is an entire book devoted to "this is really hard"
Construct Validity and Proxy Metrics
Construct validity is the degree to which a measurement captures the concept it is meant to represent
Lines of code written per day has low construct validity as a productivity measure
You can write more lines by making code worse
A proxy metric stands in for something harder to measure directly
Common proxies: lines of code, commit frequency, story points, pull requests merged
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure
Particularly when people feel their jobs are threatened…
What You Can and Cannot Measure
The DORA metrics have better construct validity than activity counts [Forsgren2018]
Deployment frequency, lead time, change failure rate, time to restore
They are tied to customer outcomes rather than developer busyness
They still have blind spots: a team can score well while building the wrong product
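A minimal sketch of how the four metrics might be computed from a deployment log. The `Deployment` record, its field names, and the sample log are all hypothetical; time to restore is measured here from deployment to recovery, which is a simplifying assumption rather than the definition in [Forsgren2018].

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def dora_summary(deployments, period_days):
    """Summarize the four metrics over an observation window of period_days."""
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_hours_to_restore": median(restore_times) if restore_times else None,
    }


# Invented one-week log with four deployments, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
log = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, hours=10),
               failed=True, restored_at=start + timedelta(days=1, hours=12)),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, hours=6)),
    Deployment(start + timedelta(days=5), start + timedelta(days=5, hours=8)),
]
summary = dora_summary(log, period_days=7)
print(summary)
```

Nothing in this summary says whether the deployed changes were worth building, which is the blind spot noted above.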
The Big Three Mistakes
Counting lines of code generated
Measures verbosity, not value
Timing artificial tasks
A 90-minute greenfield task does not predict real work
Measuring only the easy half
AI makes code generation faster,
but the measurement leaves out review time,
debugging confidently wrong suggestions,
and fixing security vulnerabilities
Bias and Baselines
Before/after with no control group
You cannot separate the AI effect from anything else that changed
Asking developers if they feel more productive
The novelty effect inflates self-reports for weeks
Feeling productive is not the same as being productive
Comparing volunteers to non-volunteers
Early adopters are usually already higher performers
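One common way to use a control group is a difference-in-differences comparison: subtract the change seen in the group without the tool from the change seen in the group with it. The technique is not named in this deck, and the numbers below are invented for illustration. It does not fix self-selection: if the AI group are volunteers, the estimate is still biased.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical median hours to close a ticket (invented numbers).
ai_before, ai_after = 20.0, 16.0            # team that adopted the AI tool
control_before, control_after = 21.0, 18.0  # comparable team that did not

naive_change = ai_after - ai_before  # before/after with no control group
adjusted_change = diff_in_diff(ai_before, ai_after, control_before, control_after)

print(f"naive before/after change:  {naive_change:+.1f} hours")
print(f"difference-in-differences:  {adjusted_change:+.1f} hours")
```

In this made-up case three of the four hours "saved" also disappeared for the team without the tool, so they cannot be credited to AI.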
Metrics That Mislead
Treating adoption rate as a success metric
It measures whether the tool is installed, not whether it helps
Treating suggestion acceptance rate as a quality signal
Developers under pressure accept more suggestions, including insecure ones [Pearce2022]
Comparing AI to nothing
The relevant question is whether AI outperforms the alternatives developers already have
Qualitative Methods: When and Why
Qualitative methods are for when you do not yet know what to measure
They answer "why" and "what is happening here" rather than "how much"
A survey of 410 developers about AI tools revealed
where AI actually helps and where it gets in the way [Liang2024]
Those findings are invisible to any study that measures only task completion times
The question determines the method, not preference or habit
Designing Good Interviews and Surveys
Semi-structured interviews have a guide but allow follow-up
Consistent enough to compare, flexible enough to surface surprises
Open questions invite narrative; closed questions invite classification
Avoid leading questions: "Don't you find it faster?" assumes the answer
Pilot your survey with 3–5 people before distributing it
Thematic Analysis
Open coding: read through the data and tag segments with descriptive labels
Use gerund coding: "avoiding AI for security tasks" rather than "AI distrust"
This preserves what participants are actually doing
A theme is a claim you could write as a sentence, not a bucket for related quotes
Stop collecting data when new interviews stop introducing new codes [Braun2019]
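A rough way to track the stopping rule above: record which codes each interview introduces and watch for a run of interviews that add none. The interview data is invented, and the `quiet_run` threshold of two consecutive interviews is an assumption for illustration, not a rule from [Braun2019].

```python
def new_codes_per_interview(coded_interviews):
    """For each interview, in order, return the codes not seen in earlier ones."""
    seen = set()
    fresh_by_interview = []
    for codes in coded_interviews:
        fresh = set(codes) - seen
        fresh_by_interview.append(fresh)
        seen |= fresh
    return fresh_by_interview


def saturation_point(coded_interviews, quiet_run=2):
    """Return the 1-based interview number at which `quiet_run` consecutive
    interviews have added no new codes, or None if that never happens."""
    quiet = 0
    fresh_sets = new_codes_per_interview(coded_interviews)
    for number, fresh in enumerate(fresh_sets, start=1):
        quiet = 0 if fresh else quiet + 1
        if quiet >= quiet_run:
            return number
    return None


# Hypothetical gerund codes assigned to four interviews.
interviews = [
    {"avoiding AI for security tasks", "rewriting suggestions"},
    {"rewriting suggestions", "checking generated tests"},
    {"avoiding AI for security tasks"},
    {"checking generated tests", "rewriting suggestions"},
]

print([len(fresh) for fresh in new_codes_per_interview(interviews)])
print(saturation_point(interviews))
```

A count like this is only a prompt for judgment: new codes can dry up because the interview guide is too narrow, not because the topic is exhausted.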
Controlled Experiments
A controlled experiment manipulates one variable and measures its effect while holding others constant
Randomization assigns participants to conditions randomly, distributing unknown confounders evenly
This is the mechanism that makes causal claims defensible
Full blinding is rarely possible in software engineering:
you cannot hide from a developer that they are using TDD
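Random assignment is simple to do and easy to audit if the seed is recorded. A minimal sketch, assuming two conditions; the condition names, participant IDs, and seed are hypothetical.

```python
import random


def assign_conditions(participants, conditions=("ai_assistant", "control"), seed=None):
    """Shuffle participants, then deal them round-robin into the conditions."""
    rng = random.Random(seed)  # a recorded seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for position, person in enumerate(shuffled):
        groups[conditions[position % len(conditions)]].append(person)
    return groups


participants = [f"dev{n:02d}" for n in range(1, 11)]
groups = assign_conditions(participants, seed=2024)
for condition, members in groups.items():
    print(condition, members)
```

The point of shuffling is that neither the researcher nor the participants choose who gets the tool, which is what separates this from comparing volunteers to non-volunteers.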
p-Values: What They Are and Are Not
A p-value is the probability of observing data at least as extreme as yours
if nothing were actually happening (that is, if the null hypothesis were true)
It is not the probability that the null hypothesis is true
It is not the probability that you will replicate
p < 0.05 is a convention from the 1920s, not a law of nature
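The definition above can be computed directly with an exact permutation test: if the tool made no difference, the group labels are arbitrary, so count how many relabelings give a difference at least as extreme as the observed one. This is one way to obtain a p-value, chosen here because it needs no distributional formula; the task times are invented.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    relabelings = 0
    # Enumerate every way of choosing which observations are labeled "group A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        relabelings += 1
    return extreme / relabelings


# Hypothetical task completion times in minutes.
with_ai = [42, 38, 51, 45, 40]
without_ai = [55, 49, 60, 47, 58]
p = permutation_p_value(with_ai, without_ai)
print(f"p = {p:.3f}")
```

Enumeration is only practical for small groups; with larger samples the usual approach is to sample relabelings at random instead.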
Effect Size Matters
Statistical significance tells you whether the data would be surprising if there were no effect
Effect size tells you how large it is
A study with thousands of participants can find statistically significant effects
that are too small to matter in practice
Always report effect size alongside p-values; one without the other is incomplete
Equally, if a study doesn't report both, it probably has other flaws as well
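One widely used standardized effect size is Cohen's d: the difference in means divided by the pooled standard deviation. The deck does not prescribe a particular measure, so treat this as one option; the sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference in means in units of the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical scores for two small groups.
group_a = [2, 4, 6]
group_b = [1, 3, 5]
print(f"d = {cohens_d(group_a, group_b):.2f}")
```

Unlike a p-value, d does not shrink or grow just because the sample got larger, which is why the two numbers answer different questions.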
Most SE Experiments Are Underpowered
Statistical power is the probability of detecting an effect if one exists
Studies with 20–30 participants can only detect very large effects [Kampenes2007]
Most software engineering experiments fall far below this threshold
The effects you do detect in underpowered studies are inflated
This is the winner's curse: only overestimates clear the significance bar, and reluctance to publish negative results hides the rest
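A back-of-the-envelope power calculation for comparing two equal-sized groups, using a normal approximation. This is a sketch, not the method of [Kampenes2007]: the approximation slightly overstates power for small samples compared with a t-test, and the effect sizes tried below are arbitrary illustrative values.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size_d is the true standardized difference (Cohen's d).
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value.
    return (standard_normal.cdf(shift - z_critical)
            + standard_normal.cdf(-shift - z_critical))


# 30 participants in total, split into two groups of 15.
for d in (0.2, 0.5, 0.8, 1.2):
    print(f"d = {d:.1f}: power is roughly {approximate_power(d, 15):.2f}")
```

When power is low, the studies that do reach significance are the ones whose samples happened to overshoot, which is where the inflation comes from.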
Observational Studies: Watching the World
Observational studies measure the world as it is, without manipulating variables
Advantages: real-world behavior, large datasets, no ethical concerns about withholding interventions
Disadvantage: you cannot establish causation because confounding variables cannot be ruled out
Mining software repositories (MSR) is the most common approach
Looking Where the Light Is
GitHub data is not a representative sample of software development
Inactive repositories, class assignments, personal experiments, and mirrors
all appear alongside production software
Survivorship bias: you only see projects that still exist
Example: projects with more tests also tend to have more experienced developers
You cannot attribute lower defect rates to testing alone
Reading Studies Critically
Start with the abstract: what claim is being made?
Jump to the methods before reading the results
Given this design, what can this study actually establish?
Read the limitations section
What do the authors say they cannot conclude?
If the limitations section feels flimsy, the rest of the paper probably is as well
HARKing and p-Hacking
HARKing (Hypothesizing After Results are Known):
writing a paper as if a pattern found during analysis was predicted in advance
p-hacking: trying multiple analyses until you get p < 0.05, then reporting only that analysis
With 20 independent tests at p < 0.05 and no real effects, you expect one false positive by chance
Pre-registration commits hypotheses before data collection, making both problems visible
And authors who do not share their data are more likely to have errors in their reported results [Wicherts2011]
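The arithmetic behind the 20-test figure above, assuming the tests are independent and no real effect exists in any of them. The function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests that come out significant by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests is significant."""
    return 1 - (1 - alpha) ** n_tests


print(f"expected false positives in 20 tests: {expected_false_positives(20):.1f}")
print(f"chance of at least one: {prob_at_least_one_false_positive(20):.2f}")
```

So an analyst who tries 20 unrelated comparisons has roughly a two-in-three chance of finding something "significant" to report.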
A Checklist for Evaluating a Study
Conclusion validity:
Was the sample large enough?
Are effect sizes reported?
Internal validity:
Was there a control group?
Was assignment random?
Could novelty or learning effects explain the result?
Construct validity:
Does the measurement actually capture the claim?
External validity:
Who are the subjects, and are they representative of the population the conclusions address?
Goal-Question-Metric
GQM provides a structured path from intent to measurement [Basili1994]
Define the goal: what object, what property, from whose viewpoint, in what context?
Generate the questions whose answers would tell you whether the goal was achieved
Identify the specific, operationalized metric that answers each question
And challenge whether the metric actually measures what you care about
Again, it helps to do this with someone outside your org
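Writing a GQM plan down as data makes gaps visible, such as a question with no metric attached. The class names, fields, and example content below are hypothetical; [Basili1994] does not define a data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metric:
    name: str
    operational_definition: str  # exactly how the number is obtained


@dataclass
class Question:
    text: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Goal:
    obj: str        # what object is being studied
    prop: str       # what property of it
    viewpoint: str  # from whose viewpoint
    context: str    # in what context
    questions: List[Question] = field(default_factory=list)

    def unmeasured_questions(self):
        """Questions that do not yet have any metric attached."""
        return [q.text for q in self.questions if not q.metrics]


# Invented example plan.
goal = Goal(
    obj="AI code review assistant",
    prop="review turnaround",
    viewpoint="team lead",
    context="one backend team, four weeks",
    questions=[
        Question(
            "Do reviews finish sooner?",
            [Metric("median hours from PR opened to first approval",
                    "computed from pull request timestamps")],
        ),
        Question("Do reviewers miss more defects?"),
    ],
)
print(goal.unmeasured_questions())
```

The structure cannot tell you whether a metric has good construct validity; that challenge still needs a person.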
Starting Cheaply
A small, informal study produces better evidence than a meeting
"Research is the process of finding out what you don't know" [Hall2019]
The standard for "enough" depends on the decision you are trying to support
Deciding to run a larger study: five sessions may be enough
Deciding to mandate a tool for five hundred developers: nope
Your goal is not to be rigorous
It is to be better informed at minimal cost
Formative vs. Summative Evaluation
Formative evaluation: studying something in order to improve it
Small samples, rapid iteration, qualitative focus
The question is "what is wrong?" not "how often is it wrong?"
Summative evaluation: assessing if something works well enough to deploy
Requires more participants, controlled conditions, and quantitative metrics
Most of what you can do without institutional support is formative
Mislabeling formative evidence as summative kills credibility
Finding Participants Without a Budget
Post in chat: "I need 45 minutes of your time to watch you do a code review"
Ask colleagues in adjacent teams who are not on your project
Aim for people who fit your target profile, not whoever responds first
Avoid people you manage, who manage you, or who already know your hypothesis
Five participants are typically enough to identify most problems in a specific workflow [Nielsen1993]
This does not apply to claims about "most developers"
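The five-participant rule of thumb comes from a simple model: if each participant independently reveals a given problem with probability L, then n participants reveal a share 1 - (1 - L)^n of the problems. The default L = 0.31 below is the average figure commonly quoted from Nielsen's work; treat it as an assumption, because the real rate varies by product and task.

```python
def share_of_problems_found(n_participants, detection_rate=0.31):
    """Expected share of problems seen at least once across n sessions."""
    return 1 - (1 - detection_rate) ** n_participants


for n in (1, 3, 5, 10):
    print(f"{n:2d} participants: about {share_of_problems_found(n):.0%} of problems")
```

The model describes how many problems in one workflow get noticed, not how common any problem is across developers, which is why it cannot back claims about "most developers".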
Running a Think-Aloud Session
Ask participants to verbalize their thoughts as they work: "say out loud whatever you are thinking"
Give concrete, realistic tasks, not open exploration
"Given this pull request, use the AI assistant to write a one-paragraph summary of the changes"
Observe where participants hesitate, backtrack, or state expectations that turn out to be wrong
Do not help when they get stuck or explain what the tool was designed to do
You will not be there when real users encounter the same problems
The RITE Method and What You Can Claim
RITE (Rapid Iterative Testing and Evaluation) [Medlock2002]: fix the most severe problem you observed before the next session
A session that reveals a new problem is more useful than one that confirms a known one
What five-session informal studies can support:
"We observed developers getting stuck at this step"
"In most sessions, participants…"
What they cannot support:
"X% of developers have this problem"
"Using this tool causes developers to write more bugs"
So, What Do We Know?
Code is statistically more repetitive and predictable than natural language [Hindle2016]
This is why language models work well for it
Nearly all studies are short-term, use narrow tasks, and rely on volunteers
External validity to professional development is largely assumed
Controlled experiments show AI tools can speed up specific, well-defined tasks for individual developers
Effects on end-to-end delivery (defect rates, lead time) are much less clear
What to Do Next
Ask for evidence before adopting tools:
what study supports this claim, is it applicable to your context, and can you see the raw data?
When an executive cites an AI productivity statistic, ask who funded the study,
what specific task was studied,
and whether there was a control group
Document your organization's experience and share it
Your customers are more likely to trust you if you're honestly self-critical
Your staff are more likely to trust you as well
Sharing Results Responsibly
Every study has limitations; state them before stakeholders ask
Distinguish between "we found no effect" and "our study was not designed to detect that effect"
Present uncertainty:
confidence intervals and effect sizes belong in results presented to management
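A minimal sketch of a confidence interval for a difference in means, using a normal approximation. The data is invented; for samples this small the interval is too narrow and a t-based interval would be more appropriate, so read this as an illustration of what to present rather than a recommended procedure.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group_a, group_b, confidence=0.95):
    """Return (low, estimate, high) for mean(group_a) - mean(group_b)."""
    estimate = mean(group_a) - mean(group_b)
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_critical * standard_error
    return estimate - margin, estimate, estimate + margin


# Hypothetical measurements for two small groups.
low, estimate, high = mean_difference_ci([2, 4, 6], [1, 3, 5])
print(f"difference = {estimate:.1f}, 95% CI [{low:.1f}, {high:.1f}]")
```

An interval that spans zero, as this one does, is the honest way to say "our study was not able to tell" rather than "we found no effect".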