How To Not Be Wrong About AI

Greg Wilson

June 2026

http://third-bit.com/notwrong/

What This Talk Is About

AI coding tools are everywhere, and so are claims about what they do
In 2023, GitHub announced their tools made developers 55% faster
That number appeared in every executive presentation for a year
This workshop teaches you how to tell whether to believe it

Why Media Coverage Fails You

Journalists rarely read past the abstract of research reports
Abstracts rarely report limitations
People rarely publish negative results
Press releases are written to generate coverage, not to be accurate
Conflicts of interest are common
- Most AI productivity studies are funded by the companies selling the tools

Empirical Software Engineering

Empirical software engineering (ESE) uses observation and experiment to study how software is built
It draws on psychology, sociology, economics, and statistics as well as computer science
The field has existed since the 1960s
Bad news: most studies don't address things practitioners actually care about [Begel2014]
Good news: "most" isn't "all"

The Question You Actually Need to Answer

"Is AI helping my team?" sounds simple
But it requires answering:
- Helping with what?
- Measured how?
- Compared to what?
These are not pedantic questions
The answer you get depends entirely on how you operationalize them

For Example

Prechelt measured 73 professional developers solving the same programming task and found completion times ranged from 0.6 to 63 hours (i.e., 105X) [Prechelt2000]
After controlling for programming language the ratio shrank to 17X
With a careful definition of "more productive" it shrank further to 5X
The answer depends a lot on exactly what question is asked

Claims, Studies, and Evidence

A claim is an assertion: "AI tools make programmers more productive"
A study is a systematic attempt to test a claim
Evidence is what a study produces, and it varies in quality
One study with a nice headline can change hiring practices and university curricula before anyone checks whether it replicates

Why "Productivity" Is Hard to Define

Manufacturing productivity means widgets per hour: both terms are measurable
Software output is not homogeneous: a ten-line bugfix may be worth more than a thousand-line feature
Much of software work is invisible: reading, reviewing, helping colleagues
- A field study found developers spend only 25% of their day actually writing code [Meyer2017]
[Sadowski2019] is an entire book devoted to "this is really hard"

Construct Validity and Proxy Metrics

Construct validity is the degree to which a measurement captures the concept it is meant to represent
Lines of code written per day has low construct validity as a productivity measure
- You can write more lines by making code worse
A proxy metric stands in for something harder to measure directly
- Common proxies: lines of code, commit frequency, story points, pull requests merged
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure
- Particularly when people feel their jobs are threatened…

What You Can and Cannot Measure

The DORA metrics have better construct validity than activity counts [Forsgren2018]
- Deployment frequency, lead time, change failure rate, time to restore
They are tied to customer outcomes rather than developer busyness
They still have blind spots: a team can score well while building the wrong product

The Big Three Mistakes

Counting lines of code generated
- Measures verbosity, not value
Timing artificial tasks
- A 90-minute greenfield task does not predict real work
Measuring only the easy half
- AI makes code generation faster, but that figure leaves out review time, debugging confidently wrong suggestions, and security vulnerabilities

Bias and Baselines

Before/after with no control group
- You cannot separate the AI effect from anything else that changed
Asking developers if they feel more productive
- The novelty effect inflates self-reports for weeks
- Feeling productive is not the same as being productive
Comparing volunteers to non-volunteers
- Early adopters are usually already higher performers

Metrics That Mislead

Treating adoption rate as a success metric
- It measures whether the tool is installed, not whether it helps
Treating suggestion acceptance rate as a quality signal
- Developers under pressure accept more suggestions, including insecure ones [Pearce2022]
Comparing AI to nothing
- The relevant question is whether AI outperforms the alternatives developers already have

Qualitative Methods: When and Why

Qualitative methods are for when you do not yet know what to measure
They answer "why" and "what is happening here" rather than "how much"
A survey of 410 developers about AI tools revealed where AI actually helps and where it gets in the way [Liang2024]
- Invisible to any study measuring only task completion times
The question determines the method, not preference or habit

Designing Good Interviews and Surveys

Semi-structured interviews have a guide but allow follow-up
- Consistent enough to compare, flexible enough to surface surprises
Open questions invite narrative; closed questions invite classification
Avoid leading questions: "Don't you find it faster?" assumes the answer
Pilot your survey with 3–5 people before distributing it

Thematic Analysis

Open coding: read through the data and tag segments with descriptive labels
Use gerund coding: "avoiding AI for security tasks" rather than "AI distrust"
- This preserves what participants are actually doing
A theme is a claim you could write as a sentence, not a bucket for related quotes
Stop collecting data when new interviews stop introducing new codes [Braun2019]

Controlled Experiments

A controlled experiment manipulates one variable and measures its effect while holding others constant
Randomization assigns participants to conditions by chance, so unknown confounders are balanced across groups on average
- This is the mechanism that makes causal claims defensible
Full blinding is rarely possible in software engineering:
- you cannot hide from a developer that they are using TDD
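
A minimal sketch of random assignment, using only the Python standard library. The function name `assign_conditions`, the condition labels "ai" and "control", and the participant names are all hypothetical, chosen for illustration. Shuffling first and then dealing participants out in turn keeps group sizes within one of each other.

```python
import random


def assign_conditions(participants, conditions=("ai", "control"), seed=None):
    """Randomly split participants into roughly equal-sized groups."""
    rng = random.Random(seed)      # a fixed seed makes the assignment auditable
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    # Deal shuffled participants out in turn, like dealing cards.
    for index, person in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(person)
    return groups


if __name__ == "__main__":
    people = ["dev1", "dev2", "dev3", "dev4", "dev5", "dev6"]
    print(assign_conditions(people, seed=42))
```

Recording the seed lets someone else reproduce the assignment and confirm that nobody chose their own group.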

p-Values: What They Are and Are Not

A p-value is the probability of observing data at least as extreme as yours if the null hypothesis were true (that is, if nothing was actually happening)
It is not the probability that the null hypothesis is true
It is not the probability that you will replicate
p < 0.05 is a convention from the 1920s, not a law of nature
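
One way to make the definition concrete is a permutation test, sketched below with the standard library only. This is an illustration of the idea, not the method used by any study cited here; the function name `permutation_p_value` and the task times are made up. If nothing is happening, the group labels are arbitrary, so the code shuffles them many times and counts how often the shuffled difference is at least as large as the one actually observed.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    split = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel participants at random
        difference = abs(mean(pooled[:split]) - mean(pooled[split:]))
        if difference >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up task completion times in minutes.
    with_tool = [31, 42, 28, 35, 39, 30]
    without_tool = [44, 38, 51, 47, 40, 45]
    print(permutation_p_value(with_tool, without_tool))
```

The result is a statement about the data under the "nothing is happening" assumption; it says nothing directly about how likely that assumption is.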

Effect Size Matters

Statistical significance tells you whether an effect can be distinguished from chance
Effect size tells you how large it is
A study with thousands of participants can find statistically significant effects that are too small to matter in practice
Always report effect size alongside p-values; one without the other is incomplete
- Equally, if a study doesn't report both, it probably has other flaws as well
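
Cohen's d is one common effect size for comparing two groups: the difference in means divided by the pooled standard deviation. The sketch below assumes two independent samples with at least two values each; the function name `cohens_d` and the numbers are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


if __name__ == "__main__":
    # Made-up scores: means 4 and 3, both with standard deviation 2.
    print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

By Cohen's widely quoted rule of thumb, 0.2 is small, 0.5 is medium, and 0.8 is large, but what counts as large enough to matter depends on the decision being made.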

Most SE Experiments Are Underpowered

Statistical power is the probability of detecting an effect if one exists
Studies with 20–30 participants can only detect very large effects [Kampenes2007]
Most software engineering experiments fall far below this threshold
The effects you do detect in underpowered studies are inflated
- This is the winner's curse: only overestimates clear the significance bar, and negative results rarely get published
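
Power can be estimated by simulation: generate many fake experiments with a known true effect and count how often the test detects it. The sketch below makes several simplifying assumptions that are not from the talk: normally distributed outcomes with standard deviation 1, equal group sizes, and a fixed critical value of 2.0 as a rough stand-in for the two-sided 0.05 threshold of a t-test. The function name `estimated_power` is hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def estimated_power(effect_size, n_per_group, n_simulations=2000,
                    critical_t=2.0, seed=0):
    """Fraction of simulated experiments in which |t| exceeds critical_t."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_simulations):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        standard_error = sqrt(
            variance(control) / n_per_group + variance(treated) / n_per_group
        )
        t_statistic = (mean(treated) - mean(control)) / standard_error
        if abs(t_statistic) > critical_t:
            detected += 1
    return detected / n_simulations


if __name__ == "__main__":
    for effect in (0.2, 0.5, 0.8):
        print(effect, estimated_power(effect, n_per_group=25))
```

Running it with different values of `n_per_group` shows how quickly power falls as samples shrink, which is the practical point of the slide.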

Observational Studies: Watching the World

Observational studies measure the world as it is, without manipulating variables
Advantages: real-world behavior, large datasets, no ethical concerns about withholding interventions
Disadvantage: you cannot establish causation because confounding variables cannot be ruled out
Mining software repositories (MSR) is the most common approach

Looking Where the Light Is

GitHub data is not a representative sample of software development
Inactive repositories, class assignments, personal experiments, and mirrors all appear alongside production software
Survivorship bias: you only see projects that still exist
Example: projects with more tests also tend to have more experienced developers
- So you cannot attribute lower defect rates to testing alone

Reading Studies Critically

Start with the abstract: what claim is being made?
Jump to the methods before reading the results
- Given this design, what can this study actually establish?
Read the limitations section
- What do the authors say they cannot conclude?
- If this feels flimsy, the rest of the paper probably is as well

HARKing and p-Hacking

HARKing (Hypothesizing After Results are Known): writing a paper as if a pattern found during analysis was predicted in advance
p-hacking: trying multiple analyses until you get p < 0.05, then reporting only that analysis
- With 20 independent tests at a 0.05 threshold, you expect one false positive by chance even if no real effect exists
Pre-registration commits hypotheses before data collection, making both problems visible
And if authors haven't shared data, there's a good chance there's an error in their work [Wicherts2011]
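
The arithmetic behind the multiple-testing problem is short enough to check directly. Assuming the tests are independent and no real effect exists, the expected number of false positives is the number of tests times the threshold, and the chance of at least one is one minus the chance of none. The function names below are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of false positives when no real effect exists."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests is a false positive."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    print(expected_false_positives(20))                # 1.0
    print(round(chance_of_any_false_positive(20), 2))  # 0.64
```

So trying 20 analyses gives roughly a 64% chance of finding something "significant" in pure noise.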

A Checklist for Evaluating a Study

Conclusion validity: Was the sample large enough? Are effect sizes reported?
Internal validity: Was there a control group? Was assignment random? Could novelty or learning effects explain the result?
Construct validity: Does the measurement actually capture the claim?
External validity: Who are the subjects, and are they representative of the population the conclusions address?

Goal-Question-Metric

GQM provides a structured path from intent to measurement [Basili1994]
Define the goal: what object, what property, from whose viewpoint, in what context?
Generate the questions whose answers would tell you whether the goal was achieved
Identify the specific, operationalized metric that answers each question
- And challenge whether the metric actually measures what you care about
- Again, it helps to do this with someone outside your org

Starting Cheaply

A small, informal study produces better evidence than a meeting
"Research is the process of finding out what you don't know" [Hall2019]
The standard for "enough" depends on the decision you are trying to support
- Deciding to run a larger study: five sessions may be enough
- Deciding to mandate a tool for five hundred developers: nope
Your goal is not to be rigorous
It is to be better informed at minimal cost

Formative vs. Summative Evaluation

Formative evaluation: studying something in order to improve it
- Small samples, rapid iteration, qualitative focus
- The question is "what is wrong?" not "how often is it wrong?"
Summative evaluation: assessing whether something works well enough to deploy
- Requires more participants, controlled conditions, and quantitative metrics
Most of what you can do without institutional support is formative
Mislabeling formative evidence as summative kills credibility

Finding Participants Without a Budget

Post in chat: "I need 45 minutes of your time to watch you do a code review"
Ask colleagues in adjacent teams who are not on your project
Aim for people who fit your target profile, not whoever responds first
Avoid people you manage, who manage you, or who already know your hypothesis
Five participants is typically enough to identify most problems in a specific workflow [Nielsen1993]
- This does not apply to claims about "most developers"

Running a Think-Aloud Session

Ask participants to verbalize their thoughts as they work: "say out loud whatever you are thinking"
Give concrete, realistic tasks, not open exploration
- "Given this pull request, use the AI assistant to write a one-paragraph summary of the changes"
Observe where participants hesitate, backtrack, or state expectations that turn out to be wrong
Do not help when they get stuck or explain what the tool was designed to do
- You will not be there when real users encounter the same problems

The RITE Method and What You Can Claim

RITE (Rapid Iterative Testing and Evaluation): fix the most severe problem you observed before the next session [Medlock2002]
- A session that reveals a new problem is more useful than one that confirms a known one
What five-session informal studies can support:
- "We observed developers getting stuck at this step"
- "In most sessions, participants…"
What they cannot support:
- "X% of developers have this problem"
- "Using this tool causes developers to write more bugs"

So, What Do We Know?

Code is statistically more repetitive and predictable than natural language [Hindle2016]
- This is why language models work well for it
Nearly all studies are short-term, use narrow tasks, and rely on volunteers
- External validity to professional development is largely assumed
Controlled experiments show AI tools can speed up specific, well-defined tasks for individual developers
Effects on end-to-end delivery (defect rates, lead time) are much less clear

What to Do Next

Ask for evidence before adopting tools: what study supports this claim, is it applicable to your context, and can you see the raw data?
When an executive cites an AI productivity statistic, ask who funded the study, what specific task was studied, and whether there was a control group
Document your organization's experience and share it
- Your customers are more likely to trust you if you're honestly self-critical
- Your staff are more likely to trust you as well

Sharing Results Responsibly

Every study has limitations; state them before stakeholders ask
Distinguish between "we found no effect" and "our study was not designed to detect that effect"
Present uncertainty: confidence intervals and effect sizes belong in results presented to management
Negative results matter too
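
One low-cost way to attach uncertainty to a result is a bootstrap confidence interval, sketched below with the standard library. This is a simple percentile bootstrap under the assumption of two independent samples; the function name `bootstrap_ci` and the task times are made up for illustration. It resamples each group with replacement many times and reports the range that contains the middle 95% of the resampled differences.

```python
import random
from statistics import mean


def bootstrap_ci(group_a, group_b, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_resamples):
        resample_a = rng.choices(group_a, k=len(group_a))  # with replacement
        resample_b = rng.choices(group_b, k=len(group_b))
        differences.append(mean(resample_a) - mean(resample_b))
    differences.sort()
    tail = (1 - confidence) / 2
    lower = differences[int(tail * n_resamples)]
    upper = differences[int((1 - tail) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up task completion times in minutes.
    without_tool = [30, 42, 35, 50, 38, 45]
    with_tool = [28, 33, 31, 40, 29, 36]
    print(bootstrap_ci(without_tool, with_tool))
```

With samples this small the interval will be wide, and showing that width to management is the point.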

Thank You

Greg Wilson

gvwilson@third-bit.com

http://third-bit.com/notwrong/

start where you are · use what you have · help who you can