Conclusion: What to Do Next
In 2001, Mark Harman and Bryan Jones published a paper proposing that software engineering problems could be treated as search problems—find the test suite that maximizes coverage, or the configuration that minimizes defects—and solved with optimization algorithms [Harman2001]. At the time, this seemed like an interesting theoretical curiosity. Two decades later, search-based software engineering is a thriving subfield with annual conferences, graduate programs, and practical tools. It exists because someone decided to ask a question that had not previously been asked.
The questions researchers ask shape what the field knows. The questions practitioners ask shape what gets funded and studied. You now have the tools to ask better questions.
Matching Research Questions to Methods
- "Is there a difference?" → controlled experiment or quasi-experiment
- "How large is the difference?" → effect size and confidence intervals
- "Why does this happen?" → qualitative methods (interviews, observation)
- "What do practitioners think?" → surveys
- "What patterns exist in large codebases?" → mining software repositories
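- "What causes what?" → randomized experiments when you can randomize; natural experiments, difference-in-differences (DiD), or interrupted time series (ITS) when you cannot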
- No single method answers all questions; complex questions need mixed methods
- The hardest step is often translating a vague concern ("is AI bad for us?") into a specific, answerable question
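The "how large is the difference?" mapping above can be made concrete with a short sketch. The Python below computes Cohen's d and an approximate 95% confidence interval for the difference in mean task-completion time between two groups. The data are made-up numbers, and the function names and the normal-approximation interval are illustrative assumptions, not results or methods from any study cited here.

```python
import math
from statistics import NormalDist, mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def mean_diff_ci(a, b, level=0.95):
    """Normal-approximation CI for mean(a) - mean(b); rough for small samples."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean(a) - mean(b)
    return diff - z * se, diff + z * se

# Hypothetical task times in minutes (invented for illustration only).
control = [52, 61, 48, 70, 66, 58, 63, 55]
with_tool = [45, 59, 50, 62, 57, 49, 60, 51]

d = cohens_d(control, with_tool)
low, high = mean_diff_ci(control, with_tool)
print(f"Cohen's d = {d:.2f}; 95% CI for difference = [{low:.1f}, {high:.1f}] minutes")
```

With samples this small, a t-based interval would be somewhat wider than the normal approximation used here. Reporting the interval alongside the effect size shows how much uncertainty surrounds the estimate, which a bare "significant or not" verdict hides.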
What We Know About AI and Developer Productivity
- Controlled experiments show AI tools can speed up specific, well-defined tasks (code generation, boilerplate, test writing) for individual developers
- Effects on end-to-end software delivery (defect rates, lead time, system quality) are much less studied and less consistent
- The "best" developers appear to benefit less or even be slowed down; the "least experienced" developers show the largest gains on narrow tasks [Peng2023]
- Long-term effects (skill development, code maintainability, team knowledge sharing) have barely been studied
- Nearly all studies are short-term, use artificial or narrow tasks, and recruit volunteers or students; their generalizability to large-scale professional software development is largely assumed rather than tested
What Is Being Oversold
- Aggregate productivity claims ("AI makes everyone 55% faster") are extrapolated from studies of specific tasks; the generalization is rarely justified
- Industry reports on AI adoption frequently lack peer review, methodology sections, and conflict-of-interest disclosures
- The absence of replications is treated as evidence the original finding stands, rather than as a gap in the evidence base
- Environmental and societal costs of AI-assisted development receive almost no empirical attention in software engineering research
Staying Current Without Drowning
- Follow proceedings of top venues: ICSE, FSE, MSR, ESEM, ASE
- Use Google Scholar alerts for specific research questions rather than general topic feeds
- Check pre-registration databases (AsPredicted, OSF) for upcoming studies on topics you care about
- Be skeptical of preprints until they appear in peer-reviewed venues, and skeptical of peer-reviewed papers until they are replicated
- Build a network: one well-chosen researcher to follow is more valuable than a hundred papers to skim
What Good Research Would Look Like
- Studies of process maturity programs provide the longest-running evidence available on organizational productivity change. Across 14 CMM improvement programs, Herbsleb et al. found a median productivity improvement of 35% per year, and one company reduced rework from 41% to 6% of effort over a decade [Sadowski2019]. Most AI productivity claims have no comparable long-term evidence
- Pre-registered, adequately powered studies with real professional developers on real tasks over realistic time periods
- Independent replication before claims enter policy or practice
- Outcome measures that include code quality, team dynamics, and developer well-being, not just speed
- Transparent conflict-of-interest disclosure and independent funding
- Long-term follow-up on deskilling, dependency, and maintainability
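What "adequately powered" means in the list above can be estimated before any data are collected. The sketch below uses the standard normal-approximation formula for comparing two group means, n per group ≈ 2(z_{1-α/2} + z_{power})² / d², where d is the standardized effect size. The function name and the default values for α and power are assumptions chosen for illustration; an exact t-based calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-sided,
    two-sample comparison of means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

for d in (0.2, 0.5, 0.8):  # conventional small, medium, and large effects
    print(f"d = {d}: about {n_per_group(d)} developers per group")
```

Under this approximation, detecting a small effect takes several hundred participants per group. That is one reason studies with a few dozen volunteers cannot settle questions about modest productivity differences.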
How to Push for Better Evidence
- Ask for evidence before adopting tools at your organization: what study supports this claim, and is that study applicable to your context?
- Contribute to the evidence base: document your own organization's experience carefully and share it, even in blog posts or case studies
- Support pre-registration when you run internal studies
- When an executive cites a statistic about AI productivity, ask: who funded the study, what task was studied, and was there a control group?
Misconceptions
- The evidence on AI and developer productivity is settled.
  - The field is less than a decade old, dominated by short-term studies on narrow tasks, heavily influenced by industry funding, and almost entirely lacking independent replication. The honest summary is that some things seem to be true in some contexts; very little has been established across contexts.
- More published studies on a topic means we know more about it.
  - More studies increase knowledge only if they are well-designed, independently replicated, and free of systematic bias. A hundred underpowered studies funded by the same industry stakeholders add less than one independent, pre-registered, adequately powered replication.
- Practitioner experience is a reliable substitute for empirical study.
  - Practitioners are subject to the same cognitive biases as anyone else: confirmation bias, availability heuristics, and the tendency to remember vivid cases over representative ones. Systematic study is not a criticism of experience; it is a way to check whether experience is misleading you.
Check Understanding
A colleague says: "There are hundreds of studies showing AI tools improve developer productivity—the evidence is overwhelming." What is the most important methodological concern you would raise?
Publication bias: studies showing positive effects are more likely to be published than studies showing null or negative effects. A large number of published studies showing positive effects does not mean the weight of evidence is positive; it may mean that negative results are sitting in file drawers. A well-conducted meta-analysis that checks for publication bias (for example, with funnel plots) and adjusts for it (for example, with trim-and-fill) would give a more reliable picture than a count of positive studies.
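The file-drawer effect described above can be illustrated with a small simulation. The sketch below assumes a true effect of exactly zero, simulates many two-group studies, and "publishes" only those that find a statistically significant positive result. The number of studies, the sample size, and the publication rule are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean

def simulate_study(rng, true_effect, n_per_group):
    """Return (estimated effect, z statistic) for one two-group study
    with unit-variance outcomes."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    estimate = mean(treated) - mean(control)
    standard_error = math.sqrt(2.0 / n_per_group)  # variance is known to be 1 per group
    return estimate, estimate / standard_error

rng = random.Random(42)  # fixed seed so the run is repeatable
studies = [simulate_study(rng, true_effect=0.0, n_per_group=20) for _ in range(2000)]

# Hypothetical publication rule: only significant positive results appear in print.
published = [est for est, z in studies if z > 1.96]

print(f"All studies:       {len(studies)}, mean effect {mean(e for e, _ in studies):+.3f}")
print(f"Published studies: {len(published)}, mean effect {mean(published):+.3f}")
```

Every published estimate in this simulation is positive even though the true effect is zero, so a count of published positive studies says little about the underlying effect.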
Name two types of long-term effects of AI coding tools that have not been adequately studied, and for each, describe what method you would use to study it and why.
Deskilling: the concern that relying on AI suggestions atrophies developers' ability to solve problems independently. This is hard to study with an experiment (you cannot ethically deprive developers of tools for years). A longitudinal observational study comparing skill assessments (code challenges, interview performance, debugging exercises) over time for heavy versus light AI tool users would be one approach—with careful attention to confounding by role, tenure, and tool type.
Code maintainability: whether AI-generated code is harder or easier to maintain over time. An MSR study that follows repositories after they adopt AI coding tools and tracks maintainability metrics (churn rate, code complexity) over one to two years would provide evidence, ideally supplemented by comprehension tasks that measure time to understand unfamiliar code, which repository data alone cannot capture. Both call for caution about confounding and selection bias.
You are asked to advise your organization on whether to mandate AI coding tools for all developers. What two pieces of evidence would you most want before making a recommendation, and why?
Answers will vary, but strong answers will ask for: (1) evidence from a context similar to your organization (not students on artificial tasks), including effect sizes and confidence intervals; and (2) evidence on outcome measures that matter to your organization (defect rates, lead time, developer satisfaction), not just speed on isolated tasks. Additional credit for noting the importance of long-term evidence (not just 90-minute experiments) and for asking about conflict of interest in the cited studies.
The following sentence contains a reasoning error. Identify and correct it: "No study has found that AI tools deskill developers, so we can safely assume they don't."
This is absence of evidence used as evidence of absence. The lack of published studies showing deskilling could mean deskilling does not occur, or it could mean no one has run a well-powered long-term study designed to detect it. Given that most AI coding tool studies are short-term and focused on productivity, the absence of deskilling findings is unsurprising and uninformative. The correct statement is: "The evidence on whether AI tools cause deskilling is insufficient to draw conclusions."
Exercises
Research Agenda (20 minutes)
Write a one-page research agenda for one question you care about related to AI and software development. Include: the specific research question, the method you would use, the sample you would need, the outcome measure, two threats to validity, and one ethical consideration. Be honest about what your proposed design can and cannot establish.
Evidence Briefing (20 minutes)
Prepare a five-minute verbal briefing for a non-technical manager who wants to know "what does the research say about AI coding tools?" Write bullet points covering: what the evidence actually shows, what important questions are unanswered, and one specific caution about how to interpret the most widely cited statistic they are likely to have heard.
The Question Behind the Question (15 minutes)
A manager says: "We're spending $X per developer per year on AI coding tools. Are we getting our money's worth?" Restate this as three specific, answerable research questions. For each, name the method you would use and one data source you would need. Then explain which question matters most to the organization and why.