Conclusion: What to Do Next

In 2001, Mark Harman and Bryan Jones published a paper proposing that software engineering problems could be treated as search problems—find the test suite that maximizes coverage, or the configuration that minimizes defects—and solved with optimization algorithms [Harman2001]. At the time, this seemed like an interesting theoretical curiosity. Two decades later, search-based software engineering is a thriving subfield with annual conferences, graduate programs, and practical tools. It exists because someone decided to ask a question that had not previously been asked.

The questions researchers ask shape what the field knows. The questions practitioners ask shape what gets funded and studied. You now have the tools to ask better questions.

Matching Research Questions to Methods

What We Know About AI and Developer Productivity

What Is Being Oversold

Staying Current Without Drowning

What Good Research Would Look Like

How to Push for Better Evidence

Misconceptions

The evidence on AI and developer productivity is settled.
The field is less than a decade old, dominated by short-term studies on narrow tasks, heavily influenced by industry funding, and almost entirely lacking independent replication. The honest summary is that some things seem to be true in some contexts; very little has been established across contexts.
More published studies on a topic means we know more about it.
More studies increase knowledge only if they are well-designed, independently replicated, and free of systematic bias. A hundred underpowered studies funded by the same industry stakeholders add less than one independent, pre-registered, adequately powered replication.
Practitioner experience is a reliable substitute for empirical study.
Practitioners are subject to the same cognitive biases as anyone else: confirmation bias, availability heuristics, and the tendency to remember vivid cases over representative ones. Systematic study is not a criticism of experience—it is a way to check whether experience is misleading you.

Check Understanding

A colleague says: "There are hundreds of studies showing AI tools improve developer productivity—the evidence is overwhelming." What is the most important methodological concern you would raise?

Publication bias: studies showing positive effects are more likely to be published than studies showing null or negative effects. A large number of published positive studies therefore does not show that the weight of evidence is positive; null and negative results may be sitting in file drawers. A well-conducted meta-analysis that checks for publication bias (for example, with funnel plots) and adjusts for it (for example, with trim-and-fill) would give a more reliable picture than a count of positive studies.
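The file-drawer argument can be made concrete with a small simulation. The sketch below is illustrative only: the function names, the sample size of 20 per group, the count of 2,000 studies, and the rule that only significantly positive results get published are all assumptions chosen for the example, not figures from any real literature. It simulates studies of a tool whose true effect is zero and compares the full set of results with the "published" subset.

```python
import random
import statistics


def simulate_study(true_effect, n_per_group, rng):
    """Simulate one two-group study; return (mean difference, test statistic)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    diff = statistics.mean(treated) - statistics.mean(control)
    # Standard error of the difference between two independent group means.
    std_err = ((statistics.variance(treated) + statistics.variance(control)) / n_per_group) ** 0.5
    return diff, diff / std_err


def simulate_literature(true_effect=0.0, n_studies=2000, n_per_group=20, seed=1):
    """Run many studies; 'publish' only those with a significantly positive result."""
    rng = random.Random(seed)
    all_effects, published = [], []
    for _ in range(n_studies):
        diff, stat = simulate_study(true_effect, n_per_group, rng)
        all_effects.append(diff)
        if stat > 1.96:  # assumed publication filter: positive and "significant"
            published.append(diff)
    return all_effects, published


if __name__ == "__main__":
    all_effects, published = simulate_literature()
    print("studies run:", len(all_effects))
    print("studies published:", len(published))
    print("mean effect, all studies:", round(statistics.mean(all_effects), 3))
    print("mean effect, published only:", round(statistics.mean(published), 3))
```

Under these assumptions every published study reports a positive effect even though the true effect is zero, so counting published positive studies says little about the underlying effect.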

Name two types of long-term effects of AI coding tools that have not been adequately studied, and for each, describe what method you would use to study it and why.

Deskilling: the concern that relying on AI suggestions atrophies developers' ability to solve problems independently. This is hard to study with an experiment (you cannot ethically deprive developers of tools for years). A longitudinal observational study comparing skill assessments (code challenges, interview performance, debugging exercises) over time for heavy versus light AI tool users would be one approach—with careful attention to confounding by role, tenure, and tool type.

Code maintainability: whether AI-generated code is harder or easier to maintain over time. A mining software repositories (MSR) study following repositories that adopted AI coding tools and measuring maintainability metrics (churn rate, code complexity) over one to two years would provide evidence. Time to understand unfamiliar code cannot be read from repository data, so it would need a complementary comprehension study. Both call for caution about confounding and selection bias.

You are asked to advise your organization on whether to mandate AI coding tools for all developers. What two pieces of evidence would you most want before making a recommendation, and why?

Answers will vary, but strong answers will ask for: (1) evidence from a context similar to your organization (not students on artificial tasks), including effect sizes and confidence intervals; and (2) evidence on outcome measures that matter to your organization (defect rates, lead time, developer satisfaction), not just speed on isolated tasks. Additional credit for noting the importance of long-term evidence (not just 90-minute experiments) and for asking about conflict of interest in the cited studies.
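To show what "effect sizes and confidence intervals" look like in practice, here is a minimal sketch that computes Cohen's d with an approximate 95% interval. The task times are made-up numbers for illustration, the function name is hypothetical, and the interval uses a common large-sample approximation for the standard error of d, so treat it as a rough guide rather than an exact result for small samples.

```python
import math
import statistics


def cohens_d_with_ci(group_a, group_b, z=1.96):
    """Return (d, lower, upper): Cohen's d for a minus b with an approximate 95% CI."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
    # Large-sample approximation to the standard error of d.
    std_err = math.sqrt((n_a + n_b) / (n_a * n_b) + d * d / (2 * (n_a + n_b)))
    return d, d - z * std_err, d + z * std_err


if __name__ == "__main__":
    # Hypothetical task completion times in minutes (invented for this example).
    with_tool = [38, 45, 41, 52, 36, 47, 40, 44]
    without_tool = [44, 50, 39, 55, 48, 43, 51, 46]
    d, low, high = cohens_d_with_ci(with_tool, without_tool)
    print(f"d = {d:.2f}, approximate 95% CI [{low:.2f}, {high:.2f}]")
```

With only eight developers per group, a point estimate that looks large can still come with an interval that includes zero, which is why a reported speedup without an interval is hard to act on.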

The following sentence contains a reasoning error. Identify and correct it: "No study has found that AI tools deskill developers, so we can safely assume they don't."

This is absence of evidence used as evidence of absence. The lack of published studies showing deskilling could mean deskilling does not occur, or it could mean no one has run a well-powered long-term study designed to detect it. Given that most AI coding tool studies are short-term and focused on productivity, the absence of deskilling findings is unsurprising and uninformative. The correct statement is: "The evidence on whether AI tools cause deskilling is insufficient to draw conclusions."
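A rough power calculation shows why a missing finding can be uninformative. The sketch below uses a normal approximation for a two-sided, two-group comparison; the function name, the effect size of 0.2 standard deviations, and the sample sizes are assumptions picked for illustration, not estimates of any real deskilling effect.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic when the true standardized effect is effect_size.
    shift = effect_size * sqrt(n_per_group / 2)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (20, 100, 400):
        power = approx_power(0.2, n)
        print(f"n per group = {n}: chance of detecting d = 0.2 is about {power:.2f}")
```

Under these assumptions, a study with 20 developers per group would miss a small real effect most of the time, so its null result would say little about whether the effect exists.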

Exercises

Research Agenda (20 minutes)

Write a one-page research agenda for one question you care about related to AI and software development. Include: the specific research question, the method you would use, the sample you would need, the outcome measure, two threats to validity, and one ethical consideration. Be honest about what your proposed design can and cannot establish.

Evidence Briefing (20 minutes)

Prepare a five-minute verbal briefing for a non-technical manager who wants to know "what does the research say about AI coding tools?" Write bullet points covering: what the evidence actually shows, what important questions are unanswered, and one specific caution about how to interpret the most widely cited statistic they are likely to have heard.

The Question Behind the Question (15 minutes)

A manager says: "We're spending $X per developer per year on AI coding tools. Are we getting our money's worth?" Restate this as three specific, answerable research questions. For each, name the method you would use and one data source you would need. Then explain which question matters most to the organization and why.