Running Studies in Your Organization
In 2014, Andrew Begel and Nachiappan Nagappan surveyed both researchers and engineers at Microsoft to find out which questions practitioners most want answered and which research methods they find credible [Begel2014]. The top questions were practical: how do I hire the right people? How do I improve team productivity? How do I reduce technical debt? The most credible methods, in practitioners' view, were those that used their own data from their own systems. Practitioners did not want controlled experiments or thematic analyses of interviews with strangers—they wanted evidence from people like them, in contexts like theirs.
You are probably one of those practitioners.
Ethical Considerations
- Research on people requires ethical oversight, even when it is "internal" research on your own employees
- Informed consent means participants know
what data is being collected, how it will be used, and that they can
withdraw without penalty
- "It's in the employment contract" is not informed consent for a study
- Privacy: commit data, ticket data, and code review data can reveal individual performance, health issues (through absence patterns), and dissatisfaction
- Harm: studies that measure individual productivity can be used to justify performance management decisions, creating pressure on participants to perform, not just participate
- "It's just internal data" is not a free pass: the fact that you already have access to data does not mean using it for research is harmless
- When monitoring developers — even with their consent — store data locally or give participants full control over what is shared; some developers will leave an organization rather than submit to monitoring they perceive as surveillance
- Monitoring that erodes the psychological safety that predicts team performance defeats the purpose of the study before it begins
- Consider involving your legal, HR, and ethics teams before collecting data on individuals
Practical Study Designs for Practitioners
- Most practitioners cannot run a randomized controlled trial: no random assignment, no control group, limited time, limited budget
- Practical alternatives:
- Retrospective analysis: study what already happened using data you already have (commit logs, ticket history, review comments)
- Interrupted time series: use a policy or tool change as a natural experiment
- Internal survey: structured survey of your own team, with clear scope and honest limitations
- A/B testing: if you are deploying software, random assignment of users to conditions is often feasible and already part of your infrastructure
- Be honest with yourself about what design you actually have, not what you wish you had
Identifying Waste as a Study Entry Point
- If your organization is skeptical of productivity measurement but open to process improvement, frame the study around waste identification rather than output measurement
- Ask: what work consumes effort without delivering value? The answer is usually less politically charged than "who is productive?" and often surfaces actionable interventions
- Sedano et al.'s nine waste categories (building wrong features, backlog mismanagement, rework, unnecessary complexity, cognitive load, psychological distress, waiting, knowledge loss, ineffective communication) are a useful starting checklist [Sedano2017]
- Waste removal is often easier to prioritize than productivity improvement because it is clear what "done" looks like
Scoping a Study
- A study that tries to answer "does AI make our developers more productive?" cannot be answered without first answering what you mean by productive, which developers, at which tasks, over what time period
- A scoped question: "Did the time to first review decrease for pull requests submitted after we deployed Copilot, compared to the three months before?"
- Scoping criteria: one team, one metric, one comparison, one time period
- If your scoped question does not answer what stakeholders actually need to know, that is important information—surface it before you run the study
Defining What to Measure: GQM
- Knowing you want to study "productivity" is not enough: you need a specific, defensible measurement before you collect any data
- The Goal-Question-Metric (GQM) approach provides a
structured path from intent to measurement [Basili1994]:
- Define the goal: what object are you studying, what property of that object, from whose viewpoint, and in what context?
- Generate the questions whose answers would tell you whether the goal was achieved
- Identify the metric — the specific, operationalized number — that answers each question
- Example: goal = "understand the impact of AI coding assistants on
delivery speed, from the engineering manager's viewpoint, in our
current sprint process"
- Question 1: does time-to-merge change after adoption? → metric: median hours from PR open to merge, per sprint
- Question 2: does the volume of completed work change? → metric: story points delivered per sprint
- Question 3: does defect rate change? → metric: bugs filed within 14 days of a PR merge
- GQM makes implicit assumptions explicit: the choice of metric embeds a claim that this number measures what you care about — and that claim can be challenged before you collect a single data point
- If you cannot identify a metric for a question, either the question is not operationalizable (and should be addressed qualitatively) or the goal needs to be refined
Iterating Your Study Design
- A first study rarely answers the question you thought you were asking: metrics turn out to be noisier than expected, the comparison group was not comparable, or the interesting result was a side-effect you did not plan to measure
- Invest no more than a quarter of your budget in the first run; use what you learn to refine the question, metrics, and design before committing the rest [Wohlin2000]
- Common refinements after a first run:
- A variable you treated as a parameter (held constant) actually varied and affected results — add it as a blocking variable or factor
- Your primary metric has low construct validity — replace or supplement it
- The comparison was unfair — one group had more experience, better tooling, or easier tasks
- Negative results from a first run are often more informative than a clean positive: they tell you what the interesting question actually is
- Document each iteration, including what changed and why — this is the study history that makes later replication and interpretation possible
Sharing Results Responsibly
- Every study has limitations; state them before stakeholders ask
- Distinguish between "we found no effect" and "our study was not designed to detect an effect"
- Present uncertainty: confidence intervals, effect sizes, and sample sizes belong in results presented to management, not just in appendices
- Resist pressure to oversell: a negative result honestly reported is more valuable than a positive result that cannot be replicated
- Consider who will see the results and how they might be misused
Working with Legal, HR, and Management
- Get legal and HR involved before you collect individual-level data; retroactive approval is harder than proactive review
- Management may want to use study results for performance evaluation; clarify before you start whether this is permitted and communicate it to participants
- Document your study design, analysis plan, and data handling before you begin—this creates accountability and reduces the temptation to change the analysis after seeing results
- If you cannot publish externally, write an internal report that follows the same standards you would apply to a published paper
What to Do When You Cannot Publish
- Internal research is valuable even without external publication
- Write a memo or report with methods, results, limitations, and conclusions
- Share it with the team that was studied, not just the leadership team that commissioned the study
- Archive the data and analysis scripts so the study can be revisited or extended
- Be aware that internal research may be subject to legal discovery in litigation, especially if it touches on employment decisions
Misconceptions
- Internal data has no ethical constraints because the company already owns it.
- Collecting data for operational purposes (paying people, tracking bugs) does not authorize using it for research purposes. The intended use of data matters, and research use requires separate consideration of consent, privacy, and potential harm.
- A/B testing is a reliable all-purpose tool for internal studies.
- A/B testing works well for product features where random assignment is natural and outcomes are measured automatically. Most questions about developer productivity—does code review help, does pair programming improve quality—do not fit this template.
- Measuring developer productivity motivates developers to be more productive.
- Metrics change behavior, but not always in the intended direction. Developers subjected to daily productivity tracking often become less autonomous and more stressed — and may optimize the metric rather than the work [Sadowski2019]. The purpose of measurement is to improve outcomes. If measurement degrades the psychological conditions that produce good work, it defeats itself.
- Negative results from an internal study are not worth reporting.
- A well-run study that finds no effect is valuable: it prevents the organization from acting on a false assumption and contributes to a more accurate picture of what actually works. Suppressing negative results creates the same file-drawer problem inside organizations that it creates in academic publishing.
Check Understanding
Why is "it's in the employment contract" insufficient as a basis for informed consent in a workplace study?
Informed consent requires that participants know specifically what is being studied, what data is collected, how it will be used, and that they can withdraw without consequence. An employment contract grants broad data collection rights as a condition of employment—participants cannot meaningfully withdraw without leaving their job. This creates a power imbalance that undermines the voluntariness of consent. Research ethics require that consent be specific, informed, and genuinely voluntary.
A team wants to study the effect of mandatory code review on defect rates. They have six months of data before the policy and six months after. What design are they using, and what is its main threat to validity?
They are using an interrupted time series design. The main threat to validity is that other things changed at the same time as the policy: team composition, codebase complexity, the types of features being built, or external tooling. Without a control group that did not change its code review policy, it is impossible to separate the effect of code review from other concurrent changes. A stronger design would include a comparison team.
A manager asks you to run a study on "which developers are least productive" using commit data. Identify two specific ethical concerns and one way to address each.
First, using commit data as a productivity measure has low construct validity and can be gamed—measuring it for performance purposes creates perverse incentives (more commits, not better work). Address this by using outcome metrics (defect rates, feature delivery time) rather than activity metrics. Second, identifying "least productive" individuals creates harm risk: participants may face disciplinary action, feel surveilled, or experience increased anxiety. Address this by committing in advance that the study results will be used for process improvement, not performance management, and communicating this to all participants.
The following study report contains a presentational problem. Identify it: "We found that time-to-merge decreased by 18% after deploying the tool (p = 0.02). We recommend all teams adopt this tool immediately."
The recommendation does not account for the limitations of the study or the uncertainty in the estimate. An 18% decrease with p = 0.02 is statistically significant but the effect size, confidence interval, and practical significance are not reported. More importantly, "all teams" is an overgeneralization—the study covers one team, one tool, one metric, and one time period. A responsible recommendation would specify the conditions under which adoption seems warranted and acknowledge the limitations of generalizing from a single internal study.
Exercises
Ethics Review (15 minutes)
You want to study whether weekly one-on-one meetings between managers and developers improve developer satisfaction and retention. Write a one-page ethics review that identifies: what data you would collect, the risks to participants, how you would obtain informed consent, and how you would protect participant privacy against retaliation if a participant rates their manager poorly.
Scope It Down (15 minutes)
Take one of the following broad questions and write a scoped version that could be answered with data available to a typical software team in three months. Specify the metric, the comparison, and the time period:
- Does using an AI coding assistant improve code quality?
- Does test-driven development reduce bug rates?
- Does remote work affect team collaboration?
Write the Limitations (20 minutes)
You ran an internal A/B test: 30 developers used an AI code review tool for 8 weeks and 30 did not; the treatment group's time-to-merge decreased by 22%. Write the "Limitations" section of the internal report. Address at least: sample size and power, external validity, alternative explanations, and one ethical consideration.