Running Studies in Your Organization

In 2014, Andrew Begel and Thomas Zimmermann surveyed researchers and engineers at Microsoft to find out which questions practitioners most want answered [Begel2014]. The top questions were practical: how do I hire the right people? How do I improve team productivity? How do I reduce technical debt? The most credible methods, in practitioners' view, were those that used their own data from their own systems. Practitioners did not want controlled experiments or thematic analyses of interviews with strangers; they wanted evidence from people like them, in contexts like theirs. We created these lessons because you probably do too.

Ethical Considerations

Practical Study Designs for Practitioners

Identifying Waste as a Study Entry Point

Scoping a Study

Defining What to Measure: GQM

Iterating Your Study Design

Sharing Results Responsibly

Working with Legal, HR, and Management

What to Do When You Cannot Publish

Misconceptions

Internal data has no ethical constraints because the company already owns it.
Collecting data for operational purposes does not authorize using it for research purposes. The intended use of data matters, and research use requires separate consideration of consent, privacy, and potential harm.
A/B testing is a reliable all-purpose tool for internal studies.
A/B testing works well for product features where random assignment is natural and outcomes are measured automatically. Most questions about developer productivity do not fit this description.
Measuring developer productivity motivates developers to be more productive.
Developers subjected to daily productivity tracking often become less autonomous and more stressed, and may optimize the metric rather than the work. If measurement degrades the psychological conditions that produce good work, it defeats itself.
Negative results from an internal study are not worth reporting.
A well-run study that finds no effect is valuable: it prevents the organization from acting on a false assumption and contributes to a more accurate picture of what actually works. Suppressing negative results creates the same file-drawer problem inside organizations that it creates in academic publishing.

Check Understanding

Why is "it's in the employment contract" insufficient as a basis for informed consent in a workplace study?

Informed consent requires that participants know specifically what is being studied, what data is collected, how it will be used, and that they can withdraw without consequence. An employment contract grants broad data collection rights as a condition of employment: participants cannot meaningfully withdraw without leaving their job. This creates a power imbalance that undermines the voluntariness of consent. Research ethics require that consent be specific, informed, and genuinely voluntary.

A team wants to study the effect of mandatory code review on defect rates. They have six months of data before the policy and six months after. What design are they using, and what is its main threat to validity?

They are using an interrupted time series design, which compares the same team before and after the change. The main threat to validity is history: other things, such as staffing, release schedules, or the mix of work, may have changed at the same time as the policy. Without a control group that did not change its code review policy, the effect of code review cannot be separated from those concurrent changes. A stronger design would add a comparison team, though one may not be available.
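To see why a comparison team strengthens the design, the sketch below computes a difference-in-differences estimate using only Python's standard library. The function name `diff_in_diff` and all monthly defect counts are hypothetical, invented for illustration rather than taken from any real study, and the estimate is only meaningful under the assumption that both teams would have followed similar trends without the policy.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated team minus change in the comparison team."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical monthly defect counts: six months before and six months after.
review_team_before = [20, 22, 19, 21, 23, 21]
review_team_after = [15, 16, 14, 17, 15, 13]
comparison_before = [18, 20, 19, 21, 20, 22]
comparison_after = [17, 18, 16, 19, 17, 15]

# Naive estimate: looks only at the team that adopted mandatory review.
naive_change = mean(review_team_after) - mean(review_team_before)

# Adjusted estimate: subtracts the change seen in the team without the policy.
adjusted_change = diff_in_diff(
    review_team_before, review_team_after, comparison_before, comparison_after
)

print(f"Naive before/after change: {naive_change} defects per month")
print(f"Difference-in-differences: {adjusted_change} defects per month")
```

With these made-up numbers the naive change is -6 defects per month, but half of that drop also appears in the team that did not adopt the policy, so the adjusted estimate is -3.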

A manager asks you to run a study on "which developers are least productive" using commit data. Identify two specific ethical concerns and one way to address each.

First, using commit data as a productivity measure has low construct validity and can be gamed. Address this by using outcome metrics (e.g., defect rates or feature delivery time) rather than activity metrics. Second, identifying the "least productive" individuals creates a risk of harm: participants may face disciplinary action, feel surveilled, or experience increased anxiety. Address this by committing in advance that the study results will be used for process improvement, not performance management, and by communicating this commitment to all participants.

The following study report contains a presentational problem. Identify it: "We found that time-to-merge decreased by 18% after deploying the tool (p = 0.02). We recommend all teams adopt this tool immediately."

The recommendation does not account for the limitations of the study or the uncertainty in the estimate. An 18% decrease with p = 0.02 is statistically significant at the conventional 0.05 level, but the report gives no confidence interval, no sample size, and no indication of whether an 18% change matters in practice. More importantly, "all teams" is an overgeneralization: the study covers one team, one tool, one metric, and one time period. A responsible recommendation would specify the conditions under which adoption seems warranted and acknowledge the limitations of generalizing from a single internal study.
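One way to report the uncertainty this answer asks for is a bootstrap confidence interval around the percentage change. The following is a minimal sketch, assuming per-pull-request merge times in hours are available for both periods. The data and the function names (`percent_change`, `bootstrap_ci`) are invented for illustration, and the percentile bootstrap is only one of several reasonable interval methods.

```python
import random
from statistics import mean


def percent_change(before, after):
    """Percentage change in mean time-to-merge from before to after."""
    return 100.0 * (mean(after) - mean(before)) / mean(before)


def bootstrap_ci(before, after, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the percentage change."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    estimates = []
    for _ in range(n_resamples):
        # Resample each period with replacement, keeping its original size.
        resampled_before = rng.choices(before, k=len(before))
        resampled_after = rng.choices(after, k=len(after))
        estimates.append(percent_change(resampled_before, resampled_after))
    estimates.sort()
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical hours from opening a pull request to merging it.
hours_before = [30, 42, 25, 51, 38, 47, 29, 36, 44, 33, 40, 27]
hours_after = [26, 31, 22, 45, 30, 35, 28, 24, 39, 29, 33, 21]

point = percent_change(hours_before, hours_after)
lower, upper = bootstrap_ci(hours_before, hours_after)
print(f"Change in mean time-to-merge: {point:.1f}% "
      f"(95% bootstrap CI {lower:.1f}% to {upper:.1f}%)")
```

Reporting the interval next to the point estimate lets readers see how wide the plausible range is when the sample is this small, which a bare p-value does not show.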

Exercises

Ethics Review (15 minutes)

You want to study whether weekly one-on-one meetings between managers and developers improve developer satisfaction and retention. Write a one-page ethics review that identifies what data you would collect, the risks to participants, how you would obtain informed consent, and how you would protect participants' privacy and guard against retaliation if a participant rates their manager poorly.

Scope It Down (15 minutes)

Take one of the following broad questions and write a scoped version that could be answered in three months with data available to a typical software team. Specify the metric, the comparison, and the time period.

Write the Limitations (20 minutes)

You ran an internal A/B test: 30 developers used an AI code review tool for 8 weeks and 30 did not, and the treatment group's time-to-merge decreased by 22%. Write the "Limitations" section of the internal report. Address sample size and power, external validity, alternative explanations, and ethical considerations.
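As one possible input to the sample size and power item, the sketch below approximates the power of a two-group comparison with 30 developers per group. It rests on several assumptions that are not in the exercise: a two-sided test at alpha = 0.05, a normal approximation to the two-sample t-test, and effect sizes expressed as standardized mean differences (Cohen's d). The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided, two-group comparison."""
    z_critical = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n_per_group / 2.0)
    # Probability of rejecting in either tail when the true effect is effect_size.
    return (STANDARD_NORMAL.cdf(shift - z_critical)
            + STANDARD_NORMAL.cdf(-shift - z_critical))


def required_n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate group size needed to reach the requested power."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = STANDARD_NORMAL.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: power with 30 per group is about "
          f"{approximate_power(d, 30):.2f}; "
          f"about {required_n_per_group(d)} per group needed for 80% power")
```

Under these assumptions, 30 developers per group gives roughly even odds of detecting a medium effect (d = 0.5), which is worth stating in a limitations section alongside whatever effect size the team considers plausible.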