Running Studies in Your Organization

In 2014, Andrew Begel and Thomas Zimmermann surveyed both researchers and engineers at Microsoft to find out which questions practitioners most want answered and which research methods they find credible [Begel2014]. The top questions were practical: how do I hire the right people? How do I improve team productivity? How do I reduce technical debt? The most credible methods, in practitioners' view, were those that used their own data from their own systems. Practitioners did not want controlled experiments or thematic analyses of interviews with strangers; they wanted evidence from people like them, in contexts like theirs.

You are probably one of those practitioners.

Ethical Considerations

Practical Study Designs for Practitioners

Identifying Waste as a Study Entry Point

Scoping a Study

Defining What to Measure: GQM

Iterating Your Study Design

Sharing Results Responsibly

Working with Legal, HR, and Management

What to Do When You Cannot Publish

Misconceptions

Internal data has no ethical constraints because the company already owns it.
Collecting data for operational purposes (paying people, tracking bugs) does not authorize using it for research purposes. The intended use of data matters, and research use requires separate consideration of consent, privacy, and potential harm.
A/B testing is a reliable all-purpose tool for internal studies.
A/B testing works well for product features where random assignment is natural and outcomes are measured automatically. Most questions about developer productivity—does code review help, does pair programming improve quality—do not fit this template.
Measuring developer productivity motivates developers to be more productive.
Metrics change behavior, but not always in the intended direction. Developers subjected to daily productivity tracking often become less autonomous and more stressed — and may optimize the metric rather than the work [Sadowski2019]. The purpose of measurement is to improve outcomes. If measurement degrades the psychological conditions that produce good work, it defeats itself.
Negative results from an internal study are not worth reporting.
A well-run study that finds no effect is valuable: it prevents the organization from acting on a false assumption and contributes to a more accurate picture of what actually works. Suppressing negative results creates the same file-drawer problem inside organizations that it creates in academic publishing.

Check Understanding

Why is "it's in the employment contract" insufficient as a basis for informed consent in a workplace study?

Informed consent requires that participants know specifically what is being studied, what data is collected, how it will be used, and that they can withdraw without consequence. An employment contract grants broad data collection rights as a condition of employment—participants cannot meaningfully withdraw without leaving their job. This creates a power imbalance that undermines the voluntariness of consent. Research ethics require that consent be specific, informed, and genuinely voluntary.

A team wants to study the effect of mandatory code review on defect rates. They have six months of data before the policy and six months after. What design are they using, and what is its main threat to validity?

They are using an interrupted time series design: a before-and-after comparison of the same team around a policy change. The main threat to validity is history, meaning that other things changed at the same time as the policy: team composition, codebase complexity, the types of features being built, or external tooling. Without a control group that did not change its code review policy, the effect of code review cannot be separated from those concurrent changes. A stronger design would include a comparison team.
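One common way to use a comparison team is a difference-in-differences estimate: subtract the comparison team's before/after change from the reviewed team's change. The sketch below is a minimal illustration, not a full analysis. The function name and all of the monthly defect counts are hypothetical, and the estimate is only meaningful if the two teams would have followed similar trends without the policy.

```python
from statistics import mean


def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated team minus change in the comparison team."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical monthly defect counts (illustrative numbers, not real data).
review_team_before = [12, 14, 13, 15, 12, 14]
review_team_after = [10, 9, 11, 10, 9, 11]
comparison_before = [11, 13, 12, 12, 13, 11]
comparison_after = [10, 12, 11, 11, 12, 10]

# Naive before/after difference for the team that adopted code review.
naive = mean(review_team_after) - mean(review_team_before)

# Same difference after removing the change the comparison team also saw.
did = difference_in_differences(review_team_before, review_team_after,
                                comparison_before, comparison_after)

print(f"naive before/after change: {naive:.2f} defects per month")
print(f"difference-in-differences: {did:.2f} defects per month")
```

With these made-up numbers the naive estimate is about -3.33 defects per month, but about one defect per month of that drop also appears in the comparison team, so the adjusted estimate is about -2.33.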

A manager asks you to run a study on "which developers are least productive" using commit data. Identify two specific ethical concerns and one way to address each.

First, using commit data as a productivity measure has low construct validity and can be gamed—measuring it for performance purposes creates perverse incentives (more commits, not better work). Address this by using outcome metrics (defect rates, feature delivery time) rather than activity metrics. Second, identifying "least productive" individuals creates harm risk: participants may face disciplinary action, feel surveilled, or experience increased anxiety. Address this by committing in advance that the study results will be used for process improvement, not performance management, and communicating this to all participants.

The following study report contains a presentational problem. Identify it: "We found that time-to-merge decreased by 18% after deploying the tool (p = 0.02). We recommend all teams adopt this tool immediately."

The recommendation does not account for the limitations of the study or the uncertainty in the estimate. An 18% decrease with p = 0.02 is statistically significant, but the report gives no confidence interval or sample size and does not say whether an 18% change matters in practice. More importantly, "all teams" is an overgeneralization: the study covers one team, one tool, one metric, and one time period. A responsible recommendation would specify the conditions under which adoption seems warranted and acknowledge the limitations of generalizing from a single internal study.
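Reporting an interval alongside the point estimate is one way to show that uncertainty. The sketch below uses a percentile bootstrap on the percent change in mean time-to-merge. It is an assumption-laden illustration: the function names are hypothetical, the hours are invented, and it treats the before and after pull requests as two independent samples.

```python
import random
from statistics import mean


def percent_change(before, after):
    """Percent change in the mean from before to after."""
    return 100.0 * (mean(after) - mean(before)) / mean(before)


def bootstrap_ci(before, after, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the percent change in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_before = rng.choices(before, k=len(before))
        resampled_after = rng.choices(after, k=len(after))
        estimates.append(percent_change(resampled_before, resampled_after))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical time-to-merge values in hours (illustrative, not real data).
hours_before = [30, 42, 25, 51, 38, 47, 29, 36, 44, 33]
hours_after = [26, 35, 22, 40, 31, 37, 25, 30, 34, 28]

point = percent_change(hours_before, hours_after)
lower, upper = bootstrap_ci(hours_before, hours_after)
print(f"change in mean time-to-merge: {point:.1f}% "
      f"(95% bootstrap interval {lower:.1f}% to {upper:.1f}%)")
```

A report that says "roughly 18% faster, with a plausible range of X% to Y%" lets readers judge for themselves whether the low end of the range would still justify adoption.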

Exercises

Ethics Review (15 minutes)

You want to study whether weekly one-on-one meetings between managers and developers improve developer satisfaction and retention. Write a one-page ethics review that identifies what data you would collect, the risks to participants, how you would obtain informed consent, and how you would protect participants' privacy and guard against retaliation if a participant rates their manager poorly.

Scope It Down (15 minutes)

Take one of the following broad questions and write a scoped version that could be answered with data available to a typical software team in three months. Specify the metric, the comparison, and the time period:

Write the Limitations (20 minutes)

You ran an internal A/B test: 30 developers used an AI code review tool for 8 weeks and 30 did not; the treatment group's time-to-merge decreased by 22%. Write the "Limitations" section of the internal report. Address at least: sample size and power, external validity, alternative explanations, and one ethical consideration.
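As a starting point for the sample size and power discussion, the sketch below estimates the smallest standardized effect (Cohen's d) that a two-group comparison with 30 developers per group could reliably detect. It uses the normal approximation for a two-sided test and assumes equal group sizes and equal variances. The function name is hypothetical, and an exact t-based calculation would give a slightly larger value.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest Cohen's d detectable with two equal groups."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_power = standard_normal.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2.0 / n_per_group)


mde = minimum_detectable_effect(30)
print(f"minimum detectable effect with 30 per group: d = {mde:.2f}")
# prints: minimum detectable effect with 30 per group: d = 0.72
```

Whether a 22% decrease in time-to-merge corresponds to an effect that large depends on how variable time-to-merge is across developers, which is exactly the kind of detail a Limitations section should state.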