Starting Cheaply

In the early 2000s, Dennis Wixon's team at Microsoft was evaluating software products with the standard usability protocol: run all participants through the complete test, compile observations, write a report, and then wait for the next release cycle to fix anything. The problem was obvious once they looked for it. The second participant got stuck in the same place as the first. The third participant got stuck there too. By the time all eight participants had finished, six of them had encountered the same confusing label on the same menu and abandoned the task in the same way, while the team sat in the observation room watching it happen and carefully noting it down for the report they would write afterward.

The team's response was blunt: fix the label between sessions [Medlock2002]. If participant one demonstrates a serious problem, address it before participant two arrives. The remaining participants can then find different problems instead of re-enacting the first one. This is the Rapid Iterative Testing and Evaluation (RITE) method, and it illustrates the argument of this appendix. The goal of research is to learn things that improve decisions. Elaborate protocols can get in the way of that goal just as surely as doing nothing.

What "Enough" Actually Means

Finding Participants Without a Budget

Running a Think-Aloud Session

Participants will often look to you for reassurance that they are doing it correctly. They are not doing it correctly or incorrectly: they are showing you what actually happens. "You're doing great, just keep going" is a reasonable response to almost anything.

The RITE Method

Analyzing What You Found

Presenting Findings to Skeptics

What You Cannot Claim

Misconceptions

Five participants is enough for any research question.
Five participants is typically enough to identify most major problems in a focused usability study under formative conditions. It is not enough to make quantitative claims about any population. The five-participant guideline applies specifically to the question "what problems exist in this workflow?"—not to "how many developers are affected?" or "does this tool improve productivity?"
Guerrilla research is just asking people what they think.
Systematic observation, structured task design, consistent note-taking, and disciplined analysis are what separate research from anecdote. Asking a few colleagues over lunch whether they like the new tool is not a study; it is a poll of people who are motivated to be agreeable. The methods described in this appendix are informal, not undisciplined.
If participants complete the task, there is no problem.
Participants often complete tasks while confused, backtracking, and making incorrect assumptions that happen not to matter for that particular task. Task completion rate is not a proxy for user experience quality. What you learn from a think-aloud is not whether participants completed the task but what happened while they did.
Informal findings should not be shared because they will embarrass you.
Informal findings shared honestly, with limitations stated, are defensible. What attracts criticism is overclaiming. "We watched five people and observed this" is a reasonable contribution to a decision. "Our research proves this tool improves team productivity" is not, regardless of how formally the study is presented.
You should tell participants what you are studying so they can help.
Telling participants your hypothesis primes them to look for what you are already looking for, which makes it more likely they report what you expect rather than what is actually affecting them. Give participants tasks, not hypotheses. "Here is what we are interested in" is appropriate at the end of a session; it is not appropriate at the beginning.
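The first misconception above rests on a simple discovery model that can be sketched in a few lines. The sketch assumes that each session independently reveals a given problem with some fixed probability; the function name `proportion_found` and the two probabilities used are illustrative assumptions, not figures from this appendix. The value 0.31 is a per-session detection rate often quoted in the usability literature, and 0.10 stands in for a rarer problem.

```python
def proportion_found(n_participants: int, p_detect: float) -> float:
    """Chance that a problem is seen at least once in n independent sessions.

    Assumes every session has the same, independent probability p_detect
    of revealing the problem. Real sessions only approximate this.
    """
    return 1.0 - (1.0 - p_detect) ** n_participants


# Assumed, illustrative detection rates (not measured values).
common_problem = proportion_found(5, 0.31)   # a problem many people hit
rare_problem = proportion_found(5, 0.10)     # a problem few people hit

print(f"common problem seen in 5 sessions: {common_problem:.2f}")
print(f"rare problem seen in 5 sessions:   {rare_problem:.2f}")
```

Under these assumptions, five sessions are likely to surface a problem that many people hit and fairly likely to miss one that few people hit. The model says nothing about how many people in a population are affected, which is why the guideline does not support quantitative claims.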

Check Understanding

A team wants to compare two AI coding assistants to decide which to adopt. They have five willing participants and half a day. Should they use the RITE method? Why or why not?

No. The RITE method is designed for formative evaluation of a single product, where you observe a problem and fix it before the next session. For a comparison between two products, you need each participant to use both tools under comparable conditions (a within-subjects design) or different participants to use each tool (a between-subjects design). Using RITE for a comparison would mean changing one of the tools between sessions, making those sessions incomparable. A more appropriate approach: have each participant complete equivalent tasks with both tools in counterbalanced order, take notes on where they struggle with each, and report what you observed without claiming statistical validity.
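The counterbalanced order described in this answer can be sketched as a small scheduling helper. This is a minimal illustration, assuming two tools and a simple alternation of which one comes first; the function name `counterbalance`, the participant labels, and the tool names are all hypothetical.

```python
from itertools import cycle


def counterbalance(participants, tools=("Tool A", "Tool B")):
    """Alternate which tool each participant uses first."""
    # Two possible orders: A then B, and B then A.
    orders = cycle([tuple(tools), tuple(reversed(tools))])
    return {person: next(orders) for person in participants}


schedule = counterbalance(["P1", "P2", "P3", "P4", "P5"])
for person, order in schedule.items():
    print(person, "->", " then ".join(order))
```

With five participants the split is necessarily uneven (three start with one tool, two with the other), which is one more reason to report observations rather than claim statistical validity.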

A developer ran a five-session think-aloud study and found that three of five participants got confused at the same step. She wants to write that "60% of developers will experience this problem." What is wrong with this claim, and how should she rewrite it?

The fraction 3/5 is mathematically 60%, but presenting it as a percentage implies a precision that a sample of five cannot support: the exact (Clopper-Pearson) 95% confidence interval for three occurrences in five observations runs from approximately 15% to 95%. More fundamentally, "developers" implies generalizability to a population that a convenience sample of five does not represent. An honest rewrite: "In three of five sessions, participants had difficulty at this step." That is a weaker claim, but it is accurate, and it is still useful for deciding whether to redesign the step.
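The width of that interval can be checked with a short calculation. The sketch below assumes the exact (Clopper-Pearson) method for a binomial proportion and finds each bound by bisection using only the standard library; the function names `binom_cdf` and `exact_interval` are illustrative, and other interval methods (Wilson, for example) give somewhat different bounds.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_interval(k: int, n: int, confidence: float = 0.95):
    """Clopper-Pearson interval for k occurrences in n observations."""
    tail = (1 - confidence) / 2

    def bisect(f):
        # f must be increasing on [0, 1]; find x where f(x) == tail.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < tail:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: smallest p for which seeing k or more is not too unlikely.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p))
    # Upper bound: largest p for which seeing k or fewer is not too unlikely.
    upper = 1.0 if k == n else 1 - bisect(lambda q: binom_cdf(k, n, 1 - q))
    return lower, upper


low, high = exact_interval(3, 5)
print(f"3 of 5: roughly {low:.0%} to {high:.0%}")
```

An interval this wide is compatible with the problem affecting a small minority or nearly everyone, which is why the count ("three of five sessions") is the more honest way to report it.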

Why is a think-aloud protocol often more informative than asking participants what they found difficult after the session ends?

Post-task recall is unreliable in several ways. Participants forget specific moments of confusion, especially when they eventually succeeded: frustration that lasted three minutes feels minor in retrospect if the task was completed. Participants tend to summarize overall impressions rather than describe specific events, and they often rationalize behavior they engaged in without much deliberation. The think-aloud captures real-time reactions, including confusion that was later resolved and incorrect assumptions that happened not to cause failure. These are exactly the phenomena you need to understand if you want to reduce friction in the workflow.

A manager hears about your five-session study and asks: "How do you know this applies to all our developers?" What is the correct response?

Concede the limitation immediately: "We don't. Five sessions tell us these problems exist; they don't tell us how common the problems are across the full team." Then redirect to the decision: "What we know is that these are real problems we observed. A larger study would tell us how widespread they are. If that matters before we decide, we should plan one. If we're comfortable that these are worth fixing regardless of frequency, the next step is to fix them." Defending the generalizability of a five-person convenience sample is a losing argument. Framing the evidence accurately and asking what decision it supports is more productive.

Exercises

Design a Think-Aloud Session (20 minutes)

Your team is considering switching from your current code review tool to a new one. Write a plan for a 45-minute think-aloud session: specify three tasks you would ask participants to complete, write a short recruitment message you would send to five colleagues in a different team, and design a simple note-taking template that would help you record hesitations, errors, and successful completions without overwhelming you during the session itself. Identify one element of your design that might bias your observations and describe how you would address it.

Write the Findings (20 minutes)

You observed four developers using an AI coding assistant for code review tasks. One accepted every suggestion without reading it. One read each suggestion carefully and accepted about half. One switched to a different suggestion mode halfway through, after becoming frustrated with the default. One disabled the assistant after the second suggestion and completed the task manually. Write the findings section of a two-page report aimed at an engineering manager who is deciding whether to mandate the tool for the team. Include: what you observed, what the observations suggest, what you cannot conclude from four sessions, and what you would recommend as a next step.

Respond to a Skeptic (15 minutes)

A colleague says: "This is too informal to be useful. If we're going to study whether this tool helps, we need to do it properly—a controlled experiment, random assignment, statistical analysis. Anything less is just anecdote." Write two paragraphs in response. The first should acknowledge the legitimate concern behind this position. The second should explain under what conditions informal evidence is more useful than waiting for the resources to conduct a formal study.