Qualitative Methods: Interviews and Surveys
In 2024, a team led by Jenny Liang surveyed 410 professional developers across industries about their experience with AI programming assistants [Liang2024]. They did not just ask "is this useful?" They asked about specific scenarios, frustrations, workarounds, and the kinds of tasks where developers had learned to distrust the tools. The result was a detailed map of where AI assistance actually helps and where it gets in the way—knowledge that would have been invisible to a study that only measured task completion times.
Qualitative methods are how you find out what is actually going on when numbers alone cannot tell you.
When Qualitative Methods Are the Right Choice
- When you do not yet know what to measure—qualitative work often precedes quantitative work by identifying the right variables
- When the phenomenon involves meaning, interpretation, or context that a number cannot capture
- When your sample is small or access is limited (surveys of 10 people are not statistically powerful, but interviews with 10 people can be rich)
- When you want to understand why a pattern exists, not just that it does
- When stakeholders need to be heard rather than counted
Designing Interviews
- Structured interviews use a fixed script; semi-structured interviews have a guide but allow follow-up; unstructured interviews follow the participant's lead
- Semi-structured is the most common in software engineering research: consistent enough to compare across participants, flexible enough to surface surprises
- Open questions invite narrative: "Tell me about a time when the AI suggestion was wrong in a way that surprised you"
- Closed questions invite classification: "On a scale of 1 to 5, how often do you accept AI suggestions without reading them?"
- Probing follows up on what a participant says: "You mentioned you don't trust it for security-related code—can you say more about that?"
- Avoid leading questions: "Don't you find it faster?" assumes the answer
- Avoid hypothetical questions: "Would you use X if Y?" People are poor predictors of their own behavior
Designing Surveys
- Surveys trade depth for breadth: you can reach many more people but get shallower data from each
- Likert scales ("Strongly agree" to "Strongly disagree") are common but have known problems:
- People cluster toward the middle or toward the extremes
- The same label means different things to different people
- Use at least 5 points; 7 points gives more resolution
- Response bias: people answer differently depending on question order, framing, and what they think the researcher wants to hear
- Sampling strategy determines who you can generalize to:
- Convenience samples (whoever responds to a Twitter post) are common and usually biased toward engaged, technical users
- Stratified samples divide the population into subgroups and draw from each one, typically in proportion to its size, so that no subgroup is missed by chance
- Pilot your survey with 3-5 people and revise before distributing
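The stratified sampling idea above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `stratified_sample`, the `role` field, and the 60/30/10 sampling frame are all hypothetical, and the rule of taking at least one person per stratum is an assumption made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(people, stratum_of, n, seed=0):
    """Draw roughly n people, giving each stratum slots in proportion to its size."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    strata = defaultdict(list)
    for person in people:
        strata[stratum_of(person)].append(person)

    total = len(people)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Proportional allocation; the floor of one per stratum is an assumption.
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical sampling frame: 60 backend, 30 frontend, 10 embedded developers.
frame = (
    [{"id": i, "role": "backend"} for i in range(60)]
    + [{"id": i, "role": "frontend"} for i in range(60, 90)]
    + [{"id": i, "role": "embedded"} for i in range(90, 100)]
)

picked = stratified_sample(frame, lambda p: p["role"], n=20)
counts = Counter(p["role"] for p in picked)
print(dict(counts))  # {'backend': 12, 'embedded': 2, 'frontend': 6}
```

Because of rounding and the one-per-stratum floor, the total can differ slightly from `n` when strata are small or numerous. A convenience sample has no equivalent of `frame`: you cannot allocate proportionally to a population you have not enumerated.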
Thematic Analysis
- Thematic analysis is the standard approach for analyzing qualitative data
- Open coding: read through the data and tag segments with descriptive labels (e.g., "trust in AI suggestions," "time saved on boilerplate")
- Use gerund coding: frame codes as actions rather than nouns, so "avoiding AI for security tasks" rather than "AI distrust"—this preserves the temporal and causal structure of what participants are actually doing [Thornberg2014]
- A theme is not a bucket for related quotes: it should express a claim you could write as a sentence, such as "developers trust AI output for syntax but not for logic" [Braun2019]
- Strong themes name tensions as well as patterns: "trusting the output while suspicious of the process" is more informative than "attitudes toward AI"
- Axial coding: group codes into higher-level themes and examine relationships between them
- Ask which themes co-occur across participants, and why
- Look for conditions, contexts, and consequences: what makes a theme appear, in what setting, and with what effect on the participant
- If a code keeps appearing alongside very different themes, it may be doing double duty—split it and check whether each half is coherent on its own
- Separate participants' hypotheses and actions from their emotions and evaluations: "I think AI will replace junior devs" is a claim; "I hate how deskilled it makes me feel" is an expression of affect
- Saturation: stop collecting data when new interviews or responses stop introducing new codes—a practical definition of "enough data"
- Code with a second researcher and measure intercoder reliability (Cohen's kappa) to reduce individual bias
- Keep an audit trail: document coding decisions so others can evaluate them
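Intercoder reliability, mentioned above, can be made concrete with a small calculation. The sketch below computes Cohen's kappa for two coders using the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). It assumes each segment receives exactly one code from each coder, which is a simplification; the labels and the ten segments are invented for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who each assign one label per segment."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of segments")

    n = len(coder_a)
    # Observed agreement: share of segments where the two labels match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # given how often each coder used each label.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten interview segments.
coder_a = ["trust"] * 5 + ["speed"] * 5
coder_b = ["trust"] * 4 + ["speed"] * 5 + ["trust"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))  # 0.6
```

Here the coders agree on 8 of 10 segments, but half of that agreement would be expected by chance with two equally frequent labels, so kappa is 0.6 rather than 0.8. Disagreements are worth discussing and recording in the audit trail, not just counting.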
Experience Sampling
- Experience sampling prompts participants at random or scheduled intervals during their workday to report their current task, perceived state, or productivity
- It captures in-the-moment data that retrospective surveys miss: a developer's sense of how productive a day was often differs from what they reported each hour during that day
- A field study using hourly self-reports found that developers could be grouped into six types (focused, social, lone, balanced, leading, goal-oriented) and that productivity patterns varied substantially by time of day and by individual [Meyer2017]
- When measuring subjective states, use validated instruments rather than writing your own questions: scales like SPANE (Scale of Positive and Negative Experience) have known psychometric properties; a five-item Likert question drafted for a single study does not
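To illustrate the "random intervals" option described above, here is a minimal sketch of a prompt scheduler. Everything specific in it is an assumption made for the example: the function name `prompt_times`, a 09:00 to 17:00 workday, six prompts, and a 30-minute minimum gap between prompts so participants are not interrupted twice in quick succession.

```python
import random
from datetime import datetime, timedelta


def prompt_times(day_start, day_end, n_prompts, min_gap_minutes=30, seed=None):
    """Pick n_prompts random times in the workday, at least min_gap_minutes apart."""
    rng = random.Random(seed)
    total = int((day_end - day_start).total_seconds() // 60)

    # Reserve the mandatory gaps first, then spread prompts over what is left.
    slack = total - (n_prompts - 1) * min_gap_minutes
    if slack < 0:
        raise ValueError("workday is too short for that many prompts")

    offsets = sorted(rng.randint(0, slack) for _ in range(n_prompts))
    # Adding i * gap to the i-th sorted offset guarantees the minimum spacing.
    return [
        day_start + timedelta(minutes=off + i * min_gap_minutes)
        for i, off in enumerate(offsets)
    ]


# Hypothetical workday for one participant.
start = datetime(2024, 1, 8, 9, 0)
end = datetime(2024, 1, 8, 17, 0)
times = prompt_times(start, end, n_prompts=6, seed=1)
for t in times:
    print(t.strftime("%H:%M"))
```

Random timing makes it harder for participants to anticipate prompts and adjust their behavior; a fixed hourly schedule is simpler to run and compare across people. Which trade-off is right depends on the study.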
Triangulation
- Triangulation means combining multiple data sources or methods to increase confidence in a finding
- If your interviews, your survey, and your observation of developers at work all point to the same conclusion, that conclusion is more credible than if only one source supports it
- Types of triangulation: data (multiple sources), investigator (multiple coders), method (qualitative + quantitative), theory (multiple interpretive frameworks)
Common Mistakes
- Leading questions: "Most developers find AI helpful—would you agree?"
- Convenience sampling: surveying your own users, your own colleagues, or respondents to a post in a community that already agrees with you
- Confirmation bias: stopping analysis when you have enough data to support your hypothesis rather than when you reach saturation
- Overgeneralization: "Developers find AI tools frustrating" from a study of 20 junior developers at one company
- Ignoring non-response: if only 10% of people surveyed respond, the 90% who did not may have very different views
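One way to guard against the confirmation-bias mistake above is to make the saturation stopping rule explicit before analysis begins. The sketch below counts how many previously unseen codes each interview contributes and declares saturation when several interviews in a row add none. The window of three interviews is an assumption chosen for illustration, not an established threshold, and the codes are invented.

```python
def new_codes_per_interview(coded_interviews):
    """For each interview, in order, count the codes not seen in any earlier one."""
    seen = set()
    new_counts = []
    for codes in coded_interviews:
        fresh = set(codes) - seen
        new_counts.append(len(fresh))
        seen |= fresh
    return new_counts


def reached_saturation(new_counts, window=3):
    """True if the last `window` interviews each introduced zero new codes."""
    return len(new_counts) >= window and all(c == 0 for c in new_counts[-window:])


# Hypothetical codes assigned to six interviews, in collection order.
interviews = [
    {"avoiding AI for security tasks", "accepting boilerplate"},
    {"accepting boilerplate", "rereading suggestions"},
    {"rereading suggestions", "switching off suggestions"},
    {"accepting boilerplate"},
    {"avoiding AI for security tasks", "rereading suggestions"},
    {"switching off suggestions"},
]

new_counts = new_codes_per_interview(interviews)
print(new_counts)                      # [2, 1, 1, 0, 0, 0]
print(reached_saturation(new_counts))  # True
```

The rule depends only on what the data contains, not on whether the findings so far support the hypothesis, which is the point of fixing it in advance.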
Misconceptions
- Qualitative research is just asking people what they think.
- Rigorous qualitative work involves systematic data collection, structured analysis (open coding, axial coding), documentation of decisions, and assessment of intercoder reliability. Asking a few colleagues over lunch is not a study.
- More interviews always produce better qualitative results.
- Depth matters more than volume. A study with twenty rich, well-analyzed interviews reaching saturation is more informative than one with a hundred superficial ones that never probe below the surface.
- A high response count makes a survey representative.
- Representativeness depends on who responds relative to who you want to generalize to—not on the raw number of responses. A million responses from a self-selected online audience is still a biased convenience sample.
- Thematic analysis is subjective and therefore unreliable.
- The subjectivity of interpretation is a known feature, not a flaw: qualitative researchers manage it through audit trails, multiple coders, and transparent documentation of how codes and themes were derived.
- Any descriptive label counts as a valid code.
- A code named "AI" or "trust" is a noun bucket, not an analysis. Good codes capture what a participant is doing: "switching off AI suggestions after a bad experience" tells you something; "negative AI attitude" does not. The same principle applies to themes: a theme that cannot be expressed as a claim is a filing category, not a finding.
Check Understanding
What is the difference between open coding and axial coding in thematic analysis?
Open coding is the first pass through the data, where you tag individual segments with descriptive labels close to what the participant actually said. Axial coding is a second-order process where you group those labels into higher-level themes and begin to examine how the themes relate to each other. Open coding is inductive and close to the data; axial coding is more interpretive and moves toward an explanatory structure.
A researcher surveyed developers by posting a link in a popular programming subreddit and got 800 responses. They concluded that "the majority of developers are satisfied with AI coding tools." Identify two specific problems with this conclusion.
First, the sample is a convenience sample skewed toward developers who actively participate in that community, who are likely more technically engaged and more likely to already use AI tools than average. Second, the phrasing "majority of developers" implies generalizability to a population (all developers) that the sample does not represent. The conclusion should be "the majority of respondents to this survey were satisfied"—a much weaker claim. A third problem: people who are satisfied are more likely to respond to a survey about satisfaction (self-selection bias).
Why is "Would you use an AI coding assistant if it were integrated into your IDE?" a poor interview question?
It is a hypothetical question, and people are unreliable predictors of their own future behavior. Developers may say yes because the scenario sounds appealing, but their actual behavior when faced with the tool may differ significantly. A better question asks about past or present behavior: "Tell me about the last time you used an AI coding assistant. What did you do with the suggestion it gave you?"
The following interview question contains a flaw. Identify it and rewrite the question: "Given that AI tools can generate boilerplate code automatically, how much time do you think you save using them?"
The question is leading: its opening clause ("Given that AI tools can generate boilerplate code automatically") presupposes that the tools save time, and the question then asks the participant only to quantify that saving. A respondent who does not save time, or who finds that the tools slow them down, is implicitly pushed toward a positive answer. A better version: "When you use AI coding tools, what effect do they have on how long tasks take you? Can you give me a specific recent example?"
Exercises
Write an Interview Guide (20 minutes)
Write a semi-structured interview guide (6-8 questions plus follow-up probes) for a study on how developers decide when to accept or reject AI code suggestions. Include at least one open question and at least one probe. Then identify one question from your first draft that was leading and explain how you revised it.
Code This Excerpt (15 minutes)
Apply open coding to the following interview excerpt. Identify at least four distinct codes, quote the specific text that led to each code, and then group your codes into two higher-level themes.
"I use it mostly for stuff I already know how to do—like if I need to write a regex or remember the syntax for something in a library I don't use often. But for the core logic of whatever I'm building, I don't trust it. It'll give you something that looks right but misses an edge case, and you won't notice until production. I've started just not using it for anything security-related at all."
Evaluate a Survey (15 minutes)
Find the methods section of Liang et al. 2024 ("A Large-Scale Survey on the Usability of AI Programming Assistants") or another published survey of developer experience with AI tools. Identify the sampling strategy, the response rate (if reported), and one specific design choice that reduces bias. Then identify one limitation the authors acknowledge and one they do not.