A controlled experiment in which participants or users are randomly
assigned to one of two variants to measure which performs better on a
defined metric.
affinity mapping
An analysis technique in which individual observations are written on
separate notes and then grouped into clusters that share an underlying
cause or theme, with each cluster named as a claim rather than a label.
agile manifesto
The 2001 statement of four values and twelve principles for agile
software development, emphasizing individuals and interactions, working
software, customer collaboration, and responding to change.
alternative hypothesis
The hypothesis that an effect exists; denoted H₁. Rejecting the null
hypothesis is taken as support for H₁, not as proof of it.
axial coding
A stage in grounded theory analysis in which open codes are grouped
into higher-level themes and the relationships between categories are
examined.
B
blinding
An experimental design feature in which participants (single-blind),
experimenters, or both (double-blind) do not know which condition a
participant is in, reducing expectation bias.
blocking variable
A variable known to affect the outcome but not of experimental interest;
controlled by grouping experimental units into blocks of similar values
before randomization, guaranteeing that the nuisance factor's effect is
distributed evenly across treatment conditions.
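As a rough illustration of blocking, the sketch below groups units by a
nuisance factor and then randomizes within each block. The function name
blocked_assignment, the block_of callback, and the condition labels are
hypothetical names chosen for this example, not part of any standard
library.

```python
import random
from collections import defaultdict


def blocked_assignment(units, block_of, conditions=("control", "treatment"), seed=0):
    """Randomly assign units to conditions separately within each block.

    block_of is a hypothetical callback returning the blocking variable's
    value for a unit (for example, an experience level).
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible

    # Group units that share the same value of the blocking variable.
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Deal the shuffled units round-robin so each block is split as
        # evenly as its size allows.
        for position, unit in enumerate(shuffled):
            assignment[unit] = conditions[position % len(conditions)]
    return assignment


# Made-up units: (experience level, participant number).
units = [("junior", i) for i in range(4)] + [("senior", i) for i in range(4)]
assignment = blocked_assignment(units, block_of=lambda unit: unit[0])
```

With four units per block and two conditions, each block contributes two
units to each condition, so experience level cannot line up with treatment
by chance.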
Bonferroni correction
A method for adjusting significance thresholds when performing multiple
statistical tests, dividing the desired alpha level by the number of
comparisons to control the family-wise error rate.
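The arithmetic can be sketched in a few lines. The function names and the
p-values below are invented for illustration only.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Four made-up p-values; the corrected threshold is 0.05 / 4 = 0.0125.
p_values = [0.004, 0.020, 0.030, 0.200]
print(significant_after_correction(p_values))  # [True, False, False, False]
```

Two of these values would pass an uncorrected 0.05 threshold but fail the
corrected one, which is the conservatism the method trades for error
control.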
boredom effect
A threat to internal validity in which participants' performance or
attention degrades over time due to fatigue or disengagement, rather
than any effect of the treatment.
C
change failure rate
The proportion of deployments that result in degraded service or
require remediation; one of the four DORA metrics used to measure
software delivery performance.
claim
An assertion about the world that a study attempts to support or refute
through evidence.
Cohen's d
A measure of effect size for comparing two means, expressed in units of
pooled standard deviation. Conventionally, d = 0.2 is small, d = 0.5 is
medium, and d = 0.8 is large.
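A minimal sketch of the calculation, assuming two independent groups and
the usual pooled sample standard deviation; the function name cohens_d and
the example data are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Made-up scores: the means differ by 1 and the pooled standard deviation is 2.
d = cohens_d([2, 4, 6], [1, 3, 5])
```

Here d works out to 0.5, a medium effect by the convention above.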
conclusion validity
The degree to which a study's statistical analysis is appropriate and
adequately powered to detect the effect it reports; threatened by
underpowered designs, violated test assumptions, and uncorrected
multiple comparisons.
confirmation bias
The tendency to search for, interpret, and recall information in ways
that confirm pre-existing beliefs, distorting the analysis of data.
confirmatory analysis
Analysis designed to test a pre-specified hypothesis using data
collected expressly for that purpose, as opposed to exploratory analysis
which searches for patterns without a prior hypothesis.
confounding variable
A variable that is associated with both the independent variable and the
outcome, potentially explaining an observed relationship without there
being a direct causal link.
construct validity
The degree to which a measurement actually captures the concept it is
intended to represent.
control group
The group in an experiment that does not receive the treatment; used as
a baseline for comparison.
controlled experiment
A study in which one or more independent variables are manipulated while
other factors are held constant, allowing causal claims to be made.
convenience sampling
Selecting participants based on ease of access rather than random
selection, making the sample easier to recruit but harder to generalize
from.
D
dependent variable
The outcome measured in an experiment; the variable expected to change
in response to manipulation of the independent variable.
deployment frequency
How often an organization successfully deploys code to production; one
of the four DORA metrics used to measure software delivery performance.
difference-in-differences
A quasi-experimental design that compares the change in outcome for a
treated group to the change for a control group over the same period.
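The estimate itself is simple arithmetic on four group averages. The sketch
below uses a hypothetical function name and made-up numbers, and it assumes
the two groups would have followed parallel trends without the treatment.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Made-up averages: the treated group rises by 6, the control group by 2,
# so the estimated treatment effect is 4.
estimate = difference_in_differences(
    treated_before=10, treated_after=16,
    control_before=9, control_after=11,
)
```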
double-blind study
An experiment in which neither the participants nor the experimenters
interacting with them know which condition each participant is in,
reducing both participant expectation effects and experimenter bias.
E
effect size
A measure of the magnitude of an effect, independent of sample size.
Examples include Cohen's d, odds ratio, and Pearson's r.
empirical software engineering
The subfield of software engineering that uses systematic data
collection and analysis to study how software is built, maintained, and
used.
evidence
Data or observations gathered through study to support or refute a
claim, which varies in quality and reliability depending on how it was
collected.
experience sampling method
A research technique in which participants are prompted at random or
scheduled intervals during their normal activities to report their current
task, state, or perceptions, capturing in-the-moment data rather than
retrospective recall.
exploratory analysis
Analysis that searches for patterns, relationships, or hypotheses in
data without pre-specified predictions; used to generate rather than test
hypotheses.
external validity
The degree to which study findings generalize beyond the specific sample,
setting, and task studied.
F
file drawer problem
The tendency for null results to go unpublished, causing the published
literature to overrepresent positive findings.
formative evaluation
Evaluation conducted during development to identify problems and guide
improvement, as opposed to summative evaluation which assesses a
completed product. Formative studies typically use small samples and
qualitative methods; their goal is to find things to fix, not to
measure how often problems occur.
funnel plot
A scatter plot of effect size against a measure of precision, such as
sample size or standard error, used in meta-analysis; asymmetry suggests
publication bias.
G
gerund coding
A practice from grounded theory in which codes are written as verb
phrases rather than nouns, capturing the actions, choices, and processes
participants describe rather than labeling a static category.
Goal-Question-Metric (GQM)
A structured approach to defining measurements in which a study goal is
decomposed into questions whose answers would indicate success, and each
question is linked to a specific operationalized metric.
Goodhart's Law
The principle that when a measure becomes a target, it ceases to be a good
measure, because people optimize the measured variable rather than the
underlying goal it was meant to track.
grounded theory
A qualitative research methodology in which theory is developed inductively
from data through iterative coding and constant comparison, rather than
testing a predetermined hypothesis.
guerrilla research
Informal user research conducted without formal institutional support,
using convenience samples recruited opportunistically and methods that
prioritize speed and low cost over statistical rigor. Appropriate for
formative evaluation; not appropriate for making quantitative claims
about populations.
H
habituation
A decrease in response to a repeated stimulus over time; it threatens
internal validity when the fading response is mistaken for an effect of
the treatment.
HARKing
Hypothesizing After Results are Known: presenting exploratory findings
as if they had been predicted in advance, inflating false positive rates.
Hawthorne effect
The tendency for people to change their behavior when they know they
are being observed, independent of any intervention being studied.
Named for a series of workplace studies at the Hawthorne Works factory
in the 1920s and 1930s, it causes measured performance to differ from
typical performance and is a threat to internal validity in studies
where participants know they are being watched.
I
independent variable
The variable that is manipulated or selected by the researcher to
examine its effect on an outcome.
informed consent
A research ethics requirement that participants know what data is being
collected, how it will be used, and that they can withdraw without penalty.
intercoder reliability
The degree of agreement between two or more researchers independently
applying the same coding scheme to qualitative data. Commonly measured
with Cohen's kappa.
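A minimal sketch of Cohen's kappa for two coders, assuming each coder
assigned exactly one code per segment; the function name and the example
labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two coders, corrected for agreement expected by chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)

    # Proportion of items on which the two coders chose the same code.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each code, the product of the two coders'
    # marginal proportions, summed over codes.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)

    if expected == 1:
        raise ValueError("kappa is undefined when both coders use a single identical code")
    return (observed - expected) / (1 - expected)


# Made-up codes for four segments: observed agreement is 0.75 and chance
# agreement is 0.5, so kappa is 0.5.
kappa = cohens_kappa(["x", "x", "y", "y"], ["x", "y", "y", "y"])
```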
internal survey
A survey administered within a single organization to collect data from
employees, commonly used to measure team practices, tool adoption, or
workplace satisfaction.
internal validity
The degree to which a study can support the conclusion that the treatment
caused the observed outcome, as opposed to some other explanation.
interrupted time series
A quasi-experimental design that looks for a change in trend at the point
when an intervention occurred, using data from before and after.
J
K
L
lead time for changes
The elapsed time between a code change being committed and that change
running in production; one of the four DORA metrics used to measure
software delivery performance.
leading question
A question worded in a way that suggests or implies a particular answer,
potentially biasing respondents' replies.
learning effect
A threat to internal validity in within-subjects designs: participants
improve on a later task or condition simply through practice, making the
later treatment appear more effective regardless of its merit.
Likert scale
A survey response format using ordered categories such as "Strongly
agree" to "Strongly disagree," typically with 5 or 7 points.
longitudinal study
A study that collects data from the same subjects at multiple points
over an extended period, enabling researchers to track changes and
developmental trends over time.
M
maturation effect
A threat to internal validity in which participants naturally change
over the course of a study (e.g., growing older, more tired, or more
experienced) independently of the treatment.
meta-analysis
A statistical technique for combining results from multiple independent
studies of the same question to estimate an overall effect size.
mining software repositories (MSR)
A research approach that extracts and analyzes data from version control
systems, issue trackers, code review tools, and related sources.
N
natural experiment
A study that exploits real-world variation that approximates random
assignment, without the researcher directly manipulating any variable.
non-response bias
A bias that arises when researchers analyze only survey respondents
without accounting for those who did not reply, which can skew results
if non-respondents differ systematically from respondents.
novelty effect
A threat to internal validity in which any new tool, method, or process
receives a temporary performance boost because participants are motivated
by its novelty, inflating the apparent benefit of the intervention.
nuisance variable
A variable that can affect the outcome but is not of primary interest;
controlled through randomization, blocking, or statistical adjustment
to isolate the effect of interest.
null hypothesis
The hypothesis that there is no effect; denoted H₀. Statistical testing
evaluates whether the data is inconsistent with the null hypothesis.
null result
A study outcome in which no statistically significant effect is found;
also called a negative result.
O
observational study
A study in which the researcher measures variables as they naturally
occur, without manipulating any conditions.
open coding
The initial stage in qualitative data analysis in which the researcher
reads through data and attaches descriptive labels (codes) to segments
of text.
operationalization
The process of turning an abstract concept or research construct into
a specific, measurable variable that can be observed and recorded in
a study.
order effects
Changes in participant responses caused by the sequence in which
conditions are presented, such as practice or fatigue, rather than
the conditions themselves; a threat to internal validity in
within-subjects designs.
overgeneralization
Drawing conclusions that extend beyond what the data actually support,
typically by ignoring important differences in context, population, or
sample.
P
p-hacking
Trying multiple analyses or data subsets until p < 0.05 is achieved,
then reporting only that analysis, inflating the false positive rate.
p-value
The probability of observing data at least as extreme as the observed
data, assuming the null hypothesis is true. Not the probability that the
null hypothesis is true.
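One way to make this definition concrete is an exact two-sided permutation
test, sketched below. It is only one of many ways to obtain a p-value; the
function name and data are illustrative, and enumerating every split is
practical only for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Share of all relabelings whose mean difference is at least as
    extreme as the observed one (two-sided)."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))

    extreme = 0
    total = 0
    # Under the null hypothesis the group labels are arbitrary, so every
    # way of splitting the pooled data into the same group sizes is
    # equally likely.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point rounding in ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Made-up data: of the 20 possible splits, only 2 are as extreme as the
# observed one, so the p-value is 0.1.
p = permutation_p_value([1, 2, 3], [4, 5, 6])
```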
pre-registration
A practice in which researchers publicly commit to their hypotheses and
analysis plan before collecting data, reducing the risk of HARKing and
p-hacking.
preprint
A version of a research paper that has been posted publicly
before it has undergone formal peer review and official journal publication.
proxy metric
A measurable quantity used as a stand-in for a concept that is harder
to measure directly.
publication bias
The tendency for journals to publish studies with statistically
significant results more often than studies with null results.
Q
qualitative methods
Research approaches that collect and analyze non-numerical data such as
interviews, observations, and documents in order to understand meaning,
context, and experience.
quantitative methods
Research approaches that collect and analyze numerical data to describe
patterns, test hypotheses, or estimate effect sizes.
quasi-experimental design
A research design that compares groups without random assignment,
relying on pre-existing groups or naturally occurring conditions to
estimate treatment effects.
R
randomization
The random assignment of participants to experimental conditions, so
that known and unknown confounders are balanced across groups on
average.
Rapid Iterative Testing and Evaluation (RITE)
A usability testing method in which the most severe problem identified
in each session is addressed before the next session begins, enabling
rapid improvement with small participant pools. Appropriate for
formative evaluation; not appropriate for comparing product versions or
producing generalizable estimates of performance.
retrospective analysis
An analysis of data collected before the research question was
formulated, such as examining historical logs, commit records, or past
surveys.
S
sampling strategy
The method used to select participants from a population. Strategies
include convenience sampling, random sampling, and stratified sampling.
saturation
In qualitative research, the point at which collecting additional data
no longer introduces new codes or themes.
selection bias
Bias introduced when the sample is not representative of the population
the researcher intends to generalize to.
semi-structured interview
An interview that follows a predetermined set of questions but allows
the interviewer to probe further or deviate based on participant
responses.
single-blind study
An experiment in which participants do not know which condition they are in,
but the experimenters do. This reduces participant bias but leaves
experimenter influence as a possible threat.
social desirability bias
The tendency for survey or interview respondents to give answers they
believe are more socially acceptable rather than truthful responses,
distorting self-reported data.
statistical power
The probability that a study will detect an effect if one truly exists;
conventionally targeted at 0.80.
statistical significance
A determination that observed data is unlikely under the null
hypothesis, typically using a threshold of p < 0.05; does not
imply practical importance or a large effect.
stratified sampling
A sampling method in which the population is divided into subgroups
(strata) and participants are drawn from each subgroup to ensure the
sample reflects key population characteristics.
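A minimal sketch of proportional stratified sampling follows. The function
name stratified_sample, the stratum_of callback, and the rule of taking at
least one unit per stratum are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum.

    stratum_of is a hypothetical callback returning the subgroup a unit
    belongs to.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Partition the population into strata.
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for members in strata.values():
        # Assumption for this sketch: take at least one unit per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up population: 80 junior and 20 senior developers.
population = [("junior", i) for i in range(80)] + [("senior", i) for i in range(20)]
sample = stratified_sample(population, stratum_of=lambda unit: unit[0], fraction=0.1)
```

A 10% draw yields 8 junior and 2 senior developers, matching the 80/20
split of the population, which a simple random sample of 10 would not
guarantee.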
structured interview
An interview in which every participant is asked exactly the same
questions in the same order, enabling systematic comparison across
respondents.
study
A systematic attempt to collect and analyze evidence in order to test a
claim or answer a research question.
summative evaluation
Evaluation conducted after development is complete to assess whether a
product meets its goals, as opposed to formative evaluation conducted
during development to guide improvement. Summative studies require
controlled conditions and sufficient sample sizes to support quantitative
claims.
survivorship bias
Bias introduced by analyzing only cases that survived a selection process,
ignoring cases that did not.
T
task-based testing
A research method in which participants are given specific, realistic
tasks to complete with a product or system while the researcher observes
and records their behavior, as opposed to open-ended exploration that
produces impressions rather than observable actions.
thematic analysis
A qualitative analysis method in which data is coded and grouped into
themes through iterative passes.
think-aloud protocol
A data collection technique in which participants verbalize their
thoughts, decisions, and reactions as they perform tasks, providing
access to real-time reasoning that retrospective reporting misses.
Commonly used in usability testing and qualitative studies of how
people interact with tools and interfaces.
time to restore service
The time it takes to recover normal operation after a failure in
production; one of the four DORA metrics used to measure software
delivery performance.
treatment group
The group in an experiment that receives the intervention or
manipulation being studied; compared against the control group to
estimate the treatment effect.
triangulation
The use of multiple data sources, methods, or investigators to increase
confidence in a finding.
U
unconscious formalization
The tendency to treat informal, context-dependent practices as if they
were formal, rule-based procedures, leading to oversimplified models
of how work is actually done.
unstructured interview
An interview with no fixed list of questions, guided entirely by the
participant's responses, used to explore topics in depth without
constraining the direction of conversation.
V
W
winner's curse
The phenomenon in which studies that achieve statistical significance
tend to overestimate the true effect size, because extreme results
are more likely to cross the significance threshold.
within-subjects design
An experimental design in which each participant experiences all
conditions, allowing comparison of outcomes within the same
individual. This increases statistical power but is vulnerable to order
effects.