A controlled experiment in which participants or users are randomly
assigned to one of two variants to measure which performs better on a
defined metric.
alternative hypothesis
The hypothesis that an effect exists; denoted H₁. Rejecting the null
hypothesis is taken as support for the alternative.
axial coding
A stage in grounded theory analysis in which open codes are grouped
into higher-level categories and the relationships between those
categories are examined.
B
blinding
An experimental design feature in which participants (single-blind),
experimenters, or both (double-blind) do not know which condition a
participant is in, reducing expectation bias.
blocking variable
A variable known to affect the outcome but not of experimental interest;
controlled by grouping experimental units into blocks of similar values
and then randomizing within each block, so that the nuisance factor is
balanced across treatment conditions.
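As a minimal sketch of blocking, the snippet below randomizes within each
block so that every block contributes to every condition. The function
name `blocked_assignment`, the participant names, and the experience
levels are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def blocked_assignment(units, block_of, conditions=("control", "treatment"), seed=0):
    """Randomly assign units to conditions separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        # Deal conditions round-robin so each block is split as evenly as possible.
        for i, unit in enumerate(members):
            assignment[unit] = conditions[i % len(conditions)]
    return assignment


# Hypothetical participants, blocked on experience level (the nuisance factor).
experience = {"ana": "junior", "bo": "junior", "cy": "senior", "di": "senior"}
assignment = blocked_assignment(list(experience), experience.get)
```

Here each experience level ends up with one participant per condition, so
experience cannot differ between the control and treatment groups.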
C
change failure rate
The proportion of deployments that result in degraded service or
require remediation; one of the four DORA metrics used to measure
software delivery performance.
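A minimal sketch of the calculation, assuming each deployment record
carries a boolean `failed` flag. The record layout and the function name
`change_failure_rate` are assumptions; real deployment tooling will
record failures differently.

```python
def change_failure_rate(deployments):
    """Fraction of deployments flagged as failed."""
    if not deployments:
        raise ValueError("no deployments to summarize")
    failed = sum(1 for d in deployments if d["failed"])
    return failed / len(deployments)


# Hypothetical deployment log: one of four deployments needed remediation.
deployments = [
    {"id": "d1", "failed": False},
    {"id": "d2", "failed": True},
    {"id": "d3", "failed": False},
    {"id": "d4", "failed": False},
]
rate = change_failure_rate(deployments)
```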
claim
An assertion about the world that a study attempts to support or refute
through evidence.
Cohen's d
A measure of effect size for comparing two means, expressed in units of
pooled standard deviation. Conventionally, d = 0.2 is small, d = 0.5 is
medium, and d = 0.8 is large.
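A minimal sketch of the computation for two independent samples, using
the pooled sample standard deviation. The function name `cohens_d` and
the sample values are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical task scores for two groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
```

In this toy data both groups have a standard deviation of 2 and the means
differ by 1, so d is 0.5, a medium effect by the conventional thresholds.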
conclusion validity
The degree to which a study's statistical analysis is appropriate and
adequately powered to detect the effect it reports; threatened by
underpowered designs, violated test assumptions, and uncorrected
multiple comparisons.
confirmation bias
The tendency to search for, interpret, and recall information in ways
that confirm pre-existing beliefs, distorting the analysis of data.
confounding variable
A variable that is associated with both the independent variable and the
outcome, potentially explaining an observed relationship without there
being a direct causal link.
construct validity
The degree to which a measurement actually captures the concept it is
intended to represent.
control group
The group in an experiment that does not receive the treatment; used as
a baseline for comparison.
controlled experiment
A study in which one or more independent variables are manipulated while
other factors are held constant, allowing causal claims to be made.
convenience sampling
Selecting participants based on ease of access rather than random
selection, making the sample easier to recruit but harder to generalize
from.
D
dependent variable
The outcome measured in an experiment; the variable expected to change
in response to manipulation of the independent variable.
deployment frequency
How often an organization successfully deploys code to production; one
of the four DORA metrics used to measure software delivery performance.
difference-in-differences
A quasi-experimental design that compares the change in outcome for a
treated group to the change for a control group over the same period.
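The arithmetic can be sketched as follows, assuming the inputs are group
mean outcomes before and after the intervention. The function name
`diff_in_diff` and the numbers are hypothetical, and the estimate is only
meaningful if the two groups would have followed parallel trends without
the intervention.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical mean weekly merged pull requests per team.
estimate = diff_in_diff(
    treated_before=10, treated_after=16,
    control_before=9, control_after=12,
)
```

The treated group rose by 6 and the control group by 3, so the estimated
effect of the intervention is 3.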
directed acyclic graph (DAG)
A diagram of nodes connected by directed edges with no cycles, used to
represent causal assumptions about the relationships between variables.
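One way to write such a graph down and check that it really has no
cycles is sketched below, using the standard-library `graphlib` module
(Python 3.9 or later). The variable names in `causal_graph` and the
helper `is_acyclic` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(parents):
    """`parents` maps each node to the set of nodes with an edge pointing into it."""
    try:
        # static_order() raises CycleError if the graph contains a cycle.
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


# Hypothetical causal assumptions: experience affects both review and defects.
causal_graph = {
    "team_experience": set(),
    "code_review": {"team_experience"},
    "defect_rate": {"code_review", "team_experience"},
}
ok = is_acyclic(causal_graph)
```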
E
effect size
A measure of the magnitude of an effect, independent of sample size.
Examples include Cohen's d, odds ratio, and Pearson's r.
empirical software engineering
The subfield of software engineering that uses systematic data
collection and analysis to study how software is built, maintained, and
used.
evidence
Data or observations gathered through study to support or refute a
claim. Evidence varies in quality and reliability depending on how it
was collected.
experience sampling method
A research technique in which participants are prompted at random or
scheduled intervals during their normal activities to report their current
task, state, or perceptions, capturing in-the-moment data rather than
retrospective recall.
external validity
The degree to which study findings generalize beyond the specific sample,
setting, and task studied.
F
file drawer problem
The tendency for null results to go unpublished, causing the published
literature to overrepresent positive findings.
funnel plot
A scatter plot of effect size against a measure of study precision, such
as sample size or standard error, used in meta-analysis; asymmetry
suggests publication bias.
G
gerund coding
A practice from grounded theory in which codes are written as verb
phrases rather than nouns, capturing the actions, choices, and processes
participants describe rather than labeling a static category.
Goal-Question-Metric (GQM)
A structured approach to defining measurements in which a study goal is
decomposed into questions whose answers would indicate success, and each
question is linked to a specific operationalized metric.
Goodhart's Law
The principle that when a measure becomes a target, it ceases to be a good
measure, because people optimize the measured variable rather than the
underlying goal it was meant to track.
grounded theory
A qualitative research methodology in which theory is developed inductively
from data through iterative coding and constant comparison, rather than
testing a predetermined hypothesis.
H
HARKing
Hypothesizing After Results are Known: presenting exploratory findings
as if they had been predicted in advance, inflating false positive rates.
I
ignoring non-response
A bias that arises when researchers analyze only survey respondents
without accounting for those who did not reply, which can skew results
if non-respondents differ systematically from respondents.
independent variable
The variable that is manipulated or selected by the researcher to
examine its effect on an outcome.
informed consent
A research ethics requirement that participants know what data is being
collected, how it will be used, and that they can withdraw without penalty.
intercoder reliability
The degree of agreement between two or more researchers independently
applying the same coding scheme to qualitative data. Commonly measured
with Cohen's kappa when there are two coders.
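A minimal sketch of Cohen's kappa for two coders, which compares observed
agreement with the agreement expected by chance from each coder's label
frequencies. The function name `cohens_kappa` and the labels are
illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same segments in the same order."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty set of segments")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: probability both coders pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined (division by zero) if both coders used one identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two coders on four interview segments.
kappa = cohens_kappa(["x", "x", "y", "y"], ["x", "y", "y", "y"])
```

The coders agree on 3 of 4 segments (0.75) while chance agreement is 0.5,
giving a kappa of 0.5.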
internal survey
A survey administered within a single organization to collect data from
employees, commonly used to measure team practices, tool adoption, or
workplace satisfaction.
internal validity
The degree to which a study can support the conclusion that the treatment
caused the observed outcome, as opposed to some other explanation.
interrupted time series
A quasi-experimental design that looks for a change in level or trend at
the point when an intervention occurred, using data from before and after.
J
K
L
lead time for change
The elapsed time between a code change being committed and that change
running in production; one of the four DORA metrics used to measure
software delivery performance.
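A minimal sketch of the measurement, assuming commit and deployment
timestamps are already paired for each change. The function name
`median_lead_time_hours` and the timestamps are hypothetical, and
reporting the median is one reasonable choice among several.

```python
from datetime import datetime
from statistics import median


def median_lead_time_hours(changes):
    """`changes` is a list of (committed_at, deployed_at) datetime pairs."""
    hours = [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]
    return median(hours)


# Hypothetical changes with lead times of 4, 24, and 2 hours.
changes = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13)),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9)),
    (datetime(2024, 1, 3, 10), datetime(2024, 1, 3, 12)),
]
lead_time = median_lead_time_hours(changes)
```

The median is less sensitive than the mean to the occasional change that
waits days for release.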
leading question
A question worded in a way that suggests or implies a particular answer,
potentially biasing respondents' replies.
learning effect
A threat to internal validity in within-subjects designs: participants
improve on a later task or condition simply through practice, making the
later treatment appear more effective regardless of its merit.
Likert scale
A survey response format using ordered categories such as "Strongly
agree" to "Strongly disagree," typically with 5 or 7 points.
M
meta-analysis
A statistical technique for combining results from multiple independent
studies of the same question to estimate an overall effect size.
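One common way to combine studies is fixed-effect inverse-variance
pooling, sketched below. This is one technique among several (random-effects
models are another), and the function name `fixed_effect_pool` and the
input values are hypothetical.

```python
from math import sqrt


def fixed_effect_pool(effects, standard_errors):
    """Inverse-variance weighted mean of study effects and its standard error."""
    # More precise studies (smaller standard error) receive more weight.
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Two hypothetical studies with equal precision.
pooled, pooled_se = fixed_effect_pool([0.2, 0.6], [0.1, 0.1])
```

With equal standard errors the pooled effect is the simple average, 0.4,
and its standard error is smaller than that of either study.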
mining software repositories (MSR)
A research approach that extracts and analyzes data from version control
systems, issue trackers, code review tools, and related sources.
N
natural experiment
A study that exploits real-world variation that approximates random
assignment, without the researcher directly manipulating any variable.
novelty effect
A threat to internal validity in which any new tool, method, or process
receives a temporary performance boost because participants are motivated
by its novelty, inflating the apparent benefit of the intervention.
null hypothesis
The hypothesis that there is no effect; denoted H₀. Statistical testing
evaluates whether the data is inconsistent with the null hypothesis.
O
observational study
A study in which the researcher measures variables as they naturally
occur, without manipulating any conditions.
open coding
The initial stage in qualitative data analysis in which the researcher
reads through data and attaches descriptive labels (codes) to segments
of text.
overgeneralization
Drawing conclusions that extend beyond what the data actually supports,
typically by ignoring important differences in context, population, or
sample.
P
p-hacking
Trying multiple analyses or data subsets until p < 0.05 is achieved,
then reporting only that analysis, inflating the false positive rate.
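The inflation can be illustrated with a short calculation, under the
simplifying assumptions that every tested hypothesis is truly null and
that the analyses are independent. The function name
`chance_of_false_positive` is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** num_tests


# One honest test versus twenty attempts on data with no real effect.
single = chance_of_false_positive(1)
twenty = chance_of_false_positive(20)
```

A single test has a 5% chance of a false positive; with twenty attempts
the chance of at least one rises to roughly 64%.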
p-value
The probability of observing data at least as extreme as the observed
data, assuming the null hypothesis is true. Not the probability that the
null hypothesis is true.
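As a worked illustration, the sketch below computes an exact one-sided
p-value for a binomial outcome: the probability of seeing at least the
observed number of successes if the null success rate were true. The
function name `binomial_p_value_upper` and the scenario are hypothetical.

```python
from math import comb


def binomial_p_value_upper(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical result: 9 of 10 participants preferred the new tool.
# The null hypothesis is no preference (p_null = 0.5).
p = binomial_p_value_upper(9, 10)
```

Here p is 11/1024, about 0.011: outcomes this lopsided would be rare if
participants had no preference. It says nothing directly about how likely
the null hypothesis itself is.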
pre-registration
A practice in which researchers publicly commit to their hypotheses and
analysis plan before collecting data, reducing the risk of HARKing and
p-hacking.
proxy metric
A measurable quantity used as a stand-in for a concept that is harder
to measure directly.
publication bias
The tendency for journals to publish studies with statistically
significant results more often than studies with null results.
Q
qualitative methods
Research approaches that collect and analyze non-numerical data such as
interviews, observations, and documents in order to understand meaning,
context, and experience.
quantitative methods
Research approaches that collect and analyze numerical data to describe
patterns, test hypotheses, or estimate effect sizes.
R
randomization
The random assignment of participants to experimental conditions, so
that known and unknown confounders are balanced across groups in
expectation.
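A minimal sketch of simple random assignment to two equal-sized groups.
The function name `randomize` and the participant identifiers are
hypothetical; the seed is only there to make the example reproducible.

```python
import random


def randomize(participants, seed=None):
    """Shuffle participants and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Ten hypothetical participant IDs.
groups = randomize(range(10), seed=1)
```

Because assignment depends only on the shuffle, no participant
characteristic can systematically steer people into one group, although
any single small sample may still be imbalanced by chance.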
retrospective analysis
An analysis of data collected before the research question was
formulated, such as examining historical logs, commit records, or past
surveys.
S
sampling strategy
The method used to select participants from a population. Strategies
include convenience sampling, random sampling, and stratified sampling.
saturation
In qualitative research, the point at which collecting additional data
no longer introduces new codes or themes.
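One simple way to monitor for saturation is to count how many previously
unseen codes each successive interview contributes, as sketched below.
The function name `new_codes_per_interview` and the code labels are
hypothetical, and a run of zeros is a signal to consider rather than a
formal stopping rule.

```python
def new_codes_per_interview(coded_interviews):
    """For each interview in order, count codes not seen in any earlier interview."""
    seen = set()
    counts = []
    for codes in coded_interviews:
        fresh = set(codes) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


# Hypothetical codes assigned to four interviews, in the order conducted.
counts = new_codes_per_interview([
    {"onboarding", "tooling"},
    {"tooling", "review"},
    {"onboarding", "review"},
    {"tooling"},
])
```

The counts here are 2, 1, 0, 0: the last two interviews added nothing new.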
selection bias
Bias introduced when the sample is not representative of the population
the researcher intends to generalize to.
semi-structured interview
An interview that follows a predetermined set of questions but allows
the interviewer to probe further or deviate based on participant
responses.
statistical power
The probability that a study will detect an effect if one truly exists;
conventionally targeted at 0.80.
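A rough sketch of how power relates to effect size and sample size,
using a normal approximation for a two-sided comparison of two group
means. The function name `approx_power` is hypothetical, and the
approximation ignores the small-sample correction that an exact t-test
calculation would include.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Normal approximation; ignores the negligible chance of rejecting
    in the wrong direction.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic when the true standardized effect is effect_size_d.
    noncentrality = effect_size_d * sqrt(n_per_group / 2)
    return z.cdf(noncentrality - critical)


# A medium effect (d = 0.5) with 64 participants per group.
power = approx_power(0.5, 64)
```

Under these assumptions, a medium effect with 64 participants per group
gives power close to the conventional 0.80 target; smaller effects need
considerably larger samples.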
structured interview
An interview in which every participant is asked exactly the same
questions in the same order, enabling systematic comparison across
respondents.
study
A systematic attempt to collect and analyze evidence in order to test a
claim or answer a research question.
survivorship bias
Bias introduced by analyzing only cases that survived a selection process,
ignoring cases that did not.
T
thematic analysis
A qualitative analysis method in which data is coded and grouped into
themes through iterative passes.
time to restore service
The time it takes to recover normal operation after a failure in
production; one of the four DORA metrics used to measure software
delivery performance.
treatment
The intervention or condition whose effect an experiment is designed to
measure; applied to the treatment group but not to the control group.
triangulation
The use of multiple data sources, methods, or investigators to increase
confidence in a finding.
U
unstructured interview
An interview with no fixed list of questions, guided entirely by the
participant's responses, used to explore topics in depth without
constraining the direction of conversation.