Glossary

A

A/B testing: A controlled experiment in which participants or users are randomly assigned to one of two variants to measure which performs better on a defined metric.
affinity mapping: An analysis technique in which individual observations are written on separate notes and then grouped into clusters that share an underlying cause or theme, with each cluster named as a claim rather than a label.
Agile Manifesto: The 2001 statement of four values and twelve principles for agile software development, emphasizing individuals and interactions, working software, customer collaboration, and responding to change.
alternative hypothesis: The hypothesis that an effect exists; denoted H₁. Supported, though not proven, when the data lead to rejecting the null hypothesis.
axial coding: A stage in grounded theory analysis in which open codes are grouped into higher-level themes and the relationships between categories are examined.

B

blinding: An experimental design feature in which participants (single-blind), experimenters, or both (double-blind) do not know which condition a participant is in, reducing expectation bias.
blocking variable: A variable known to affect the outcome but not of experimental interest; controlled by grouping experimental units into blocks of similar values before randomization, guaranteeing that the nuisance factor's effect is distributed evenly across treatment conditions.
Bonferroni correction: A method for adjusting significance thresholds when performing multiple statistical tests, dividing the desired alpha level by the number of comparisons to control the family-wise error rate.
boredom effect: A threat to internal validity in which participants' performance or attention degrades over time due to fatigue or disengagement, rather than any effect of the treatment.
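
The arithmetic behind the Bonferroni correction entry above can be sketched in a few lines. This is a minimal illustration, not a full multiple-comparison procedure; the function names and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test threshold: the desired alpha divided by the number of comparisons."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag each p-value that falls below the Bonferroni-adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from four comparisons in one study.
p_values = [0.004, 0.020, 0.030, 0.600]
print(bonferroni_threshold(0.05, len(p_values)))  # adjusted threshold for 4 tests
print(significant_after_correction(p_values))
```

With four tests the threshold drops from 0.05 to 0.0125, so only the first p-value in this made-up list remains significant.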

C

change failure rate: The proportion of deployments that result in degraded service or require remediation; one of the four DORA metrics used to measure software delivery performance.
claim: An assertion about the world that a study attempts to support or refute through evidence.
Cohen's d: A measure of effect size for comparing two means, expressed in units of pooled standard deviation. Conventionally, d = 0.2 is small, d = 0.5 is medium, and d = 0.8 is large.
conclusion validity: The degree to which a study's statistical analysis is appropriate and adequately powered to detect the effect it reports; threatened by underpowered designs, violated test assumptions, and uncorrected multiple comparisons.
confirmation bias: The tendency to search for, interpret, and recall information in ways that confirm pre-existing beliefs, distorting the analysis of data.
confirmatory analysis: Analysis designed to test a pre-specified hypothesis using data collected expressly for that purpose, as opposed to exploratory analysis, which searches for patterns without a prior hypothesis.
confounding variable: A variable that is associated with both the independent variable and the outcome, potentially explaining an observed relationship without there being a direct causal link.
construct validity: The degree to which a measurement actually captures the concept it is intended to represent.
control group: The group in an experiment that does not receive the treatment; used as a baseline for comparison.
controlled experiment: A study in which one or more independent variables are manipulated while other factors are held constant, allowing causal claims to be made.
convenience sampling: Selecting participants based on ease of access rather than random selection, making the sample easier to recruit but harder to generalize from.
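
The Cohen's d entry above describes a difference of means in units of pooled standard deviation. The sketch below shows one common way to compute it, assuming two independent groups and sample (n - 1) variances; the function name and the data are hypothetical.

```python
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical task scores for two small groups.
with_tool = [2, 4, 6]
without_tool = [1, 3, 5]
print(cohens_d(with_tool, without_tool))
```

For these made-up scores the means differ by 1 and the pooled standard deviation is 2, giving d = 0.5, a medium effect by the convention in the entry.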

D

dependent variable: The outcome measured in an experiment; the variable expected to change in response to manipulation of the independent variable.
deployment frequency: How often an organization successfully deploys code to production; one of the four DORA metrics used to measure software delivery performance.
difference-in-differences: A quasi-experimental design that compares the change in outcome for a treated group to the change for a control group over the same period.
double-blind study: An experiment in which neither the participants nor the experimenters interacting with them know which condition each participant is in, reducing both participant expectation effects and experimenter bias.
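
The comparison described in the difference-in-differences entry above reduces to subtracting one change from another. The sketch below uses group means only and ignores uncertainty; the function name and the defect counts are hypothetical.

```python
from statistics import mean


def difference_in_differences(
    treated_before: list[float],
    treated_after: list[float],
    control_before: list[float],
    control_after: list[float],
) -> float:
    """Change in the treated group minus the change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical weekly defect counts for two teams, before and after
# one team adopts a new review process.
estimate = difference_in_differences(
    treated_before=[10, 12],
    treated_after=[7, 9],
    control_before=[11, 13],
    control_after=[10, 12],
)
print(estimate)
```

Here the treated team's defects fall by 3 and the control team's by 1, so the estimate attributes a change of -2 to the intervention. That reading assumes the two groups would otherwise have followed the same trend.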

E

effect size: A measure of the magnitude of an effect, independent of sample size. Examples include Cohen's d, odds ratio, and Pearson's r.
empirical software engineering: The subfield of software engineering that uses systematic data collection and analysis to study how software is built, maintained, and used.
evidence: Data or observations gathered through study to support or refute a claim; its quality and reliability depend on how it was collected.
experience sampling method: A research technique in which participants are prompted at random or scheduled intervals during their normal activities to report their current task, state, or perceptions, capturing in-the-moment data rather than retrospective recall.
exploratory analysis: Analysis that searches for patterns, relationships, or hypotheses in data without pre-specified predictions; used to generate rather than test hypotheses.
external validity: The degree to which study findings generalize beyond the specific sample, setting, and task studied.

F

file drawer problem: The tendency for null results to go unpublished, causing the published literature to overrepresent positive findings.
formative evaluation: Evaluation conducted during development to identify problems and guide improvement, as opposed to summative evaluation, which assesses a completed product. Formative studies typically use small samples and qualitative methods; their goal is to find things to fix, not to measure how often problems occur.
funnel plot: A scatter plot of each study's effect size against its sample size or another measure of precision, used in meta-analysis; asymmetry suggests publication bias.

G

gerund coding: A practice from grounded theory in which codes are written as verb phrases rather than nouns, capturing the actions, choices, and processes participants describe rather than labeling a static category.
Goal-Question-Metric (GQM): A structured approach to defining measurements in which a study goal is decomposed into questions whose answers would indicate success, and each question is linked to a specific operationalized metric.
Goodhart's Law: The principle that when a measure becomes a target, it ceases to be a good measure, because people optimize the measured variable rather than the underlying goal it was meant to track.
grounded theory: A qualitative research methodology in which theory is developed inductively from data through iterative coding and constant comparison, rather than testing a predetermined hypothesis.
guerrilla research: Informal user research conducted without formal institutional support, using convenience samples recruited opportunistically and methods that prioritize speed and low cost over statistical rigor. Appropriate for formative evaluation; not appropriate for making quantitative claims about populations.

H

habituation: A decrease in response to a repeated stimulus over time; a threat to internal validity when the decline is mistaken for an effect of the treatment.
HARKing: Hypothesizing After Results are Known: presenting exploratory findings as if they had been predicted in advance, inflating false positive rates.
Hawthorne effect: The tendency for people to change their behavior when they know they are being observed, independent of any intervention being studied. Named for a series of workplace studies at the Hawthorne Works factory in the 1920s and 1930s. It causes measured performance to differ from typical performance and is a threat to internal validity in studies where participants know they are being watched.

I

independent variable: The variable that is manipulated or selected by the researcher to examine its effect on an outcome.
informed consent: A research ethics requirement that participants know what data is being collected, how it will be used, and that they can withdraw without penalty.
intercoder reliability: The degree of agreement between two or more researchers independently applying the same coding scheme to qualitative data. Commonly measured with Cohen's kappa.
internal survey: A survey administered within a single organization to collect data from employees, commonly used to measure team practices, tool adoption, or workplace satisfaction.
internal validity: The degree to which a study can support the conclusion that the treatment caused the observed outcome, as opposed to some other explanation.
interrupted time series: A quasi-experimental design that looks for a change in trend at the point when an intervention occurred, using data from before and after.
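
The intercoder reliability entry above mentions Cohen's kappa, which compares observed agreement with the agreement expected by chance. The sketch below assumes exactly two coders who each assign one label per item; the function name and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Agreement between two coders, corrected for chance agreement."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Proportion of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's own label frequencies.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both coders use a single identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical codes applied by two researchers to four interview excerpts.
coder_a = ["bug", "bug", "feature", "feature"]
coder_b = ["bug", "feature", "feature", "feature"]
print(cohens_kappa(coder_a, coder_b))
```

In this made-up example the coders agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.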

J

K

L

lead time for change: The elapsed time between a code change being committed and that change running in production; one of the four DORA metrics used to measure software delivery performance.
leading question: A question worded in a way that suggests or implies a particular answer, potentially biasing respondents' replies.
learning effect: A threat to internal validity in within-subjects designs: participants improve on a later task or condition simply through practice, making the later treatment appear more effective regardless of its merit.
Likert scale: A survey response format using ordered categories such as "Strongly agree" to "Strongly disagree," typically with 5 or 7 points.
longitudinal study: A study that collects data from the same subjects at multiple points over an extended period, enabling researchers to track changes and developmental trends over time.

M

maturation effect: A threat to internal validity in which participants naturally change over the course of a study (e.g., growing older, more tired, or more experienced) independently of the treatment.
meta-analysis: A statistical technique for combining results from multiple independent studies of the same question to estimate an overall effect size.
mining software repositories (MSR): A research approach that extracts and analyzes data from version control systems, issue trackers, code review tools, and related sources.

N

natural experiment: A study that exploits real-world variation that approximates random assignment, without the researcher directly manipulating any variable.
non-response bias: A bias that arises when researchers analyze only survey respondents without accounting for those who did not reply, which can skew results if non-respondents differ systematically from respondents.
novelty effect: A threat to internal validity in which any new tool, method, or process receives a temporary performance boost because participants are motivated by its novelty, inflating the apparent benefit of the intervention.
nuisance variable: A variable that can affect the outcome but is not of primary interest; controlled through randomization, blocking, or statistical adjustment to isolate the effect of interest.
null hypothesis: The hypothesis that there is no effect; denoted H₀. Statistical testing evaluates whether the data is inconsistent with the null hypothesis.
null result: A study outcome in which no statistically significant effect is found; also called a negative result.

O

observational study: A study in which the researcher measures variables as they naturally occur, without manipulating any conditions.
open coding: The initial stage in qualitative data analysis in which the researcher reads through data and attaches descriptive labels (codes) to segments of text.
operationalization: The process of turning an abstract concept or research construct into a specific, measurable variable that can be observed and recorded in a study.
order effects: Changes in participant responses caused by the sequence in which conditions are presented, such as practice or fatigue, rather than the conditions themselves; a threat to internal validity in within-subjects designs.
overgeneralization: Drawing conclusions that extend beyond what the data actually support, typically by ignoring important differences in context, population, or sample.

P

p-hacking: Trying multiple analyses or data subsets until p < 0.05 is achieved, then reporting only that analysis, inflating the false positive rate.
p-value: The probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true. Not the probability that the null hypothesis is true.
pre-registration: A practice in which researchers publicly commit to their hypotheses and analysis plan before collecting data, reducing the risk of HARKing and p-hacking.
preprint: A version of a research paper that has been posted publicly before it has undergone formal peer review and official journal publication.
proxy metric: A measurable quantity used as a stand-in for a concept that is harder to measure directly.
publication bias: The tendency for journals to publish studies with statistically significant results more often than studies with null results.
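
The p-value entry above can be made concrete with an exact permutation test, used here only as one illustrative way to obtain a p-value. Under the null hypothesis the group labels are interchangeable, so the sketch counts how many relabelings produce a difference in means at least as extreme as the one observed. The function name and the data are hypothetical, and the approach is practical only for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a: list[float], group_b: list[float]) -> float:
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Enumerate every way of assigning len(group_a) of the pooled values to group A.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        relabeled_a = [pooled[i] for i in idx]
        relabeled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding in the comparison.
        if abs(mean(relabeled_a) - mean(relabeled_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical measurements from two conditions.
print(permutation_p_value([5, 6, 7], [1, 2, 3]))
```

For this made-up data, 2 of the 20 possible relabelings are at least as extreme as the observed split, so the p-value is 0.1. It describes the data under the null hypothesis, not the probability that the null hypothesis is true.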

Q

qualitative methods: Research approaches that collect and analyze non-numerical data such as interviews, observations, and documents in order to understand meaning, context, and experience.
quantitative methods: Research approaches that collect and analyze numerical data to describe patterns, test hypotheses, or estimate effect sizes.
quasi-experimental design: A research design that compares groups without random assignment, relying on pre-existing groups or naturally occurring conditions to estimate treatment effects.

R

randomization: The random assignment of participants to experimental conditions, so that known and unknown confounders are balanced across groups on average.
Rapid Iterative Testing and Evaluation (RITE): A usability testing method in which the most severe problem identified in each session is addressed before the next session begins, enabling rapid improvement with small participant pools. Appropriate for formative evaluation; not appropriate for comparing product versions or producing generalizable estimates of performance.
retrospective analysis: An analysis of data collected before the research question was formulated, such as examining historical logs, commit records, or past surveys.

S

sampling strategy: The method used to select participants from a population. Strategies include convenience sampling, random sampling, and stratified sampling.
saturation: In qualitative research, the point at which collecting additional data no longer introduces new codes or themes.
selection bias: Bias introduced when the sample is not representative of the population the researcher intends to generalize to.
semi-structured interview: An interview that follows a predetermined set of questions but allows the interviewer to probe further or deviate based on participant responses.
single-blind study: An experiment in which participants do not know which condition they are in, but the experimenters do. This reduces participant bias but leaves experimenter influence as a possible threat.
social desirability bias: The tendency for survey or interview respondents to give answers they believe are more socially acceptable rather than truthful responses, distorting self-reported data.
statistical power: The probability that a study will detect an effect if one truly exists; conventionally targeted at 0.80.
statistical significance: A determination that observed data is unlikely under the null hypothesis, typically using a threshold of p < 0.05; does not imply practical importance or a large effect.
stratified sampling: A sampling method in which the population is divided into subgroups (strata) and participants are drawn from each subgroup to ensure the sample reflects key population characteristics.
structured interview: An interview in which every participant is asked exactly the same questions in the same order, enabling systematic comparison across respondents.
study: A systematic attempt to collect and analyze evidence in order to test a claim or answer a research question.
summative evaluation: Evaluation conducted after development is complete to assess whether a product meets its goals, as opposed to formative evaluation conducted during development to guide improvement. Summative studies require controlled conditions and sufficient sample sizes to support quantitative claims.
survivorship bias: Bias introduced by analyzing only cases that survived a selection process, ignoring cases that did not.
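
The stratified sampling entry above can be sketched as grouping the population by stratum and drawing from each group separately. The sketch assumes proportional allocation (the same fraction from every stratum, with at least one unit each), which is only one of several possible allocation rules; the function name, parameters, and population are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T")


def stratified_sample(
    population: Sequence[T],
    stratum_of: Callable[[T], Hashable],
    fraction: float,
    seed: int = 0,
) -> list[T]:
    """Draw the same fraction of units, at random, from every stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata: dict[Hashable, list[T]] = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample: list[T] = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: eight backend and two frontend developers.
developers = [(f"dev{i}", "backend") for i in range(8)] + [(f"dev{i}", "frontend") for i in range(8, 10)]
sample = stratified_sample(developers, stratum_of=lambda dev: dev[1], fraction=0.5)
print(sample)
```

Sampling half of each stratum yields four backend and one frontend developer, so the smaller group cannot be missed by chance as it could be in a simple random sample of five.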

T

task-based testing: A research method in which participants are given specific, realistic tasks to complete with a product or system while the researcher observes and records their behavior, as opposed to open-ended exploration that produces impressions rather than observable actions.
thematic analysis: A qualitative analysis method in which data is coded and grouped into themes through iterative passes.
think-aloud protocol: A data collection technique in which participants verbalize their thoughts, decisions, and reactions as they perform tasks, providing access to real-time reasoning that retrospective reporting misses. Commonly used in usability testing and qualitative studies of how people interact with tools and interfaces.
time to restore service: The time it takes to recover normal operation after a failure in production; one of the four DORA metrics used to measure software delivery performance.
treatment group: The group in an experiment that receives the intervention or manipulation being studied; compared against the control group to estimate the treatment effect.
triangulation: The use of multiple data sources, methods, or investigators to increase confidence in a finding.

U

unconscious formalization: The tendency to treat informal, context-dependent practices as if they were formal, rule-based procedures, leading to oversimplified models of how work is actually done.
unstructured interview: An interview with no fixed list of questions, guided entirely by the participant's responses, used to explore topics in depth without constraining the direction of conversation.

V

W

winner's curse: The phenomenon in which studies that achieve statistical significance tend to overestimate the true effect size, because extreme results are more likely to cross the significance threshold.
within-subjects design: An experimental design in which each participant experiences all conditions, allowing comparison of outcomes within the same individual. This increases statistical power but is vulnerable to order effects.
