Computing a summary value (such as a count, sum, mean, or maximum)
for a group of rows in a dataframe.
Anscombe's quartet
Four datasets constructed to have nearly identical summary
statistics (mean, variance, correlation) but very different
distributions, used to illustrate why visualization must accompany
numerical analysis.
B
Bonferroni correction
An adjustment to the significance threshold when running multiple
statistical tests: divide the desired alpha level (e.g., 0.05) by
the number of tests to control the family-wise error rate.
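The arithmetic behind the correction can be sketched in a few lines.
This is a minimal illustration, not a prescribed implementation: the
function names and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Return the per-test significance threshold: alpha / number of tests."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values, alpha=0.05):
    """Flag each p-value that falls below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from four tests; the threshold is 0.05 / 4 = 0.0125.
p_values = [0.001, 0.02, 0.04, 0.012]
print(significant_after_correction(p_values))  # [True, False, False, True]
```

Two of the four results would have passed an uncorrected 0.05
threshold but fail the corrected one.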
bootstrap resampling
A method for estimating the variability of a statistic by
repeatedly drawing samples with replacement from the observed data
and computing the statistic on each sample.
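The procedure can be sketched with the standard library alone. The
function name, the sample data, the number of resamples, and the fixed
seed are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_statistics(data, statistic, num_resamples=1000, seed=0):
    """Compute `statistic` on many resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    results = []
    for _ in range(num_resamples):
        resample = rng.choices(data, k=n)  # same size as the data, with replacement
        results.append(statistic(resample))
    return results


# Hypothetical observations.
data = [2, 3, 3, 5, 8, 13, 21]
boot_medians = bootstrap_statistics(data, statistics.median)

# The spread of the resampled medians estimates the variability of the median.
print(statistics.stdev(boot_medians))
```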
box-and-whisker plot
A chart that displays the distribution of a variable using five
summary statistics: the minimum, 25th percentile, median, 75th
percentile, and maximum. Also called a box plot.
C
Cliff's delta
A non-parametric effect size measure for ordinal data, equal to
the probability that a randomly chosen value from group A is
larger than a randomly chosen value from group B, minus the
probability that it is smaller. Ranges from -1 to 1.
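The definition of Cliff's delta translates directly into a count over
all pairs of values. The sketch below is a straightforward
(quadratic-time) version with a hypothetical function name; it is meant
to show the idea rather than to be efficient.

```python
def cliffs_delta(group_a, group_b):
    """Share of (a, b) pairs with a > b minus the share with a < b."""
    greater = sum(1 for a in group_a for b in group_b if a > b)
    less = sum(1 for a in group_a for b in group_b if a < b)
    return (greater - less) / (len(group_a) * len(group_b))


print(cliffs_delta([4, 5, 6], [1, 2, 3]))  # 1.0: every a exceeds every b
print(cliffs_delta([1, 2], [1, 2]))        # 0.0: wins and losses cancel
```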
codebook
A written description of each code used in qualitative analysis,
including a definition, examples of text that fits the code, and
examples of text that does not. A codebook enables independent
researchers to apply the same coding scheme.
coefficient of determination (R²)
The proportion of variance in the dependent variable that is
explained by the regression model. For a least-squares fit that
includes an intercept it ranges from 0 to 1; higher values indicate
a better fit.
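One common way to compute R² is as 1 minus the ratio of the residual
sum of squares to the total sum of squares. The sketch below assumes
that formula; the function name and the data are hypothetical.

```python
def r_squared(observed, predicted):
    """Return 1 - SS_residual / SS_total for paired observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


observed = [1, 2, 3]
print(r_squared(observed, [1, 2, 3]))  # 1.0: predictions match exactly
print(r_squared(observed, [2, 2, 2]))  # 0.0: no better than predicting the mean
```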
Cohen's d
A standardized measure of the difference between two group means,
expressed in units of the pooled standard deviation. Rough
guidelines: d ≈ 0.2 is small, d ≈ 0.5 is medium, d ≈ 0.8 is large.
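A minimal sketch of the calculation follows. It assumes the usual
pooled standard deviation built from the two sample variances; the
function name and the data are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Means of 4 and 3 with a pooled standard deviation of 2 give d = 0.5.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```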
Cohen's kappa
A statistic measuring agreement between two coders beyond what
would be expected by chance. As a rough guide, values below 0.4
indicate poor agreement and values above 0.8 indicate near-perfect
agreement.
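Kappa compares the observed agreement with the agreement expected if
each coder assigned labels at random according to their own label
frequencies. The sketch below assumes that standard formula; the
function name and the codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders' labels."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must label the same items")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    counts_a = Counter(codes_a)
    counts_b = Counter(codes_b)
    # Chance agreement: product of each coder's label proportions, summed.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Undefined (division by zero) if both coders use one identical label throughout.
    return (observed - expected) / (1 - expected)


coder_1 = ["x", "x", "y", "y"]
coder_2 = ["x", "x", "y", "x"]
print(cohens_kappa(coder_1, coder_2))  # observed 0.75, expected 0.5, kappa 0.5
```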
common-language effect size
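An effect size measure that expresses results as a probability in
plain-language terms: the probability that a randomly chosen value
from one group is larger than a randomly chosen value from the
other.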
confidence interval
A range of values constructed so that, if the procedure were
repeated many times, the interval would contain the true parameter
value in a specified fraction (e.g., 95%) of repetitions.
confound
A third variable that affects both the apparent cause and the
apparent effect in a study, making it difficult to determine
whether the relationship between cause and effect is real.
constant comparison
A core technique in grounded theory in which each new piece of
data is compared against everything coded so far, ensuring that
the emerging theory is continuously refined.
D
deductive coding
Qualitative coding that applies a pre-defined framework or set of
categories to the data, in contrast to inductive coding where
categories emerge from the data.
E
effect size
A numerical measure of the magnitude of a difference or
relationship, independent of sample size. Common effect size
measures include Cohen's d, Cliff's delta, and the common-language
effect size.
external validity
The degree to which the findings of a study generalize beyond the
specific sample, setting, and time period studied.
F
G
Gini coefficient
A number between 0 and 1 measuring the inequality of a
distribution: 0 means perfect equality (everyone contributes
equally) and 1 means perfect inequality (one person contributes
everything).
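One common formula computes the Gini coefficient from the values
sorted in ascending order. The sketch below assumes that formula and
uses a hypothetical function name. With this formula a group of n
people in which one person contributes everything scores (n - 1) / n,
which approaches 1 as the group grows.

```python
def gini(values):
    """Gini coefficient of non-negative values, using the sorted-rank formula."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([5, 5, 5, 5]))   # 0.0: everyone contributes equally
print(gini([0, 0, 0, 10]))  # 0.75: one of four people contributes everything
```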
Goodhart's Law
The principle that once a measure is used to evaluate people, they
adjust their behavior to optimize the measure, causing it to lose
its usefulness as an indicator of the underlying quantity.
grounded theory
A qualitative research method that builds theory from data through
iterative coding and constant comparison, rather than testing a
pre-specified hypothesis. It is appropriate when the phenomenon is
not well understood.
H
hero developer
A contributor who is responsible for a disproportionately large
share of work in a software project, often defined as the person
responsible for more than 80% of all commits.
histogram
A bar chart that shows the distribution of a numerical variable by
grouping values into bins and counting the number of observations
in each bin.
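The binning step behind a histogram can be sketched without a plotting
library. The function name, the fixed bin width, and the data are
assumptions for illustration; each bin is identified by its left edge
and includes that edge but not the right one.

```python
def histogram_counts(values, bin_width, start=0):
    """Count observations per bin; keys are the left edges of the bins."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value falls in
        left_edge = start + index * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


print(histogram_counts([1, 2, 5, 7, 8, 12], bin_width=5))  # {0: 2, 5: 3, 10: 1}
```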
I
inductive coding
Qualitative coding in which categories or themes emerge from
reading the data, without a pre-defined framework.
inter-rater reliability
The degree to which two or more independent coders assign the same
codes to the same data. Commonly measured with Cohen's kappa or
Krippendorff's alpha.
interquartile range
The difference between the 75th and 25th percentiles of a
distribution; a robust measure of spread that is largely unaffected
by extreme values.
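The sketch below computes the interquartile range with Python's
statistics.quantiles. Several conventions exist for locating
quartiles; the "inclusive" method used here is one assumption among
them, and the function name and data are hypothetical.

```python
import statistics


def iqr(values):
    """75th percentile minus 25th percentile, using the 'inclusive' method."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

# The range jumps from 8 to 899, but the IQR is the same for both lists.
print(iqr(clean), iqr(with_outlier))
```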
J
jitter
A small random displacement added to data points in a
visualization to prevent them from overlapping. Useful when many
observations have the same or similar values.
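Jitter amounts to adding a small random offset to each plotted value.
The sketch below uses a uniform offset; the function name, the offset
size, and the fixed seed are assumptions for illustration.

```python
import random


def jitter(values, amount=0.1, seed=0):
    """Shift each value by a random offset drawn uniformly from [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the plot is repeatable
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical ratings with many identical values that would overlap in a plot.
ratings = [3, 3, 3, 4, 4, 5]
print(jitter(ratings))
```

The jittered values are for display only; any statistics should be
computed on the original data.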
join
An operation that combines two dataframes by matching rows on one
or more key columns. Types include inner, left, right, and outer
joins.
K
Kruskal-Wallis test
A non-parametric test for comparing more than two independent
groups, analogous to a one-way ANOVA but without the assumption of
normality.
L
linear regression
A statistical model that describes the relationship between a
dependent variable and one or more independent variables as a
linear function (a straight line when there is one independent
variable), with coefficients chosen to minimize the sum of squared
residuals.
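For a single independent variable, the least-squares line has a
closed-form solution. The sketch below assumes that case; the function
names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed value minus predicted value for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


xs = [0, 1, 2, 3]
ys = [1, 3, 4, 8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(residuals(xs, ys, slope, intercept))
```

With an intercept in the model, the residuals of a least-squares fit
sum to zero (up to floating-point rounding).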
log scale
An axis scale in which each equal interval represents a
multiplication by a constant factor rather than an addition of a
constant amount. Useful for data that spans several orders of
magnitude or grows exponentially.
Lorenz curve
A graph that plots the cumulative share of a total (e.g., commits)
against the cumulative share of contributors sorted from lowest to
highest contribution, used to visualize inequality.
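The points of a Lorenz curve can be computed with a running total over
the sorted contributions. This is a minimal sketch with a hypothetical
function name and hypothetical commit counts; it returns the points
and leaves plotting to the reader.

```python
def lorenz_points(contributions):
    """Return (cumulative share of contributors, cumulative share of total) pairs."""
    values = sorted(contributions)  # lowest contribution first
    total = sum(values)
    n = len(values)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(values, start=1):
        running += value
        points.append((i / n, running / total))
    return points


# Hypothetical commit counts for four contributors.
for x, y in lorenz_points([1, 1, 2, 4]):
    print(x, y)
```

The further the curve sags below the diagonal from (0, 0) to (1, 1),
the more unequal the distribution.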
M
Mann-Whitney U test
A non-parametric test for comparing two independent groups that
does not assume normality. Also called the Wilcoxon rank-sum test.
mean
The arithmetic average of a set of values, computed by summing all
values and dividing by the count. Sensitive to extreme values
(outliers).
median
The middle value in a sorted list of numbers. If the list has an
even number of values, the median is the average of the two middle
values. More robust to outliers than the mean.
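The difference in robustness between the mean and the median shows up
in a small example using Python's statistics module. The data are
hypothetical.

```python
import statistics

typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

# Both summaries agree on the symmetric data: mean 3, median 3.
print(statistics.mean(typical), statistics.median(typical))

# One extreme value pulls the mean up to 22; the median stays at 3.
print(statistics.mean(with_outlier), statistics.median(with_outlier))

# With an even number of values the median averages the two middle ones.
print(statistics.median([1, 2, 3, 4]))  # 2.5
```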
mixed methods
Research that combines quantitative and qualitative data
collection and analysis in a single study. The quantitative part
answers "how much" and "how often"; the qualitative part answers
"why" and "in what context."
N
normality
The property of a distribution that matches (or closely
approximates) the normal (Gaussian) bell-curve shape. Many
parametric statistical tests assume normality.
null hypothesis
The default assumption in a statistical test that there is no
effect or no difference between groups. The test evaluates the
probability of observing the data if the null hypothesis were
true.
null value
A missing or unknown value in a dataset, represented in Polars as
null. Distinct from zero or an empty string.
O
observer effect
The tendency for people to change their behavior when they know
they are being studied, which can bias the results of
observational research.
P
p-hacking
The practice of running many statistical tests or trying many
analysis choices until a result with p < 0.05 is found, inflating
the false-positive rate. Also called data dredging.
p-value
The probability of observing data at least as extreme as what was
observed, assuming the null hypothesis is true. It is not the
probability that the null hypothesis is true or that the result is
a fluke.
Pearson correlation coefficient
A measure of the linear relationship between two numerical
variables, ranging from -1 (perfect negative linear relationship)
to 1 (perfect positive linear relationship). Denoted r.
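The coefficient is the covariance of the two variables divided by the
product of their standard deviations. The sketch below writes that out
by hand with a hypothetical function name and hypothetical data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # close to 1: perfect positive line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1: perfect negative line
```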
percentile
The value below which a given fraction of observations fall. The
25th percentile (Q1) is the value below which 25% of observations
fall; the 75th percentile (Q3) is the value below which 75% fall.
pivot
A reshape operation that turns rows into columns (pivot) or
columns into rows (unpivot), changing the layout of a dataframe
without changing its content.
pre-registration
The practice of committing to a study's hypotheses, methods, and
analysis plan before collecting data, in order to prevent
p-hacking and hypothesizing after results are known (HARKing).
purposive sampling
A sampling strategy in which participants are selected because
they have relevant experience or characteristics, rather than at
random.
Q
QQ plot
A quantile-quantile plot that compares the distribution of a
dataset against a theoretical distribution (usually normal) by
plotting each quantile of the data against the corresponding
quantile of the reference distribution. Points that fall on a
straight line indicate a good fit.
qualitative data
Data consisting of text, images, or other non-numerical
information that captures meaning, context, or experience.
Analyzed by coding, thematic analysis, or grounded theory rather
than by statistical tests.
R
replication
Repeating a study with the same or different materials to check
whether the original findings hold. Exact replication uses the
same data and analysis; conceptual replication tests the same
hypothesis with different data or methods.
residual
The difference between an observed value and the value predicted
by a model. Examining residuals reveals whether the model's
assumptions are met.
S
scatter plot
A chart that displays the relationship between two numerical
variables by plotting one variable on the x-axis and the other on
the y-axis, with one point per observation.
selection bias
A distortion of results that occurs when the sample studied is not
representative of the population of interest because of how it was
selected.
skewness
A measure of the asymmetry of a distribution. A positive (right)
skew means the distribution has a long tail to the right; a
negative (left) skew has a long tail to the left.
snowball sampling
A sampling strategy in which existing participants refer others
with relevant experience, used when the target population is hard
to identify directly.
Spearman rank correlation
A measure of the monotonic relationship between two variables
based on their ranks rather than raw values. More robust than
Pearson correlation when data is ordinal or heavily skewed.
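Spearman's coefficient can be computed by replacing each value with
its rank and then applying the Pearson formula to the ranks. The
sketch below assumes that approach, with tied values sharing the
average of their ranks; the function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of the data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotonic but non-linear relationship: rho is 1 (up to rounding),
# while Pearson's r on the raw values is below 1.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(spearman_rho(xs, ys), pearson_r(xs, ys))
```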
standard deviation
The square root of the variance; a measure of the spread of a
distribution in the same units as the original data.
statistical power
The probability that a statistical test will correctly reject the
null hypothesis when it is false. Higher power means a lower
chance of missing a real effect.
Student's t-test
A parametric test for comparing the means of two groups, assuming
the data in each group is approximately normally distributed and,
in its classic form, that the two groups have equal variances.
survivorship bias
A form of selection bias in which only entities that "survived" a
process are studied, ignoring those that did not survive, leading
to overly optimistic conclusions.
T
thematic analysis
A qualitative method for identifying, analyzing, and reporting
patterns (themes) within textual data by systematically coding
segments and grouping codes into themes.
theoretical saturation
The point in grounded theory data collection at which new
interviews or observations stop producing new concepts, indicating
that the emerging theory is sufficiently developed.
tidy data
A standard way of organizing data in which each column represents
one variable, each row represents one observation, and each table
contains one set of observations (Wickham 2014).
triangulation
In mixed methods research, using multiple data sources or methods
to cross-check findings. When quantitative and qualitative results
point in the same direction, confidence in the finding increases.
Type I error
Rejecting the null hypothesis when it is actually true (a false
positive). The probability of a Type I error is the significance
level (alpha).
Type II error
Failing to reject the null hypothesis when it is actually false (a
false negative). The probability of a Type II error is 1 minus the
statistical power.
U
V
variance
The average of the squared differences from the mean; a measure of
how spread out a distribution is. Its square root is the standard
deviation.
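The definition above describes the population variance, which divides
by the number of values; the sample variance divides by one less than
that. The sketch below implements the population version with
hypothetical function names and data.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
```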