Glossary

A

aggregation: Computing a summary value (such as a count, sum, mean, or maximum) for a group of rows in a dataframe.
Anscombe's quartet: Four datasets constructed to have nearly identical summary statistics (mean, variance, correlation) but very different distributions, used to illustrate why visualization must accompany numerical analysis.

B

Bonferroni correction: An adjustment to the significance threshold when running multiple statistical tests: divide the desired alpha level (e.g., 0.05) by the number of tests to control the family-wise error rate.
bootstrap resampling: A method for estimating the variability of a statistic by repeatedly drawing samples with replacement from the observed data and computing the statistic on each sample.
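box-and-whisker plot: A chart that displays the distribution of a variable using five summary statistics: the minimum, 25th percentile, median, 75th percentile, and maximum. Many implementations instead end the whiskers at 1.5 times the interquartile range beyond the box and draw more distant points individually. Also called a box plot.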
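
The bootstrap resampling entry above can be sketched with the Python standard library. This is a minimal illustration rather than a definitive implementation: the function name `bootstrap_means`, the sample values, the fixed seed, and the use of a simple percentile interval are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values with replacement from the observed data.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means


# Hypothetical sample: review times in hours.
sample = [2.0, 3.5, 1.0, 8.0, 4.5, 2.5, 6.0, 3.0]
means = sorted(bootstrap_means(sample))

# Percentile interval: the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"approximate 95% interval = ({lower:.2f}, {upper:.2f})")
```

The spread of the resampled means stands in for the variability of the statistic; the same loop works for a median or any other statistic by swapping the function applied to each resample.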

C

Cliff's delta: A non-parametric effect size measure for ordinal data, equal to the probability that a randomly chosen value from group A is larger than a randomly chosen value from group B, minus the probability that it is smaller. Ranges from -1 to 1.
codebook: A written description of each code used in qualitative analysis, including a definition, examples of text that fits the code, and examples of text that does not. A codebook enables independent researchers to apply the same coding scheme.
coefficient of determination (R²): The proportion of variance in the dependent variable that is explained by the regression model. Ranges from 0 to 1; higher values indicate a better fit.
Cohen's d: A standardized measure of the difference between two group means, expressed in units of the pooled standard deviation. Rough guidelines: d ≈ 0.2 is small, d ≈ 0.5 is medium, d ≈ 0.8 is large.
Cohen's kappa: A statistic measuring agreement between two coders beyond what would be expected by chance. Values below 0.4 indicate poor agreement; above 0.8 indicates near-perfect agreement.
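common-language effect size: An effect size measure that expresses a difference between two groups as a probability: the chance that a randomly chosen value from one group is larger than a randomly chosen value from the other.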
confidence interval: A range of values constructed so that, if the procedure were repeated many times, the interval would contain the true parameter value in a specified fraction (e.g., 95%) of repetitions.
confound: A third variable that affects both the apparent cause and the apparent effect in a study, making it difficult to determine whether the relationship between cause and effect is real.
constant comparison: A core technique in grounded theory in which each new piece of data is compared against everything coded so far, ensuring that the emerging theory is continuously refined.
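
The Cohen's d and Cliff's delta entries above describe two effect sizes that can be computed directly from their definitions. The following is a sketch under stated assumptions: the function names and the two small groups are hypothetical, and Cohen's d is computed here with the sample variance (n - 1 denominator) in the pooled standard deviation.

```python
import math
import statistics


def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd


def cliffs_delta(a, b):
    """P(value from a > value from b) minus P(value from a < value from b)."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Hypothetical scores for two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]
print(f"Cohen's d     = {cohens_d(group_a, group_b):.2f}")
print(f"Cliff's delta = {cliffs_delta(group_a, group_b):.2f}")
```

Cliff's delta compares every pair of values, so it depends only on their ordering, which is why it suits ordinal data; swapping the two groups flips its sign.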

D

deductive coding: Qualitative coding that applies a pre-defined framework or set of categories to the data, in contrast to inductive coding where categories emerge from the data.

E

effect size: A numerical measure of the magnitude of a difference or relationship, independent of sample size. Common effect size measures include Cohen's d, Cliff's delta, and the common-language effect size.
external validity: The degree to which the findings of a study generalize beyond the specific sample, setting, and time period studied.

F

G

Gini coefficient: A number between 0 and 1 measuring the inequality of a distribution: 0 means perfect equality (everyone contributes equally) and 1 means perfect inequality (one person contributes everything).
Goodhart's Law: The principle that once a measure is used to evaluate people, they adjust their behavior to optimize the measure, causing it to lose its usefulness as an indicator of the underlying quantity.
grounded theory: A qualitative research method that builds theory from data through iterative coding and constant comparison, rather than testing a pre-specified hypothesis. It is appropriate when the phenomenon is not well understood.
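
The Gini coefficient entry above can be made concrete with a short calculation. This sketch uses one common rank-weighted formula for a finite sample; the function name `gini` and the commit counts are assumptions for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Weight each value by its 1-based rank in the sorted order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical commit counts per contributor.
print(gini([10, 10, 10, 10]))  # everyone contributes equally
print(gini([0, 0, 0, 100]))    # one person contributes everything
```

With this formula a sample of n contributors reaches at most (n - 1) / n, so the value 1 is approached only as the number of contributors grows.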

H

hero developer: A contributor who is responsible for a disproportionately large share of work in a software project, often defined as the person responsible for more than 80% of all commits.
histogram: A bar chart that shows the distribution of a numerical variable by grouping values into bins and counting the number of observations in each bin.

I

inductive coding: Qualitative coding in which categories or themes emerge from reading the data, without a pre-defined framework.
inter-rater reliability: The degree to which two or more independent coders assign the same codes to the same data. Commonly measured with Cohen's kappa or Krippendorff's alpha.
interquartile range: The difference between the 75th and 25th percentiles of a distribution; a robust measure of spread that is not affected by extreme values.
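
The inter-rater reliability entry above names Cohen's kappa as a common measure. A minimal sketch of that statistic follows; the function name, the code labels, and the two coders' assignments are hypothetical, and the sketch does not handle the degenerate case where chance agreement equals 1.

```python
from collections import Counter


def cohens_kappa(coder_1, coder_2):
    """Agreement between two coders, corrected for chance agreement."""
    if len(coder_1) != len(coder_2) or not coder_1:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_1)
    # Observed agreement: fraction of items given the same code.
    observed = sum(a == b for a, b in zip(coder_1, coder_2)) / n
    # Expected agreement if each coder assigned codes independently
    # at their own observed rates.
    counts_1 = Counter(coder_1)
    counts_2 = Counter(coder_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned to the same ten interview excerpts.
coder_1 = ["bug"] * 6 + ["feature"] * 4
coder_2 = ["bug"] * 5 + ["feature"] * 4 + ["bug"]
print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")
```

Here the coders agree on 8 of 10 items, but because both use "bug" 60% of the time, roughly half of that agreement would be expected by chance, which is what kappa discounts.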

J

jitter: A small random displacement added to data points in a visualization to prevent them from overlapping. Useful when many observations have the same or similar values.
join: An operation that combines two dataframes by matching rows on one or more key columns. Types include inner, left, right, and outer joins.

K

Kruskal-Wallis test: A non-parametric test for comparing more than two independent groups, analogous to a one-way ANOVA but without the assumption of normality.

L

linear regression: A statistical model that describes the relationship between a dependent variable and one or more independent variables as a straight line, minimizing the sum of squared residuals.
log scale: An axis scale in which each equal interval represents a multiplication by a constant factor rather than an addition of a constant amount. Useful for data that spans several orders of magnitude or grows exponentially.
Lorenz curve: A graph that plots the cumulative share of a total (e.g., commits) against the cumulative share of contributors sorted from lowest to highest contribution, used to visualize inequality.
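
The linear regression entry above, together with the entries for residual and coefficient of determination, can be illustrated for the single-predictor case. This is a sketch, assuming one independent variable and the closed-form least-squares solution; the function name `fit_line` and the data points are hypothetical.

```python
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Residual: observed value minus the value predicted by the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R squared: share of the variance in y explained by the line.
mean_y = statistics.fmean(ys)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.4f}")
```

For a line fitted with an intercept, the residuals sum to zero, so it is their pattern (for example, a curve or a funnel shape when plotted) rather than their total that reveals problems with the model.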

M

Mann-Whitney U test: A non-parametric test for comparing two independent groups that does not assume normality. Also called the Wilcoxon rank-sum test.
mean: The arithmetic average of a set of values, computed by summing all values and dividing by the count. Sensitive to extreme values (outliers).
median: The middle value in a sorted list of numbers. If the list has an even number of values, the median is the average of the two middle values. More robust to outliers than the mean.
mixed methods: Research that combines quantitative and qualitative data collection and analysis in a single study. The quantitative part answers "how much" and "how often"; the qualitative part answers "why" and "in what context."
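
The mean and median entries above differ mainly in how they respond to extreme values. The short sketch below shows this with the standard library's `statistics` module; the commit counts are hypothetical.

```python
import statistics

# Hypothetical commits per contributor.
commits = [3, 4, 5, 6, 7]
with_outlier = commits + [500]  # add one extremely active contributor

print(statistics.mean(commits), statistics.median(commits))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Adding the single extreme value moves the mean from 5 to 87.5, while the median moves only from 5 to 5.5 (the average of the two middle values, 5 and 6, in the now even-length list).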

N

normality: The property of a distribution that matches (or closely approximates) the normal (Gaussian) bell-curve shape. Many parametric statistical tests assume normality.
null hypothesis: The default assumption in a statistical test that there is no effect or no difference between groups. The test evaluates the probability of observing the data if the null hypothesis were true.
null value: A missing or unknown value in a dataset, represented in Polars as null. Distinct from zero or an empty string.

O

observer effect: The tendency for people to change their behavior when they know they are being studied, which can bias the results of observational research.

P

p-hacking: The practice of running many statistical tests or trying many analysis choices until a result with p < 0.05 is found, inflating the false-positive rate. Also called data dredging.
p-value: The probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true or that the result is a fluke.
Pearson correlation coefficient: A measure of the linear relationship between two numerical variables, ranging from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship). Denoted r.
percentile: The value below which a given fraction of observations fall. The 25th percentile (Q1) is the value below which 25% of observations fall; the 75th percentile (Q3) is the value below which 75% fall.
pivot: A reshape operation that turns rows into columns (pivot) or columns into rows (unpivot), changing the layout of a dataframe without changing its content.
pre-registration: The practice of committing to a study's hypotheses, methods, and analysis plan before collecting data, in order to prevent p-hacking and hypothesizing after results are known (HARKing).
purposive sampling: A sampling strategy in which participants are selected because they have relevant experience or characteristics, rather than at random.
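
The p-hacking entry above says that running many tests inflates the false-positive rate. The arithmetic can be sketched as follows, assuming the tests are independent and every null hypothesis is true; the function name and the choice of 20 tests are illustrative. The second calculation applies the Bonferroni correction described earlier in this glossary.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
n_tests = 20

# Each test run at the unadjusted threshold.
uncorrected = family_wise_error_rate(alpha, n_tests)

# Bonferroni: each test run at alpha divided by the number of tests.
corrected = family_wise_error_rate(alpha / n_tests, n_tests)

print(f"uncorrected: {uncorrected:.3f}")
print(f"Bonferroni:  {corrected:.3f}")
```

With 20 tests at 0.05, the chance of at least one spurious "significant" result is about 64%; dividing the threshold by the number of tests brings it back below 5%.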

Q

QQ plot: A quantile-quantile plot that compares the distribution of a dataset against a theoretical distribution (usually normal) by plotting each quantile of the data against the corresponding quantile of the reference distribution. Points that fall on a straight line indicate a good fit.
qualitative data: Data consisting of text, images, or other non-numerical information that captures meaning, context, or experience. Analyzed by coding, thematic analysis, or grounded theory rather than by statistical tests.

R

replication: Repeating a study with the same or different materials to check whether the original findings hold. Exact replication uses the same data and analysis; conceptual replication tests the same hypothesis with different data or methods.
residual: The difference between an observed value and the value predicted by a model. Examining residuals reveals whether the model's assumptions are met.

S

scatter plot: A chart that displays the relationship between two numerical variables by plotting one variable on the x-axis and the other on the y-axis, with one point per observation.
selection bias: A distortion of results that occurs when the sample studied is not representative of the population of interest because of how it was selected.
skewness: A measure of the asymmetry of a distribution. A positive (right) skew means the distribution has a long tail to the right; a negative (left) skew has a long tail to the left.
snowball sampling: A sampling strategy in which existing participants refer others with relevant experience, used when the target population is hard to identify directly.
Spearman rank correlation: A measure of the monotonic relationship between two variables based on their ranks rather than raw values. More robust than Pearson correlation when data is ordinal or heavily skewed.
standard deviation: The square root of the variance; a measure of the spread of a distribution in the same units as the original data.
statistical power: The probability that a statistical test will correctly reject the null hypothesis when it is false. Higher power means a lower chance of missing a real effect.
Student's t-test: A parametric test for comparing the means of two groups, assuming the data in each group is approximately normally distributed and the two groups have equal variances.
survivorship bias: A form of selection bias in which only entities that "survived" a process are studied, ignoring those that did not survive, leading to overly optimistic conclusions.
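
The Spearman rank correlation entry above amounts to computing a Pearson correlation on ranks. The sketch below implements that idea with the standard library; the helper names, the tie-handling choice (tied values share the mean of their ranks), and the data are assumptions for illustration.

```python
import math
import statistics


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: monotonic but far from linear.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 1000]
print(f"Pearson r  = {pearson(xs, ys):.2f}")
print(f"Spearman   = {spearman(xs, ys):.2f}")
```

Because y always increases with x, the rank correlation is perfect, while the single extreme value pulls the Pearson coefficient well below 1.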

T

thematic analysis: A qualitative method for identifying, analyzing, and reporting patterns (themes) within textual data by systematically coding segments and grouping codes into themes.
theoretical saturation: The point in grounded theory data collection at which new interviews or observations stop producing new concepts, indicating that the emerging theory is sufficiently developed.
tidy data: A standard way of organizing data in which each column represents one variable, each row represents one observation, and each table contains one set of observations (Wickham 2014).
triangulation: In mixed methods research, using multiple data sources or methods to cross-check findings. When quantitative and qualitative results point in the same direction, confidence in the finding increases.
Type I error: Rejecting the null hypothesis when it is actually true (a false positive). The probability of a Type I error is the significance level (alpha).
Type II error: Failing to reject the null hypothesis when it is actually false (a false negative). The probability of a Type II error is 1 minus the statistical power.

U

V

variance: The average of the squared differences from the mean; a measure of how spread out a distribution is. Its square root is the standard deviation.
