Mining Software Repositories
Learning Goals
- Extract contribution data from git commit histories using GitPython
- Compute Gini coefficients and plot Lorenz curves
- Identify hero developer patterns in open-source projects
- Recognize ethical and data-quality issues in MSR work
Git Objects and History
- Git stores four kinds of objects: blobs (file contents), trees (directory snapshots), commits (pointers to trees with metadata), and tags
- Every commit records a tree, zero or more parent commits (none for a root commit, two or more for a merge), an author, a committer, a timestamp for each of them, and a message
- The author and committer can differ: when one person writes a patch and another applies it, both are recorded
- GitPython gives you Python-level access to a local repository without shelling out to git
- repo.iter_commits() walks the commit graph from HEAD backward
- commit.author.name and commit.author.email identify the person who wrote the change
- commit.stats.total summarizes lines added and deleted across all files in that commit
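The GitPython calls listed above can be combined into a per-author tally. This is a minimal sketch, not a reference implementation: the helper names load_commits and summarize_by_author are made up for illustration, and the tally only assumes that each commit exposes author.email and a stats.total dictionary with "insertions" and "deletions" keys.

```python
def load_commits(repo_path):
    """Yield commits reachable from HEAD, newest first (requires GitPython)."""
    import git  # imported lazily so summarize_by_author works without GitPython

    repo = git.Repo(repo_path)
    yield from repo.iter_commits()


def summarize_by_author(commits):
    """Tally commits, insertions, and deletions per author email."""
    totals = {}
    for commit in commits:
        row = totals.setdefault(
            commit.author.email,
            {"commits": 0, "insertions": 0, "deletions": 0},
        )
        stats = commit.stats.total  # dict summed over all files in the commit
        row["commits"] += 1
        row["insertions"] += stats["insertions"]
        row["deletions"] += stats["deletions"]
    return totals
```

Calling summarize_by_author(load_commits("path/to/repo")), where the path is a placeholder for a local clone, returns one dictionary per author email. Computing stats for every commit can be slow on long histories, so it may be worth skipping that step when only commit counts are needed.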
Measuring Contribution
- Three common metrics: number of commits, lines added, and lines deleted
- These correlate strongly but not perfectly
- A contributor who adds 10,000 lines in one commit and one who makes 100 small fixes look very different by commit count but possibly similar by lines added
- Commit count is the most common metric in MSR studies because it is cheap to compute and less sensitive than line counts to whitespace-only changes
- Lines-added counts are inflated by code generation, vendored libraries, and bulk reformatting
- Deleting code also contributes; weighting insertions and deletions equally is a defensible choice that you should state explicitly
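The contrast between the metrics above can be made concrete with invented numbers. The sketch below uses two hypothetical contributors, "ana" with one 10,000-line commit and "ben" with 100 small fixes, and the name "churn" for insertions plus deletions weighted equally; none of these values come from a real project.

```python
# Hypothetical per-commit records: (author, lines_added, lines_deleted)
commits = [("ana", 10_000, 0)] + [("ben", 100, 60)] * 100


def contribution_metrics(records):
    """Aggregate commit count, lines added, lines deleted, and churn per author."""
    metrics = {}
    for author, added, deleted in records:
        row = metrics.setdefault(author, {"commits": 0, "added": 0, "deleted": 0})
        row["commits"] += 1
        row["added"] += added
        row["deleted"] += deleted
    for row in metrics.values():
        # Equal weighting of insertions and deletions is a stated choice.
        row["churn"] = row["added"] + row["deleted"]
    return metrics


metrics = contribution_metrics(commits)
for author, row in metrics.items():
    print(author, row)
```

By commit count the two contributors differ by a factor of 100, by lines added they are identical, and by churn "ben" comes out ahead, which is why the choice of metric should be reported alongside any result.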
Gini Coefficient and Lorenz Curve
- The Gini coefficient is a single number measuring inequality in a distribution
- 0 means perfect equality (every contributor has the same share)
- 1 means one person does everything and everyone else does nothing (with n contributors the largest attainable value is (n - 1)/n, which approaches 1 as n grows)
- For commit counts, values above 0.7 are common in open-source projects
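- The formula sorts values from smallest to largest and computes a rank-weighted sum: each value is multiplied by its rank, and the result is normalized by the number of contributors and the overall total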
- It is equivalent to the area between the Lorenz curve and the line of perfect equality, doubled
- The Lorenz curve plots cumulative share of contributors on the x-axis against cumulative share of commits on the y-axis
- Giger et al. used Gini to predict bug-prone files in Eclipse [Giger2011]
- Files where a single developer owned nearly all changes were more likely to contain bugs
- Ownership concentration is a measurable proxy for knowledge silos
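The relationship between the Lorenz curve and the Gini coefficient described above can be checked numerically. The following is a sketch in plain Python under the assumption that the input is a list of non-negative commit counts with a positive total; the function names lorenz_points and gini_from_lorenz are illustrative.

```python
def lorenz_points(values):
    """Return (xs, ys) for the Lorenz curve, starting at (0, 0)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    xs, ys, running = [0.0], [0.0], 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        xs.append(k / n)            # cumulative share of contributors
        ys.append(running / total)  # cumulative share of commits
    return xs, ys


def gini_from_lorenz(values):
    """Gini as twice the area between the equality line and the Lorenz curve."""
    xs, ys = lorenz_points(values)
    # Trapezoid rule for the area under the piecewise-linear Lorenz curve.
    area = sum(
        (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs))
    )
    return 1 - 2 * area
```

The xs and ys lists can be passed to any plotting library to draw the curve, with the diagonal from (0, 0) to (1, 1) as the line of perfect equality. For a piecewise-linear Lorenz curve, this area-based value agrees with the rank-weighted formula.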
Hero Developers
- Most open-source projects have one person — a hero developer — doing the majority of the work [Majumder2019]
- Typically more than 80% of commits come from roughly 20% of contributors
- The pattern holds across projects of very different sizes and ages
- Hero developers create risk: if they stop contributing, the project loses most of its institutional knowledge
- They also create measurement problems: their commit style dominates any aggregate statistic you compute
- Whether hero developers are a problem or just an efficient structure is a values question, not a statistical one
- Majumder et al. found that hero projects were not inherently lower quality, but they were more fragile
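The 80/20 pattern above can be turned into a simple check. This sketch assumes a list of per-contributor commit counts; the function names and the default cut-offs (top 20% of contributors, 80% of commits) are parameters chosen to mirror the rule of thumb, not a definition taken from the cited paper.

```python
def top_share(commit_counts, top_percent=20):
    """Share of all commits made by the most active top_percent of contributors."""
    counts = sorted(commit_counts, reverse=True)
    total = sum(counts)
    if not counts or total == 0:
        raise ValueError("need at least one contributor with commits")
    # Integer ceiling of n * top_percent / 100, with at least one contributor.
    k = max(1, (len(counts) * top_percent + 99) // 100)
    return sum(counts[:k]) / total


def looks_like_hero_project(commit_counts, top_percent=20, threshold=0.8):
    """True when the top group's share of commits reaches the threshold."""
    return top_share(commit_counts, top_percent) >= threshold
```

For a very small project the top 20% rounds up to a single person, so the result is then the same as the top-contributor share.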
Data Quality and Sampling
- He et al. found that GitHub star counts are routinely inflated by bots and purchased services [He2024]
- A sample of "popular" projects selected by star count is not representative of real adoption
- Clean your sampling frame before you start mining: check for bot accounts, mirrored repos, and star-farming patterns
- Merging contributor identities is harder than it looks
- The same developer may appear under different names, email addresses, or usernames across commits
- Name disambiguation is an active research problem; ignoring it inflates your contributor count
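To show why identity merging is harder than it looks, here is a deliberately naive sketch that groups (name, email) pairs whenever they share a normalized name or a lowercased email address. The heuristics and the function names normalize_name and merge_identities are assumptions for illustration; real disambiguation needs more care, since common names cause false merges and initials cause misses.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def merge_identities(identities):
    """Group (name, email) pairs linked by a shared normalized name or email."""
    parent = {}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path compression
            key = parent[key]
        return key

    for name, email in identities:
        name_key = ("name", normalize_name(name))
        email_key = ("email", email.strip().lower())
        parent.setdefault(name_key, name_key)
        parent.setdefault(email_key, email_key)
        parent[find(name_key)] = find(email_key)  # link the two keys

    groups = {}
    for name, email in identities:
        root = find(("email", email.strip().lower()))
        groups.setdefault(root, set()).add((name, email))
    return list(groups.values())
```

Comparing the number of groups with the number of distinct raw emails gives a rough sense of how much an unmerged contributor count would be inflated.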
Dirty Data in Version Control
- Flint et al. found that at least 35% of MSR papers use time-based data without cleaning it [Flint2021]
- Git timestamps are set by the committing machine's clock, which may be wrong
- Commits can be back-dated, cherry-picked across branches with old timestamps, or imported from another VCS
- A commit dated before the repository was created is a strong signal of dirty data
- Gold and Krinke argue that treating public Git data as ethically unconstrained is a mistake [Gold2020]
- Developers push code publicly to share software, not to participate in research
- Commit histories contain personal information: work hours, productivity patterns, professional relationships
- Mining that information without consent raises the same ethical questions as any other human subjects research
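The timestamp problems listed earlier in this section can be screened for mechanically. The sketch below assumes each commit has been reduced to a (sha, timezone-aware datetime) pair and that a plausible earliest date for the project is known; the function name flag_suspect_timestamps and the two category labels are hypothetical.

```python
from datetime import datetime, timezone


def flag_suspect_timestamps(commits, earliest, now=None):
    """Return commit ids dated before `earliest` or after `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    flagged = {"too_early": [], "future": []}
    for sha, when in commits:
        if when < earliest:
            flagged["too_early"].append(sha)  # e.g. imported or back-dated
        elif when > now:
            flagged["future"].append(sha)     # e.g. a wrong system clock
    return flagged
```

With GitPython, commit.authored_datetime and commit.committed_datetime are one possible source of the datetime values. A flagged commit is a prompt for manual inspection rather than proof of an error, since imported history can carry legitimately old dates.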
Code
"""Compute Gini coefficient and Lorenz curve for contributor data."""

import numpy as np
import polars as pl


def gini(values):
    """Compute Gini coefficient for an array of non-negative values."""
    # Sort ascending so each value can be weighted by its rank (1..n).
    arr = np.sort(np.array(values, dtype=float))
    n = len(arr)
    index = np.arange(1, n + 1)
    return (2 * (index * arr).sum() / (n * arr.sum())) - (n + 1) / n


projects = ["numpy", "scikit-learn", "shell-novice"]
for project in projects:
    # Each CSV is read for its commit_count column (one value per contributor).
    df = pl.read_csv(f"data/{project}_commits.csv")
    g = gini(df["commit_count"].to_numpy())
    top_share = df["commit_count"].max() / df["commit_count"].sum()
    print(f"{project}: Gini = {g:.3f}, top contributor share = {top_share:.1%}")
Check Understanding
What does a Gini coefficient of 0.85 mean for the distribution of commits in a project?
A Gini coefficient of 0.85 means the distribution of commits is highly unequal. In practice, a value that high usually means a very small fraction of contributors — perhaps one or two people — account for most of the commits, while many contributors have made only one or two. It does not tell you who those people are or whether the concentration is a good or bad thing, only that it exists. Compare it with a value near 0, where every contributor has committed roughly the same number of times.
The following function has a bug. What is wrong and how do you fix it?
def gini(values):
    arr = np.sort(values)  # values is a Polars Series
    n = len(arr)
    index = np.arange(1, n + 1)
    return (2 * (index * arr).sum() / (n * arr.sum())) - (n + 1) / n
np.sort works on NumPy arrays, but values here is a Polars Series. Passing a Polars Series to np.sort without converting it first may silently produce wrong results or raise a type error depending on the version. The fix is to convert explicitly before sorting:
def gini(values):
    arr = np.sort(np.array(values, dtype=float))
    n = len(arr)
    index = np.arange(1, n + 1)
    return (2 * (index * arr).sum() / (n * arr.sum())) - (n + 1) / n
Adding dtype=float also guards against integer overflow when the values are large.
Flint et al. found that at least 35% of MSR papers use time-based data without cleaning it. What is one type of data-quality problem specific to Git timestamps?
Git timestamps are recorded by the committing machine's clock, which may be wrong. A developer who commits on a laptop with an incorrect system clock will produce commits timestamped in the past or future. Commits imported from another version control system (SVN, Mercurial) often carry the original repository's timestamps, which predate the Git repository's creation. Either problem breaks any analysis that uses commit order or time between commits as a variable.
Why do Gold and Krinke argue that making Git commit data public does not automatically mean developers consented to being studied?
Developers make code public so that others can read and use it, not to participate in research. Consent to one use does not imply consent to all uses. Commit histories contain information that developers may not have intended to share as research data: work schedules, productivity patterns on specific days, and professional relationships between collaborators. Using that information in a study about individual behavior goes beyond what a typical developer would expect when pushing to a public repository, which is why Gold and Krinke argue for applying the same ethical standards used in human subjects research.
Exercises
Hero Developer Fraction
Compute the top-contributor share (commits by the most active contributor divided by total commits) for each of the three projects in the pre-collected dataset. Report which project is most concentrated. Then write two sentences about what a software team relying on that project should consider before assuming continued maintenance: one sentence about the practical risk and one sentence about what evidence would increase or decrease your concern.
Lines Added vs. Commit Count
Compute the Gini coefficient for lines added rather than commit count for each of the three projects. Report whether the ranking of projects by inequality changes when you switch metrics. Write two sentences explaining why commit count and lines added might give different pictures of contribution concentration — consider what kinds of contributions each metric captures and what each one is blind to.
Sampling Without Star Counts
He et al. showed that GitHub star counts can be artificially inflated by bots and purchased services, which means star count is a poor primary filter for selecting representative open-source projects. Design a two-step sampling procedure that avoids relying on star count as the main selection criterion. Write four sentences describing your procedure: what proxy you would use instead of stars, how you would define your initial population, what a second filter would eliminate, and what residual bias your procedure still cannot remove.
Ethical Limits of Commit Mining
Gold and Krinke argue that mining commit histories raises ethical questions even when the data is publicly accessible. Identify two specific pieces of information that appear in a typical commit history that a developer might not expect to be used in research. For each piece of information, write one sentence describing a potential harm that could result from including it in a published study without consent.
Timestamp Cleaning
Flint et al. found that at least 35% of MSR papers use time-based data without cleaning it. In the pre-collected commit dataset, check for commits with timestamps before the project's first known public release (use 2006-01-01 for NumPy, 2007-06-01 for scikit-learn, and 2014-01-01 for shell-novice as approximate lower bounds) and for commits with timestamps after today's date. Report how many such commits you find in each project. Then write one sentence explaining one plausible mechanism that could produce a commit with a timestamp in the future.