Mining Software Repositories

Learning Goals

Git Objects and History

Measuring Contribution

Gini Coefficient and Lorenz Curve

Hero Developers

Data Quality and Sampling

Dirty Data in Version Control

Code

"""Compute Gini coefficient and Lorenz curve for contributor data."""

import numpy as np
import polars as pl


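def gini(values):
    """Compute Gini coefficient for an array of non-negative values."""
    # Sort ascending so that rank i (1-based) pairs with the i-th smallest value.
    arr = np.sort(np.array(values, dtype=float))
    n = len(arr)
    total = arr.sum()
    # The formula divides by n * total, so reject empty or all-zero input.
    if n == 0 or total == 0:
        raise ValueError("gini() needs at least one positive value")
    index = np.arange(1, n + 1)
    return (2 * (index * arr).sum() / (n * total)) - (n + 1) / n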


projects = ["numpy", "scikit-learn", "shell-novice"]
for project in projects:
    df = pl.read_csv(f"data/{project}_commits.csv")
    g = gini(df["commit_count"].to_numpy())
    top_share = df["commit_count"].max() / df["commit_count"].sum()
    print(f"{project}: Gini = {g:.3f}, top contributor share = {top_share:.1%}")
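
The script above reports the Gini coefficient but does not build the Lorenz curve that its docstring mentions. A minimal sketch of the curve's points follows. It uses only the standard library so it runs without the dataset. The function name lorenz_points and the example counts are illustrative assumptions, not part of the lesson's code or data.

```python
from itertools import accumulate


def lorenz_points(values):
    """Return (cumulative contributor share, cumulative commit share) pairs."""
    # Sort ascending: the Lorenz curve starts with the least active contributors.
    ordered = sorted(float(v) for v in values)
    total = sum(ordered)
    if not ordered or total == 0:
        raise ValueError("need at least one positive value")
    n = len(ordered)
    points = [(0.0, 0.0)]  # the curve always starts at the origin
    for i, running in enumerate(accumulate(ordered), start=1):
        points.append((i / n, running / total))
    return points


# Hypothetical commit counts for five contributors (not from the dataset).
example_counts = [1, 1, 2, 6, 40]
for contributor_share, commit_share in lorenz_points(example_counts):
    print(f"{contributor_share:.0%} of contributors -> {commit_share:.1%} of commits")
```

Plotting these pairs against the diagonal from (0, 0) to (1, 1) gives the usual picture: the further the curve sags below the diagonal, the larger the Gini coefficient.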

Check Understanding

What does a Gini coefficient of 0.85 mean for the distribution of commits in a project?

A Gini coefficient of 0.85 means the distribution of commits is highly unequal. The coefficient runs from 0, where every contributor has made the same number of commits, toward 1, where a single contributor has made all of them. A value of 0.85 usually means a very small fraction of contributors, sometimes only one or two people, account for most of the commits, while many contributors have made only one or two. It does not tell you who those people are or whether the concentration is a good or bad thing, only that it exists.
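
To make the two ends of the scale concrete, the sketch below applies the same rank-weighted formula to two made-up projects with 20 contributors each. The function name gini_plain and both commit lists are assumptions chosen for illustration. The function is a dependency-free restatement of the formula, not a replacement for the NumPy version.

```python
def gini_plain(values):
    """Gini coefficient via the rank-weighted formula, standard library only."""
    ordered = sorted(float(v) for v in values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank 1 goes to the smallest value, rank n to the largest.
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical project A: 20 contributors with 10 commits each.
even_project = [10] * 20
# Hypothetical project B: 19 one-commit contributors and one with 181 commits.
hero_project = [1] * 19 + [181]

print(f"even project: {gini_plain(even_project):.3f}")
print(f"hero project: {gini_plain(hero_project):.3f}")
```

The even project comes out at 0, while the hero project, where one person holds 181 of 200 commits, comes out at roughly 0.855.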

The following function has a bug. What is wrong and how do you fix it?
def gini(values):
    arr = np.sort(values)  # values is a Polars Series
    n = len(arr)
    index = np.arange(1, n + 1)
    return (2 * (index * arr).sum() / (n * arr.sum())) - (n + 1) / n

np.sort works on NumPy arrays, but values here is a Polars Series. Passing a Polars Series to np.sort without converting it first may silently produce wrong results or raise a type error depending on the version. The fix is to convert explicitly before sorting:

def gini(values):
    arr = np.sort(np.array(values, dtype=float))
    n = len(arr)
    index = np.arange(1, n + 1)
    return (2 * (index * arr).sum() / (n * arr.sum())) - (n + 1) / n

Adding dtype=float also guards against integer overflow when the values are large, because index * arr is then computed in floating point rather than in a fixed-width integer type.

Flint et al. found that 35% of MSR papers use time-based data without cleaning it. What is one type of data-quality problem specific to Git timestamps?

Git timestamps are recorded by the committing machine's clock, which may be wrong. A developer who commits on a laptop with an incorrect system clock will produce commits timestamped in the past or future. Commits imported from another version control system (SVN, Mercurial) often carry the original repository's timestamps, which predate the Git repository's creation. Either problem breaks any analysis that uses commit order or time between commits as a variable.
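
One way to surface both problems is to flag commits whose timestamps fall outside a plausible window. The sketch below assumes each commit is a dictionary with a timezone-aware author_time field. That field name, the hashes, and the dates are hypothetical and would need to be adapted to whatever columns the real dataset provides.

```python
from datetime import datetime, timezone


def flag_suspect_timestamps(commits, earliest, latest):
    """Return (too_early, too_late) lists of commits outside [earliest, latest]."""
    too_early = [c for c in commits if c["author_time"] < earliest]
    too_late = [c for c in commits if c["author_time"] > latest]
    return too_early, too_late


# Hypothetical records; the field names and hashes are made up for illustration.
commits = [
    {"sha": "a1", "author_time": datetime(1999, 12, 31, tzinfo=timezone.utc)},
    {"sha": "b2", "author_time": datetime(2015, 6, 1, tzinfo=timezone.utc)},
    {"sha": "c3", "author_time": datetime(2099, 1, 1, tzinfo=timezone.utc)},
]
earliest = datetime(2014, 1, 1, tzinfo=timezone.utc)  # assumed lower bound
latest = datetime.now(timezone.utc)  # anything later is in the future

too_early, too_late = flag_suspect_timestamps(commits, earliest, latest)
print("before lower bound:", [c["sha"] for c in too_early])
print("in the future:", [c["sha"] for c in too_late])
```

Flagging is deliberately separate from dropping: an imported commit that predates the repository may still be legitimate history, so the decision to exclude it belongs to the analysis, not to the filter.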

Why do Gold and Krinke argue that public Git commit data does not automatically mean developers consented to being studied?

Developers make code public so that others can read and use it, not to participate in research. Consent to one use does not imply consent to all uses. Commit histories contain information that developers may not have intended to share as research data: work schedules, productivity patterns on specific days, and professional relationships between collaborators. Using that information in a study about individual behavior goes beyond what a typical developer would expect when pushing to a public repository, which is why Gold and Krinke argue for applying the same ethical standards used in human subjects research.

Exercises

Hero Developer Fraction

Compute the top-contributor share (commits by the most active contributor divided by total commits) for each of the three projects in the pre-collected dataset. Report which project is most concentrated. Then write two sentences about what a software team relying on that project should consider before assuming continued maintenance: one sentence about the practical risk and one sentence about what evidence would increase or decrease your concern.

Lines Added vs. Commit Count

Compute the Gini coefficient for lines added rather than commit count for each of the three projects. Report whether the ranking of projects by inequality changes when you switch metrics. Write two sentences explaining why commit count and lines added might give different pictures of contribution concentration. Consider what kinds of contributions each metric captures and what each one is blind to.

Sampling Without Star Counts

He et al. showed that GitHub star counts can be artificially inflated by bots and purchased services, which means star count is a poor primary filter for selecting representative open-source projects. Design a two-step sampling procedure that avoids relying on star count as the main selection criterion. Write four sentences describing your procedure: what proxy you would use instead of stars, how you would define your initial population, what a second filter would eliminate, and what residual bias your procedure still cannot remove.

Ethical Limits of Commit Mining

Gold and Krinke argue that mining commit histories raises ethical questions even when the data is publicly accessible. Identify two specific pieces of information that appear in a typical commit history that a developer might not expect to be used in research. For each piece of information, write one sentence describing a potential harm that could result from including it in a published study without consent.

Timestamp Cleaning

Flint et al. found that 35% of MSR papers use time-based data without cleaning it. In the pre-collected commit dataset, check for commits with timestamps before the project's first known public release (use 2006-01-01 for NumPy, 2007-06-01 for scikit-learn, and 2014-01-01 for shell-novice as approximate lower bounds) and for commits with timestamps after today's date. Report how many such commits you find in each project. Then write one sentence explaining one plausible mechanism that could produce a commit with a timestamp in the future.