Descriptive Statistics

Learning Goals

Mean vs. Median

Variance and Standard Deviation

Percentiles

Skewness

When to Use a Log Scale

Replication Target

"""Descriptive statistics for PyPI release counts."""

import polars as pl

df = pl.read_csv("data/pypi_releases.csv")
counts = df["releases"]
print(f"Mean:   {counts.mean():.1f}")
print(f"Median: {counts.median():.1f}")
print(f"Std:    {counts.std():.1f}")
print(f"Max:    {counts.max()}")
for p in [10, 25, 75, 90]:
    print(f"  {p}th percentile: {counts.quantile(p / 100):.1f}")

Check Understanding

Why is the mean number of PyPI releases so much higher than the median?

The distribution of release counts is heavily right-skewed: most packages have a small number of releases, but a handful of packages have thousands. The mean is pulled upward by those extreme values, while the median reports the experience of the typical package. When one data point can be nearly 11,000 times larger than the median, the mean stops being a useful description of the center of the distribution.
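The pull of a single extreme value can be seen on a toy example. The sketch below uses only the standard library and a made-up list of ten release counts (not the PyPI data) to compare the mean and median with and without the largest value.

```python
from statistics import mean, median

# Hypothetical release counts: nine small packages and one very large one.
releases = [1, 2, 2, 3, 4, 4, 5, 6, 8, 2000]

print(f"Mean:   {mean(releases):.1f}")
print(f"Median: {median(releases):.1f}")

# Dropping the single extreme value leaves the median unchanged here,
# while the mean falls by a factor of about fifty.
without_outlier = releases[:-1]
print(f"Mean without outlier:   {mean(without_outlier):.1f}")
print(f"Median without outlier: {median(without_outlier):.1f}")
```

In this toy list the median is 4 in both cases, while the mean drops from 203.5 to about 3.9 once the largest value is removed.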

The code below tries to report the spread of release counts in the original units. What is wrong with it, and how do you fix it?

import polars as pl

df = pl.read_csv("data/pypi_releases.csv")
spread = df["releases"].var()
print(f"Spread in releases: {spread:.1f} releases")

Variance is measured in squared units, not in the original units. Writing "releases" after the variance value is wrong: the number is in "releases²", which has no intuitive meaning. To report spread in the original units, use standard deviation instead:

spread = df["releases"].std()
print(f"Spread in releases: {spread:.1f} releases")

If the goal is to emphasize spread without being misled by outliers, the interquartile range is an even better choice for this skewed dataset.
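As a rough illustration of that last point, the sketch below computes the standard deviation and the interquartile range for a made-up list of release counts (an assumption for illustration, not the PyPI data), using only the standard library. The choice of the "inclusive" quartile method is also an assumption; other interpolation methods give slightly different cut points.

```python
from statistics import quantiles, stdev

# Hypothetical release counts: nine small packages and one very large one.
releases = [1, 2, 2, 3, 4, 4, 5, 6, 8, 2000]

# quantiles(..., n=4) returns the three quartile cut points (Q1, Q2, Q3).
q1, q2, q3 = quantiles(releases, n=4, method="inclusive")
iqr = q3 - q1  # width of the middle 50% of the data

print(f"Std: {stdev(releases):.1f}")
print(f"IQR: {iqr:.1f}")
```

The standard deviation is dominated by the single large value, while the IQR describes only the middle half of the data. With the lesson's polars Series, the equivalent quantity would be counts.quantile(0.75) - counts.quantile(0.25), though the interpolation rule may differ from the one used here.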

When would a researcher reporting only the mean of a right-skewed distribution mislead readers?

Any time the distribution is heavily skewed, the mean describes a value that most observations are far below. If you report that the average PyPI package has 11 releases, a reader naturally imagines a typical package — but the typical package has only 4. The mean is dominated by the small number of packages with thousands of releases. A reader who acts on the mean (say, by setting an update-frequency benchmark) will set a standard that the vast majority of active packages cannot meet.

What does skewness measure, and why is most software engineering data positively skewed?

Skewness measures the asymmetry of a distribution: positive skew means the tail extends to the right, so a few very large values are far above the bulk of the data. Software engineering data is positively skewed because most of the quantities it measures (file sizes, function lengths, bug counts, release counts) are bounded below by zero and have no natural upper limit. The result is a pattern where most items are small, a few are medium-sized, and a small number are extremely large. This pattern appears in so many software engineering contexts that assuming a roughly normal distribution is almost always wrong.
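To make the definition concrete, here is a minimal sketch of the moment-based skewness coefficient, the mean of the cubed standardized deviations. The function name moment_skewness and both input lists are made up for illustration. The formula uses the population standard deviation, which should correspond to the default (biased) estimate in scipy.stats.skew.

```python
from statistics import fmean, pstdev


def moment_skewness(values):
    """Return the moment-based skewness: the mean of cubed z-scores."""
    m = fmean(values)
    sd = pstdev(values)  # population standard deviation (divides by n)
    return fmean([((v - m) / sd) ** 3 for v in values])


# Hypothetical data: one symmetric list and one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 4, 4, 5, 6, 8, 2000]

print(f"Symmetric:    {moment_skewness(symmetric):.2f}")
print(f"Right-skewed: {moment_skewness(right_skewed):.2f}")
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a long right tail like the one described above.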

Exercises

Zero-Release Packages

The PyPI dataset contains packages with zero releases. Compute the mean and median twice, once with these packages included and once with them dropped, and report all four numbers. Write one sentence explaining which treatment is more appropriate for answering the question "how often do active packages get updated?" and one sentence explaining why the choice of inclusion criteria belongs in the methods section of any paper that uses this data.

Skewness Before and After Log Transform

Compute the skewness of the raw release-count distribution using scipy.stats.skew. Then add 1 to each release count (to handle zeros), take the natural logarithm, and compute the skewness of the transformed values. Report both skewness values. Write one sentence interpreting what the reduction in skewness tells you about which scale — raw or log — is more appropriate for summarizing this data, and one sentence explaining why you add 1 before taking the log.

Effect of Trimming the Top 1%

Identify the threshold for the top 1% of packages by release count and remove all packages above that threshold. Recompute the mean and median on the trimmed dataset. Report the original values, the trimmed values, and the percentage change for each statistic. Write two sentences explaining which statistic changed more and what that implies for how sensitive the mean is to extreme values compared to the median.

Limits of PyPI as a Sample

The lesson uses PyPI data to describe "software projects," but PyPI is not representative of all software. Identify one specific way in which PyPI packages are not representative — for example, in terms of project age, project size, programming language, or type of software. Write one sentence stating the limitation clearly, and one sentence proposing an alternative dataset or sampling strategy that would reduce it.

Visualizing Mean vs. Median

Plot the PyPI release-count distribution on a log scale using Altair. Add two vertical rules: one at the mean and one at the median. Use different colors or stroke patterns to distinguish them, and add a legend. Write two sentences explaining what a reader should take away from the gap between the two lines — specifically, what it implies about which statistic to report when summarizing this distribution for a general audience.