Lab: Python Coding Style at Scale

Learning Goals

The Study

Lab Workflow

Analysis Decisions That Change the Number

Wrap-Up: Every Number Is a Product of Choices

"""Replicate PEP 8 compliance analysis from Bafatakis et al. (2019)."""

import polars as pl
import altair as alt

df = pl.read_csv("data/line_lengths.csv").drop_nulls("line_length")
total = df["count"].sum()
over79 = df.filter(pl.col("line_length") > 79)["count"].sum()
print(f"Overall non-compliance: {over79 / total:.1%}")

# Split by directory type
lib = df.filter(pl.col("filepath").str.contains("site-packages"))
scripts = df.filter(~pl.col("filepath").str.contains("site-packages"))
for label, subset in [("Library", lib), ("Script", scripts)]:
    t = subset["count"].sum()
    o = subset.filter(pl.col("line_length") > 79)["count"].sum()
    print(f"{label} non-compliance: {o / t:.1%}")

# Histogram with PEP 8 limit marked
chart = (
    alt.Chart(df.to_pandas())
    .mark_bar()
    .encode(
        x=alt.X("line_length:Q", bin=alt.Bin(step=10), title="Line Length (chars)"),
        y=alt.Y("sum(count):Q", title="Total Lines"),
    )
    .properties(title="Distribution of Python Line Lengths")
)
rule = alt.Chart({"values": [{"x": 79}]}).mark_rule(color="red").encode(x="x:Q")
(chart + rule).save("figures/line_lengths.html")

Check Understanding

What three claims from Bafatakis et al. are you trying to reproduce in this lab?

The three claims are: (1) a substantial fraction of Python lines exceed the 79-character PEP 8 limit; (2) compliance varies by file purpose, with library code behaving differently from script code; and (3) the distribution of line lengths has a visible shape that a histogram can reveal. The first two are quantitative claims with specific numbers in the paper; the third is a qualitative claim about visual structure. Reproducing all three requires making and documenting choices about null handling, subset definition, and unit of analysis.

The code below tries to compute the non-compliance fraction for library files. What is wrong with it, and how do you fix it?
lib = df.filter(pl.col("filepath").contains("site-packages"))
non_compliant = lib.filter(pl.col("line_length") > 79).sum()
print(f"Non-compliant lines: {non_compliant}")

Two things are wrong. First, Polars string methods live under the .str namespace, so the correct call is pl.col("filepath").str.contains("site-packages"). Without the .str accessor, the expression has no contains method and the call raises an error. Second, calling .sum() on a filtered dataframe sums every column, not just the count column, so the result is a one-row dataframe rather than a scalar and the f-string prints a table instead of a number. To get the total number of non-compliant lines, select the count column first:

lib = df.filter(pl.col("filepath").str.contains("site-packages"))
non_compliant = lib.filter(pl.col("line_length") > 79)["count"].sum()
print(f"Non-compliant lines: {non_compliant}")

How does the unit of analysis (lines vs. files) affect the reported compliance rate?

Line-level compliance asks what fraction of all lines are 79 characters or fewer; file-level compliance asks what fraction of files have every line within the limit. The two rates can differ significantly. A file with 1,000 lines where 50 exceed 79 characters is 95% compliant at the line level, yet it counts as one fully non-compliant file at the file level. Because a single overlong line is enough to fail a whole file, and long files have more chances to contain one, file-level compliance is usually lower than line-level compliance. Choosing one over the other without explanation is an undisclosed analysis decision.

Why is this an analogous analysis rather than a direct replication of Bafatakis et al.?

A direct replication would use the same data — Python snippets from Stack Overflow — processed by the same method. This lab uses line-length data from a local Python installation, which is a different population: installed libraries and local scripts rather than code written to answer specific questions in a public forum. The two populations may have very different compliance rates for reasons that have nothing to do with the PEP 8 limit. Calling this a direct replication would imply that the populations are interchangeable, which they are not.

Exercises

Null Handling Decision

Load the line-length dataset. Write two versions of the analysis: one that drops rows where line_length is null, and one that fills null line lengths with 0. Compute the overall non-compliance fraction for each version and report both numbers. Write one sentence stating your exact decision about null handling and why it is more defensible for this particular question, and one sentence explaining what a reader would need to know to reproduce your result from the same raw data.

File-Level vs. Line-Level Compliance

Compute two compliance rates from the same dataset: line-level compliance (the fraction of all lines that are 79 characters or fewer) and file-level compliance (the fraction of files where every line is within the limit). Report both rates and the difference between them. Write one sentence explaining which rate Bafatakis et al. (2019) report, and one sentence explaining which rate a developer who wants to know "how much of my codebase would pass a PEP 8 check?" should care about.

Line Length Histogram

Plot the distribution of line lengths as a histogram with 10-character bins using Altair. Add a vertical rule at x = 79 in a contrasting color to mark the PEP 8 limit. Write two sentences describing what the distribution tells you about how developers actually write Python: for example, whether there is a visible spike just below 79, whether the distribution cuts off sharply at the limit, or whether long lines are rare or common.

Non-Compliance by Subset

Compute the non-compliance fraction for three non-overlapping subsets of the data: files with "test" anywhere in the filepath, files under site-packages, and all remaining files. Present the three results as a Polars dataframe with one row per subset and columns for the subset name, total line count, non-compliant line count, and non-compliance fraction. Write one sentence identifying which subset has the highest non-compliance rate and proposing one reason why that subset might behave differently from the others.

Sensitivity to Analysis Choices

Identify three analysis decisions you made while working through this lab: for example, how you handled null values, how you defined "library" files, and whether you used line-level or file-level compliance. For each decision, compute the non-compliance fraction using the opposite choice — the one you did not make. Report the six resulting numbers (three pairs). Write one sentence summarizing which decision had the largest effect on the reported fraction, and one sentence explaining what this implies for how much you should trust a single published number without seeing the analysis code.