Tidy Data and Polars Basics

Learning Goals

Lesson

i
"""Compute PEP 8 line-length compliance from file data."""

import polars as pl

df = pl.read_csv("data/line_lengths.csv")
print("Shape:", df.shape)
print("Columns:", df.columns)
print(df.head())

clean = df.drop_nulls("line_length")
total = clean["count"].sum()
over79 = clean.filter(pl.col("line_length") > 79)["count"].sum()
print(f"Lines over 79 chars: {over79:,} / {total:,} = {over79 / total:.1%}")
clean.write_csv("data/line_lengths_clean.csv")

Check Understanding

What are the three rules of tidy data? Give a concrete example of real data that violates each rule.

The three rules are: (1) each column is one variable, (2) each row is one observation, (3) one table per set of related observations. A spreadsheet that stores monthly sales figures in columns named jan, feb, mar violates rule 1 because the month is a variable that belongs in its own column. A CSV where some rows hold individual transaction data and other rows hold subtotals for a category violates rule 2 because a subtotal row is not an observation of the same kind as a transaction row. A single table that combines customer information (name, address) with order information (item, quantity, price) violates rule 3 because those are two different sets of observations that should be normalized into separate tables joined by a customer identifier.

The following code is supposed to double all positive values in a column, but it will raise an error in Polars. What is wrong and how do you fix it?
df = pl.read_csv("data.csv")
result = df.filter(pl.col("value") > 0)
result["value"] = result["value"] * 2  # this line

Polars dataframes are immutable: you cannot modify a column in place using index assignment the way you can with a Pandas dataframe. The line result["value"] = ... will raise a TypeError. The fix is to use with_columns to create a new dataframe with the modified column:

df = pl.read_csv("data.csv")
result = (df
          .filter(pl.col("value") > 0)
          .with_columns((pl.col("value") * 2).alias("value")))
Why is it wrong to fill null line lengths with 0 when computing PEP 8 compliance?

A null line length means the length of that line is unknown, not that the line has zero characters. Replacing null with 0 would classify every unknown-length line as compliant with PEP 8 (which requires lines to be 79 characters or shorter), because 0 is less than 79. This artificially inflates the compliance rate. The correct approach is to drop rows with null line lengths so they do not contribute to either the numerator or denominator of the compliance fraction, and then note in any report that some lines had missing length data.

What does the Acciai et al. finding imply for a study that relies on requesting data from paper authors?

It implies that the resulting dataset will not be a random sample of all relevant research. Researchers with names associated with certain ethnic backgrounds are less likely to respond to data requests, so the study will systematically underrepresent work from those groups. Any analysis built on that data will generalize only to the subset of research that was willing to share, which may differ from the full population in ways that matter for the conclusions. This is a form of selection bias that the study design itself introduced, not something in the underlying research being studied.

Exercises

Compliance by Directory

The line-length dataset includes a column indicating which part of the Python installation each file came from (for example, site-packages versus bin). Filter the cleaned data to each directory category separately and compute the fraction of lines exceeding 79 characters for each one. Present the results as a Polars dataframe with one row per directory. Write one sentence interpreting any difference you find: does it make sense that one category would have different style compliance than another?

Two Ways to Handle Nulls

Compute the fraction of lines exceeding 79 characters in two ways: first by dropping all rows where line_length is null before computing the fraction, and second by filling null line lengths with 0 before computing the fraction. Report both fractions. Explain in two or three sentences which approach is more defensible for a study about PEP 8 compliance, and under what circumstances (if any) filling with 0 would be the correct choice.

File-Level Versus Line-Level Compliance

The dataset allows you to compute compliance in two ways. The line-level rate is the fraction of individual lines that are 79 characters or shorter. The file-level rate is the fraction of files where every line is 79 characters or shorter. Compute both rates using the cleaned data and report which is larger. Write two or three sentences explaining which rate is more useful for a team that wants to enforce a style guide, and which is more useful for a researcher comparing style conventions across programming communities.

Alternative Data Sources

You are replicating a study that originally required data from paper authors, but you suspect that bias in data-sharing responses (as documented by Acciai et al. [Acciai2023]) may have skewed the original dataset. Propose two alternative data sources you could use instead of contacting authors directly. For each source, describe what data it provides, what bias it might introduce, and how you would check whether the alternative source gives similar results to the original data where overlap exists.

Making and Repairing Untidy Data

Take the cleaned line-length data and deliberately restructure it in two different ways that violate the tidy data rules: once by creating a version that violates rule 1 (one variable per column) and once by creating a version that violates rule 2 (one observation per row). Write the untidy versions to CSV files. Then write Polars code that reads each untidy file and transforms it back into a tidy dataframe equivalent to the original. Include a comment in each transformation explaining which tidy rule was violated and what operation restores it.