Cleaning

Goals

Why Real Data Is Dirty

What kinds of problems appear in real datasets, and why?

Finding Missing Values

Show me how many missing or unknown values each column has.

or:

Using Polars, read collision_data.csv, treat U, Q, and N as missing values, then print a table showing each column name and how many missing values it has, sorted from most to fewest.

import polars as pl

# Two-character NCDB fields double the unknown codes (UU, QQ, NN), so both forms are listed
df = pl.read_csv("collision_data.csv", null_values=["U", "Q", "N", "UU", "QQ", "NN"])

missing = (
    df.null_count()
    .transpose(include_header=True, header_name="column", column_names=["missing"])
    .filter(pl.col("missing") > 0)
    .sort("missing", descending=True)
)
print(missing)

Checking Category Values

Show me every distinct value in the sex column.

or:

Using Polars, read collision_data.csv with U, Q, and N as missing, then print every distinct value in the P_SEX column and how many times each appears.

import polars as pl

df = pl.read_csv("collision_data.csv", null_values=["U", "Q", "N", "UU", "QQ", "NN"])
print(df["P_SEX"].value_counts().sort("count", descending=True))

Normalizing Values

Treat U and N as missing so only M and F remain in the sex column.

or:

Using Polars, read collision_data.csv with U, Q, and N as missing. Print the distinct values and counts in P_SEX after cleaning. Print the total number of rows before and after dropping rows where P_SEX is null.

import polars as pl

df = pl.read_csv("collision_data.csv", null_values=["U", "Q", "N", "UU", "QQ", "NN"])
print(f"Rows before cleaning: {len(df)}")

print("\nDistinct P_SEX values after null coding:")
print(df["P_SEX"].value_counts().sort("count", descending=True))

df_clean = df.drop_nulls(subset=["P_SEX"])
print(f"\nRows after dropping null P_SEX: {len(df_clean)}")

Understanding Sampling Bias

If collision records in remote areas are under-reported, what does that mean for our analysis?

Show me the distribution of C_PROV (province) in the data. Is it proportional to provincial population?

Why can an LLM not fix sampling bias for you?

Check Understanding

After treating U as missing in the P_SEX column, a classmate removes all rows where sex is null to get a "clean" dataset. What assumption are they making, and when is that assumption wrong?

They are assuming that rows with unknown sex are missing at random, i.e., that collisions where sex was not recorded are otherwise the same as collisions where it was. If sex is less likely to be recorded for pedestrians, cyclists, or collisions involving multiple vehicles, dropping those rows changes the composition of the dataset. The cleaned sample no longer represents the same population as the full collision database.

You run the missing-value check and find that the C_WTHR (weather condition) column is 40% missing. The LLM suggests filling in the missing values with "Clear" since that is the most common value. What is wrong with this approach?

Filling in 40% of a column with the most common value assumes that weather condition was missing randomly and that most missing records happened on clear days. Neither assumption is justified. Collisions in bad weather may be more likely to have incomplete police reports if responding officers were busy managing the scene. Filling missing weather values with "Clear" would make conditions appear better than they were for a large fraction of the data, biasing any analysis of weather effects on collision rates.

The LLM produces code that reads null_values=["U", "Q", "N"]. You run it and then print the distinct values of P_SEX. You still see "U" in the output. What likely went wrong?

Polars requires null_values to match the exact string in the file, including any surrounding whitespace. If the file stores " U" (with a leading space) rather than "U", the match fails and the value is kept as a string. Print df["P_SEX"].unique() and look carefully at the values; df["P_SEX"].str.strip_chars().unique() will reveal hidden spaces. Then either add the padded version to null_values or strip the column before checking.

You want to compare collision rates per 100 000 people between provinces. After cleaning, Alberta has 80 000 records and Prince Edward Island has 2 000. Should you compare raw counts or rates? Why?

Compare rates (collisions per 100 000 people), not raw counts. Alberta has roughly 25 times the population of PEI, so more collisions are expected regardless of road safety differences. Dividing by population puts the provinces on a comparable scale. The raw counts differ because the populations differ, not necessarily because one province has more dangerous roads.
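A quick back-of-envelope check, using the 2021 census populations (Alberta 4 262 635, PEI 154 331) and the record counts from the question:

```python
# Collisions per 100 000 people: raw counts differ 40-fold,
# but the rates end up on a comparable scale
populations = {"AB": 4_262_635, "PE": 154_331}
counts = {"AB": 80_000, "PE": 2_000}

for prov in populations:
    rate = counts[prov] / populations[prov] * 100_000
    print(f"{prov}: {counts[prov]:>6} collisions -> {rate:.0f} per 100 000 people")
```

The 40-to-1 gap in raw counts shrinks to well under 2-to-1 once population is accounted for, which is the whole point of rates.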

Exercises

Weather Conditions

Print the distinct values and counts for the C_WTHR (weather condition) column before and after treating unknown codes as missing. What proportion of collision records have known weather conditions?

Road Surface

The C_RSUR column codes the road surface condition (dry, wet, snow/ice, etc.). Ask the LLM to show the proportion of collisions for each road condition. Does icy or snowy road surface correspond to a higher proportion of fatal collisions (P_INJ = 1)?

Before and After

Ask the LLM to compute the percentage of collisions that resulted in at least one fatality (P_INJ = 1) before and after dropping rows where P_SEX is unknown. Does the fatality rate change? What does that tell you?

Provincial Distribution

Ask the LLM to count how many records come from each province (C_PROV). Compare the provincial shares to the 2021 census population shares. Which provinces appear over- or under-represented in the collision database?

Incomplete Null List

The following code reads the collision data and reports missing-value counts, but the number of null values is much lower than expected: the NCDB uses U, Q, and N as missing codes, but not all are declared. Work with an LLM to find what is missing and fix the null_values argument.

import polars as pl

df = pl.read_csv("collision_data.csv", null_values=["U"])

missing = (
    df.null_count()
    .transpose(include_header=True, header_name="column", column_names=["missing"])
    .filter(pl.col("missing") > 0)
    .sort("missing", descending=True)
)
print(missing)
print(f"\nDistinct C_WTHR values: {sorted(df['C_WTHR'].drop_nulls().unique().to_list())}")

How do you know the fix worked?

After fixing, the distinct values printed for C_WTHR should contain only numeric codes, not the letters Q or N. Re-run the missing-value count and confirm it is higher than before.

Hidden Whitespace in Categories

The following code counts province values, but prints more distinct provinces than Canada has. Work with an LLM to identify what is causing the extra groups and fix the code.

import polars as pl

df = pl.read_csv("collision_data.csv", null_values=["U", "Q", "N", "UU", "QQ", "NN"])

province_counts = (
    df["C_PROV"]
    .value_counts()
    .sort("count", descending=True)
)
print(f"Distinct province values: {province_counts.height}")
print(province_counts)

How do you know the fix worked?

Canada has 13 provinces and territories. After fixing, the number of distinct C_PROV values should be 13 or fewer. Print the unique values and confirm none look like duplicates with extra spaces.

Dropping Too Many Rows

The following code removes rows with unknown sex, but the row count after cleaning is far smaller than expected. Work with an LLM to explain why so many rows were dropped and rewrite the cleaning step.

import polars as pl

df = pl.read_csv("collision_data.csv", null_values=["U", "Q", "N", "UU", "QQ", "NN"])
print(f"Rows before cleaning: {len(df)}")

df_clean = df.drop_nulls()
print(f"Rows after cleaning: {len(df_clean)}")

print("\nDistinct P_SEX values after cleaning:")
print(df_clean["P_SEX"].value_counts())

How do you know the fix worked?

The correctly cleaned dataset should only lose the rows where P_SEX is null, not rows where any other column is null. Compare the before and after row counts to the count from normalize.py to verify.

Raw Counts vs. Rates

The following code prints the number of collision records per province, but raw counts are misleading because provinces have very different populations. Work with an LLM to add a column showing collisions per 100,000 people using 2021 census population figures.

import polars as pl

df = pl.read_csv("collision_data.csv", null_values=["U", "Q", "N", "UU", "QQ", "NN"])

province_counts = (
    df.group_by("C_PROV")
    .agg(pl.len().alias("collisions"))
    .sort("collisions", descending=True)
)
print(province_counts)
# TODO: add a column showing collisions per 100,000 people using 2021 census population
# figures for each province, then re-sort by rate instead of raw count

How do you know the addition is correct?

Compute one rate by hand for a province whose population you know, and confirm it matches the new column. Re-sort by rate: if the ordering changes significantly compared to raw counts, the rates are doing their job.