Joining Data

Goals

Why Data Arrives in Pieces

Why would data about the same topic be stored in two separate files?

What is a join?

Joining the Tables

Combine the temperature dataframe and the stations dataframe so each temperature row also shows the station's province and elevation.

or

Using Polars, read ahccd_temp.csv and ahccd_stations.csv. Join them so each temperature row gains the matching station's province and elevation columns. Use a left join on the station_id column. Print the first five rows of the result.

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
print("Temperature file columns:", temps.columns)
print(f"Temperature rows: {temps.height}")

stations = pl.read_csv("ahccd_stations.csv", null_values=[""])
print("\nStation file columns:", stations.columns)
print(f"Station rows: {stations.height}")
i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="left",
)
print(combined.head())

Checking the Join Result

How do I know the join worked correctly?

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="left",
)

null_province = combined["province"].null_count()
print(f"Rows with null province after join: {null_province} of {combined.height}")

Comparing Groups

Show me the mean annual temperature by province.

or

Using the joined dataframe, compute the mean annual temperature for each province. Sort from warmest to coolest. Drop rows where temperature or province is null.

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="left",
)

mean_by_province = (
    combined.drop_nulls(subset=["temp_mean", "province"])
    .group_by("province")
    .agg([
        pl.col("temp_mean").mean().alias("mean_temp_c"),
        pl.col("temp_mean").count().alias("n_readings"),
    ])
    .sort("mean_temp_c", descending=True)
)
print(mean_by_province)

What does it mean when two provinces have different mean temperatures?

Counting Rows Before and After

How do I detect if the join accidentally multiplied my rows?

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

print(f"Temperature rows before join: {temps.height}")

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="left",
)
print(f"Rows after join: {combined.height}")

if combined.height > temps.height:
    print("WARNING: join multiplied rows. Check for duplicate station IDs.")
    dupes = (
        stations.group_by("station_id")
        .count()
        .filter(pl.col("count") > 1)
    )
    print(f"Stations with duplicate entries: {dupes.height}")

Check Understanding

After a left join, the province column has 3 500 null values where it had zero before. The temperature file has 200 000 rows. What is the most likely cause, and how would you find out?

The most likely cause is that some station IDs in the temperature file do not appear in the station inventory. They may use a different format (with or without leading zeros, with or without a country prefix). To diagnose: find the distinct station_id values in temperature rows where province is null, and check whether any of those IDs appear in the station file in a slightly different format. If they match after stripping a prefix, ask the LLM to normalize the IDs before joining.

The joined dataframe has 210 000 rows but the temperature file has only 200 000. A classmate says "the join added information so more rows is expected." Are they right?

No. A left join cannot add rows: every row in the result corresponds to exactly one row in the left table. Having more rows after the join means the station inventory has duplicate entries for some station IDs. Each temperature reading for a duplicated station matched multiple station rows and was replicated. Find duplicated station IDs with stations.group_by("station_id").count().filter(pl.col("count") > 1).

You find that Alberta has a mean annual temperature of 2.1 °C and Prince Edward Island has 6.8 °C. A classmate says "this proves maritime provinces are warmer than prairie ones." What is a more careful interpretation?

The AHCCD stations in Alberta include many in the far north and at high elevations, while PEI stations are all at low elevation near the ocean. The difference reflects station geography as much as provincial climate. A fairer comparison would control for latitude and elevation, or compare only stations in similar geographic settings. "Maritime provinces appear warmer than prairie provinces in this dataset" is accurate; "maritime provinces are warmer" is an overclaim.

You want to compute the mean temperature for the Arctic vs. non-Arctic parts of Canada. The temperature file has station IDs and the station file has latitude. Describe the two steps needed before grouping by Arctic status.

First, join the temperature file to the station file on station_id so each temperature row gains the latitude column. Second, add a column that classifies each station as Arctic (latitude above 60° N) or non-Arctic, then group by that classification and compute the mean temperature. Without the join, the temperature file has no latitude information.

Exercises

Elevation Effect

Bin stations into three elevation groups: below 300 m, 300-1 000 m, and above 1 000 m. Ask the LLM to compute the mean annual temperature for each group. Is there a consistent relationship between elevation and temperature?

Longest Records

Find the ten stations with the longest continuous temperature records (most years of data). What provinces are they in? Is there a geographic pattern?

Missing Data by Province

Compute the proportion of null temperature values for each province. Is missing data evenly distributed, or do some provinces have much less data? What might cause that pattern?

Station Count per Province

After the join, count how many distinct stations each province has in the temperature file. Compare the station count to the province's area. Are larger provinces better-covered?

Inner Join Loses Rows

The following code joins the temperature and station files, but the combined row count is smaller than the number of temperature records. Work with an LLM to explain why rows disappeared and fix the join type.

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

print(f"Temperature rows before join: {len(temps)}")

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="inner",
)
print(f"Rows after join: {len(combined)}")
print(combined.head())
How do you know the fix worked?

After fixing, the row count of the joined result should equal len(temps). If it is smaller, some temperature records had no matching station and were dropped.

Join Key Capitalisation

The following code attempts to join the two files but crashes with an error about a missing column. Work with an LLM to find the column name mismatch and fix it.

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    left_on="Station_ID",
    right_on="station_id",
    how="left",
)
null_province = combined["province"].null_count()
print(f"Rows with null province: {null_province} of {combined.height}")
print(combined.head())
How do you know the fix worked?

Print temps.columns and stations.columns side by side and confirm the key column appears with exactly the same name and capitalisation in both. After fixing, the null province count should be zero or very small.

Detecting Duplicate Station IDs

The following code joins the temperature and station files and prints the row counts. Work with an LLM to extend it so it also checks whether any station_id appears more than once in the station file, and prints those duplicated IDs if found.

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

print(f"Temperature rows: {len(temps)}")

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="left",
)
print(f"Combined rows: {len(combined)}")
# TODO: check whether any station_id appears more than once in the stations file;
# if combined has more rows than temps, duplicates in stations are the cause
How do you know the addition is correct?

If combined has more rows than temps, duplicated station IDs in the station file are the cause. The list of duplicated IDs printed by the new code should account for the extra rows exactly.

Reporting Unmatched Stations

The following code joins the files and computes mean temperature by province, but does not report how many temperature records had no matching station. Work with an LLM to extend it so it prints the null province count and a sample of the unmatched station IDs before computing the group means.

i
import polars as pl

temps = pl.read_csv("ahccd_temp.csv", null_values=[""])
stations = pl.read_csv("ahccd_stations.csv", null_values=[""])

combined = temps.join(
    stations.select(["station_id", "province", "elevation_m"]),
    on="station_id",
    how="left",
)

mean_by_province = (
    combined.drop_nulls(subset=["mean_temp", "province"])
    .group_by("province")
    .agg(pl.col("mean_temp").mean().alias("mean_annual_temp"))
    .sort("mean_annual_temp")
)
print(mean_by_province)
# TODO: before printing mean_by_province, print how many temperature rows
# had a null province after the join, and print a sample of the unmatched
# station_id values so you can see why they did not match
How do you know the addition is correct?

The null count plus the non-null count should equal len(combined). Look at the unmatched station IDs and compare them to the station file to understand why they did not match.