Tables

Goals

What a Dataframe Contains

What is a dataframe, and what does it look like?

Why do different columns have different types, and why does it matter?

Shape, Types, and Sample Rows

Load the HYDAT flow data and tell me how many rows and columns it has, what each column contains, and what the data types are.

or

Using Polars, read flow_data.csv and print: the number of rows and columns, the name and type of each column, and the first two rows.

import polars as pl

df = pl.read_csv("flow_data.csv")

print(f"Rows: {df.height}, Columns: {df.width}")
print()
print(df.schema)
print()
print(df.head(2))

Standard Deviation and Range

Compute the standard deviation and range of the streamflow column.

or

Compute the mean, standard deviation, minimum, maximum, and range of the FLOW column in flow_data.csv.

import polars as pl

df = pl.read_csv("flow_data.csv")

spread = df.select([
    pl.col("FLOW").mean().alias("mean"),
    pl.col("FLOW").std().alias("std_dev"),
    pl.col("FLOW").min().alias("minimum"),
    pl.col("FLOW").max().alias("maximum"),
    (pl.col("FLOW").max() - pl.col("FLOW").min()).alias("range"),
])
print(spread)

Interpreting High Spread

What does a large standard deviation mean for river flow data?

Why is spread as important as the mean?

Checking a Single Station

Filter to one known station and verify its row count against the dataset documentation.

or

Filter flow_data.csv to rows where STATION_NUMBER is '02GA010' and print the number of rows and the range of years covered.

Check Understanding

You load a CSV and see that the YEAR column has type String. When you ask the LLM to filter rows where YEAR is greater than 2000, the filter returns no results. What is wrong and how do you fix it?

Polars compares strings lexicographically, not numerically. "2001" > "2000" happens to give the right answer, but "999" > "2000" is true as a string comparison even though 999 < 2000, because "9" sorts after "2" in character order. More importantly, the filter pl.col("YEAR") > 2000 compares a string column to an integer, which Polars may reject with an error or resolve in a way you did not intend. The fix is to cast the column first: df.with_columns(pl.col("YEAR").cast(pl.Int64)).

The mean monthly flow is 35 m³/s and the standard deviation is 120 m³/s. A classmate says "the data must be wrong because the standard deviation is larger than the mean." Are they right?

No. A standard deviation larger than the mean is common for streamflow data; it happens when the distribution is right-skewed. Most months have low flow near baseflow, but a few months during spring snowmelt have extremely high flows that pull the mean upward while the standard deviation reflects the wide spread. The correct response is to plot the distribution and check whether the high values are physically plausible, not to assume the data is wrong.

You filter to station 02GA010 and get 0 rows. The documentation says this station has records going back to 1912. List two things you would check.

First, check whether the station identifier column is named STATION_NUMBER in the actual CSV: it might be Station_Number, stn_id, or something else entirely. Print df.columns to see the exact names before filtering. Second, check whether the value "02GA010" is stored with the same capitalisation and zero-padding in the data. If the file uses "2GA010" without the leading zero, the filter will not match. Print df["STATION_NUMBER"].head() to see the actual format.

You compute a maximum flow of 99 999 m³/s for one station. The mean is 18 m³/s and all other values are below 2 000 m³/s. Does this suggest a data error? What would you do next?

Yes, 99 999 is a suspicious round number that may be a sentinel value used to indicate an error or a missing data code in HYDAT's historic records. Check the HYDAT data dictionary for what sentinel values are used. Filter to rows where FLOW equals 99 999 and examine those rows for other patterns: a specific year, a specific data quality symbol, or a note column. Do not include that value in any statistical computation until you have confirmed what it represents.

Exercises

Most Variable Rivers

Ask the LLM to compute the coefficient of variation (standard deviation divided by mean) for each station. Which stations have the highest coefficient of variation? Are they in regions you would expect to have highly variable flow?

Seasonal Pattern

Pick one station and ask the LLM to compute the mean flow for each month (January through December) averaged across all years. Plot the result as a bar chart. Does the seasonal pattern match what you would expect for the region?

Stations with Suspicious Values

Ask the LLM to find all rows where FLOW is negative, zero, or greater than 50 000 m³/s. How many rows have each kind of extreme value?

Long-Term Change

For one station with records going back at least fifty years, ask the LLM to compute the mean annual flow for each year and plot it over time. Is there a trend?

Column Forced to the Wrong Type

The following code loads the streamflow data and prints basic statistics, but mean, max, and min all print as None. Work with an LLM to explain why and fix the code.

import polars as pl

df = pl.read_csv("flow_data.csv", schema_overrides={"FLOW": pl.String})

print(f"FLOW column type: {df['FLOW'].dtype}")
print(f"Mean flow: {df['FLOW'].mean()}")
print(f"Max flow:  {df['FLOW'].max()}")
print(f"Min flow:  {df['FLOW'].min()}")

How do you know the fix worked?

After fixing, df["FLOW"].dtype should show Float64. The mean should be a positive number, and the maximum should be larger than the mean.

Sentinel Value Skewing the Statistics

The following code computes spread statistics for the FLOW column, but the reported maximum is an implausibly round large number. Work with an LLM to identify what kind of value this is, remove it, and recompute.

import polars as pl

df = pl.read_csv("flow_data.csv")

spread = df.select([
    pl.col("FLOW").mean().alias("mean"),
    pl.col("FLOW").std().alias("std_dev"),
    pl.col("FLOW").min().alias("minimum"),
    pl.col("FLOW").max().alias("maximum"),
])
print(spread)

How do you know the fix worked?

After removing sentinel values, the maximum should be physically plausible for a Canadian river (check the HYDAT station description for the station's known peak flow). Also confirm that the mean drops noticeably once the sentinel is removed.

Adding Coefficient of Variation

The following code computes mean and standard deviation per station. Work with an LLM to extend it so it also computes and prints the coefficient of variation (standard deviation divided by mean) for each station, sorted from highest to lowest.

import polars as pl

df = pl.read_csv("flow_data.csv")
df = df.filter(pl.col("FLOW").is_not_null() & (pl.col("FLOW") < 50_000))

station_stats = (
    df.group_by("STATION_NUMBER")
    .agg([
        pl.col("FLOW").mean().alias("mean_flow"),
        pl.col("FLOW").std().alias("std_flow"),
    ])
    .sort("std_flow", descending=True)
)
print(station_stats.head(10))
# TODO: add a column "cv" (coefficient of variation) equal to std_flow / mean_flow,
# then re-sort by cv descending and print the top 10

How do you know the addition is correct?

For one station, compute std / mean by hand from the values already printed and confirm it matches the new cv column. Stations in snowmelt-dominated regions should have higher coefficients of variation than groundwater-fed stations.

Column Name Capitalisation

The following code filters to one station and prints its row count, but the output says zero rows were found even though HYDAT records show this station has been active since 1912. Work with an LLM to find the cause and fix it.

import polars as pl

df = pl.read_csv("flow_data.csv")

station = df.filter(pl.col("Station_Number") == "02GA010")
print(f"Rows for station 02GA010: {len(station)}")
if len(station) > 0:
    print(f"Years covered: {station['YEAR'].min()} to {station['YEAR'].max()}")
else:
    print("No rows found.")

How do you know the fix worked?

Print df.columns before filtering to confirm the exact column name. After fixing, the row count should be roughly (current year - 1912) × 12, and the year range should start in or near 1912.