First Steps

Goals

Starting the Notebook

How do I start a notebook and create a first cell?

What should I type in the first cell to make sure everything is working?

Reading the Data

Read a CSV file of monthly climate observations and print its first five rows.

Use Polars to read a CSV file called climate.csv, skip the first row (which is a title line), treat empty cells as missing, and print the first five rows.

import polars as pl

df = pl.read_csv("climate.csv", skip_rows=1, null_values=[""])
print(df.head())

Mean and Median

Compute the mean and median of the monthly mean temperature column.

Using Polars, read climate.csv the same way, then compute and print the mean and median of the 'Mean Temp (°C)' column, excluding missing values.

import polars as pl

df = pl.read_csv("climate.csv", skip_rows=1, null_values=[""])
temps = df["Mean Temp (°C)"].drop_nulls()
print(f"Mean:   {temps.mean():.2f} °C")
print(f"Median: {temps.median():.2f} °C")

Why would the mean and median of a dataset ever be different numbers?

Saving the Notebook

What does it mean to save a notebook, and why does it matter?

Saving Prompts

I am tired of typing "use Polars" and "treat empty cells as missing" in every prompt. Is there a way to set these once?

Check Understanding

You run the cell with read_climate.py and see a column of values like 14567, 15023, -9999. None of these look like temperatures. What are two likely causes, and how would you diagnose each?

The first likely cause is reading the wrong column: the code may be using a station ID or a precipitation total instead of mean temperature. Check the column names printed in the output header to confirm which column was used. The second likely cause is that skip_rows=1 is wrong for this file, so the header row or the first data row was skipped and the values no longer line up with the column names. Print df.head(1) and compare it to the first lines of the raw CSV to see whether the rows line up correctly.

The mean monthly temperature for your station is -2.8 °C but the median is -1.1 °C. What does this tell you about the shape of the data?

The mean is lower than the median, which means the data is skewed toward colder values. A few very cold winter months (perhaps during extreme cold snaps) pull the mean downward while the median stays closer to the centre of the distribution. This is a common pattern for Canadian climate data, where extreme cold events are more severe than extreme warm events at most stations.

A classmate runs their notebook, gets a correct result, then closes it and reopens it the next day. They see the output from yesterday still displayed. They say "the analysis still works." What is wrong with this reasoning?

Jupyter displays the last-run output even after the kernel shuts down. The visible result from yesterday does not mean the code will produce the same result today: it is a snapshot, not a live computation. To confirm the analysis still works, they need to restart the kernel and run all cells from top to bottom.

Your research partner works with data on Canadian home sale prices where a handful of luxury waterfront properties are in the sample. They want to report "the typical sale price." Should they use the mean or the median? Explain your reasoning.

The median. A few multi-million-dollar properties would pull the mean far above what a typical buyer pays. The median is the middle value and is not affected by those extremes, so it better represents what a typical property in the sample sells for.

Exercises

Different Seasons

Compute the mean and median for January temperatures and July temperatures separately. Are they closer together within a season than across the full year? Why might that be?

Plot Over Time

Ask the LLM to make a line chart of mean annual temperature over time (one point per year). Save it as a PNG. Describe in one sentence what the chart shows about temperature change at your station.

Count Missing Values

Ask the LLM to count how many missing values appear in each column. Which columns tend to have more missing data, and why might that be for a weather station dataset?

Compare Two Stations

Download data for a second station in a different region (e.g., one coastal and one inland). Ask the LLM to compute the mean annual temperature for each station and display them side by side. Which station is warmer on average?

Skip the Right Rows

The following code reads the climate CSV and tries to compute mean and median temperature, but the first line of output looks like a column header rather than a real measurement. Work with an LLM to find and fix the problem.

import polars as pl

df = pl.read_csv("climate.csv", skip_rows=0, null_values=[""])
print(df.head())
temps = df["Mean Temp (°C)"].drop_nulls()
print(f"Mean:   {temps.mean():.2f} °C")
print(f"Median: {temps.median():.2f} °C")

How do you know the fix worked?

Check that df.head(1) shows a four-digit year in the Year column and a plausible temperature in Mean Temp (°C), not the strings "Year" or "Mean Temp (°C)".

Right Number, Wrong Column

The following code reads the climate CSV and prints a mean and median, but the values are implausibly large for temperatures in Canada. Work with an LLM to identify which column is actually being used and correct it.

import polars as pl

df = pl.read_csv("climate.csv", skip_rows=1, null_values=[""])
temps = df["Total Precip (mm)"].drop_nulls()
print(f"Mean:   {temps.mean():.2f} °C")
print(f"Median: {temps.median():.2f} °C")

How do you know the fix worked?

For most Canadian stations, mean annual temperature is between -10 and +15 °C. Print the column name alongside the result and confirm it says Mean Temp (°C).

Missing Values Not Declared

The following code reads the climate CSV and computes mean and median, but both values print as None. Work with an LLM to find out why and fix the code.

import polars as pl

df = pl.read_csv("climate.csv", skip_rows=1)
print(f"Temperature column type: {df['Mean Temp (°C)'].dtype}")
temps = df["Mean Temp (°C)"].drop_nulls()
print(f"Non-null rows: {len(temps)}")
print(f"Mean:   {temps.mean()}")
print(f"Median: {temps.median()}")

How do you know the fix worked?

After fixing, df["Mean Temp (°C)"].dtype should show Float64, not String, and both the mean and median should be numbers, not None.

More Than Mean and Median

The following code computes the mean and median monthly temperature. Work with an LLM to extend it so it also prints the minimum value, the maximum value, and the count of non-null observations.

import polars as pl

df = pl.read_csv("climate.csv", skip_rows=1, null_values=[""])
col = "Mean Temp (°C)"
temps = df[col].drop_nulls()
print(f"Mean:   {temps.mean():.2f} °C")
print(f"Median: {temps.median():.2f} °C")
# TODO: also print the minimum value, maximum value, and count of non-null observations

How do you know the additions are correct?

Confirm that the minimum and maximum are consistent with the seasonal range at your station, and that the non-null count plus the null count equals the total number of rows in the dataframe.