Tracking Change

Goals

Why Change Matters

Why do researchers usually study change over time rather than a single snapshot?

Plotting Over Time

Plot weekly influenza-like illness percentage over time as a line chart.

or

Using Polars and Altair, read fluwatch.csv, plot the weekly ILI percentage over time as a line chart coloured by year, and save it as flu_line.png.

import polars as pl
import altair as alt

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False)
)
df = df.drop_nulls(subset=["ili_pct"])

chart = (
    alt.Chart(df)
    .mark_line(opacity=0.6, strokeWidth=1)
    .encode(
        x=alt.X("week:O", title="Week of Year"),
        y=alt.Y("ili_pct:Q", title="ILI (% of physician visits)"),
        color=alt.Color("year:N", title="Year", legend=None),
        tooltip=["year", "week", "ili_pct"],
    )
    .properties(title="Weekly Influenza-Like Illness Percentage — Canada (FluWatch)",
                width=700, height=300)
)
# PNG export needs a rendering backend installed (for example, the vl-convert-python package)
chart.save("flu_line.png")
print("Saved flu_line.png")

Fitting a Trend Line

Fit a trend line to the annual peak ILI values and show the slope.

or

Using Polars and Altair, compute the maximum weekly ILI percentage for each year, then plot those annual peaks with a regression trend line. Save the chart as flu_trend.png.

import polars as pl
import altair as alt

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False),
    pl.col("year").cast(pl.Int64, strict=False),
)
df = df.drop_nulls(subset=["ili_pct", "year"])

annual_peak = (
    df.group_by("year")
    .agg(pl.col("ili_pct").max().alias("peak_ili"))
    .sort("year")
)

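base = alt.Chart(annual_peak)
points = base.mark_point(filled=True, size=60).encode(
    # zero=False keeps the year axis from starting at 0; format="d" avoids labels like "2,009"
    x=alt.X("year:Q", title="Year", scale=alt.Scale(zero=False), axis=alt.Axis(format="d")),
    y=alt.Y("peak_ili:Q", title="Peak ILI (%)"),
    tooltip=["year", "peak_ili"],
)
# The regression layer needs its own x/y encoding to draw the fitted line
trend = (
    base.transform_regression("year", "peak_ili")
    .mark_line(color="red")
    .encode(x="year:Q", y="peak_ili:Q")
)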

chart = (points + trend).properties(
    title="Annual Peak Flu Season Severity with Trend Line — Canada",
    width=550, height=300,
)
chart.save("flu_trend.png")
print("Saved flu_trend.png")

What a Regression Line Tells You

What does the trend line actually represent?

Seasonal Pattern vs. Long-Term Trend

How do I tell the difference between the winter flu spike and a real long-term change?
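
A rough way to separate the two is to average the same data in two directions: across years for each week number (the seasonal pattern) and across weeks for each year (the long-term level). The sketch below does this on a synthetic series, not FluWatch data; the winter bump, the drift of 0.05 percentage points per year, and all variable names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

# Synthetic weekly series (not FluWatch data): the same winter bump every year
# plus an assumed drift of 0.05 percentage points per year.
DRIFT_PER_YEAR = 0.05
rows = []
for year in range(2010, 2020):
    for week in range(1, 53):
        seasonal = 2.0 + 1.5 * math.cos(2 * math.pi * (week - 4) / 52)
        rows.append((year, week, seasonal + DRIFT_PER_YEAR * (year - 2010)))

by_week = defaultdict(list)
by_year = defaultdict(list)
for year, week, value in rows:
    by_week[week].append(value)
    by_year[year].append(value)

# Seasonal pattern: average each week number across all years.
weekly_profile = {week: mean(values) for week, values in by_week.items()}
# Long-term level: average each year across all of its weeks.
annual_mean = {year: mean(values) for year, values in by_year.items()}

peak_week = max(weekly_profile, key=weekly_profile.get)
seasonal_swing = max(weekly_profile.values()) - min(weekly_profile.values())
total_drift = annual_mean[2019] - annual_mean[2010]

print(f"Typical peak week: {peak_week}")
print(f"Within-year swing: {seasonal_swing:.2f} percentage points")
print(f"Change in annual mean, 2010 to 2019: {total_drift:.2f} percentage points")
```

In this made-up series the within-year swing is several times larger than the ten-year drift, which is why the winter spike dominates a weekly chart even when a long-term change is present.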

Checking the Trend Direction

How do I confirm the trend line is pointing in the right direction?

What is the slope of the regression line in the previous chart?

import polars as pl

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False),
    pl.col("year").cast(pl.Int64, strict=False),
)
df = df.drop_nulls(subset=["ili_pct", "year"])

annual_peak = (
    df.group_by("year")
    .agg(pl.col("ili_pct").max().alias("peak_ili"))
    .sort("year")
)

x = annual_peak["year"].cast(pl.Float64)
y = annual_peak["peak_ili"]
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
print(f"Trend line slope: {slope:.4f} percentage points per year")
if slope > 0:
    print("Direction: upward (flu peaks becoming more severe on average)")
else:
    print("Direction: downward (flu peaks becoming less severe on average)")

Check Understanding

You plot annual FluWatch peaks from 1993 to 2023 and fit a trend line. The line has a slope of +0.12 percentage points per year. A classmate says "this proves Canadian flu seasons are getting worse." What is the correct interpretation, and what is missing?

A slope of +0.12 means Canadian flu peaks have increased by an average of about 0.12 percentage points per year over this period in this dataset. It does not prove flu is getting worse: the trend could reverse, surveillance methodology may have changed, and 0.12 pp per year may or may not be practically meaningful. What is missing: a confidence interval on the slope (which would show whether the trend is distinguishable from zero) and a check on whether any pandemic years (2009, 2020) are distorting the estimate.

The FluWatch line chart shows an enormous spike in 2009 that is much higher than any other year. You fit a trend line to the full dataset. How does this spike affect the slope, and what should you do about it?

The 2009 H1N1 pandemic caused an unusually severe flu season. How that extreme point affects the slope depends on where it sits in the series: a high outlier late in the dataset pulls the trend line upward, one early in the dataset pulls it downward, and one near the middle mainly raises the line's overall level. The right approach is to note the outlier explicitly, fit the trend line with and without 2009, and report both results. A single year driven by a pandemic is qualitatively different from a severe seasonal flu and should not silently drive a long-term trend estimate.

A classmate fits a trend line to raw weekly ILI percentages over all years and concludes "there is a slight upward trend in flu rates." What is wrong with this analysis?

Fitting a trend line to raw weekly data mixes the seasonal pattern with any long-term trend. If surveillance weeks shifted slightly over time (for example, more winter-week data in recent years due to expanded reporting), the raw weekly series will slope upward for reasons unrelated to actual flu severity. The correct approach is to compare annual peaks or annual means (i.e., comparable points across years) rather than raw weekly values.

You ask the LLM to compute the slope of the trend line and it returns -0.09. But when you look at the chart, the line clearly slopes upward to the right. What should you do?

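A negative slope alongside an upward-sloping chart means the number and the picture do not describe the same fit. Simply swapping the axes (fitting year as a function of ILI rather than ILI as a function of year) changes the slope's magnitude and units but not its sign, so look for other causes: the LLM may have used a different column, used a different subset of rows, or reported the slope of a different variable. Ask the LLM: "In the regression, which variable is x (the predictor), which is y (the response), and which rows were used?" Do not report the slope until the sign matches what you see in the chart.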

Exercises

Pandemic Year

Remove the 2009 pandemic year from the annual peak data and refit the trend line. Does the slope change meaningfully? Report both slopes and explain what the comparison shows.

Regional Trends

FluWatch data includes provincial and regional breakdowns. Pick two provinces and plot their annual peaks on the same chart. Are their trends parallel, or does one province show a stronger change than the other?

Early vs. Late Period

Split the data at 2010 and compute a trend line for the period before 2010 and a separate one for the period after. Are the slopes similar or different? What might explain a difference?

Seasonality Index

For each week number (1 through 52), compute the mean ILI percentage across all years. Plot the result as a line chart. In which week does flu typically peak in Canada?

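Slope With the Wrong Magnitude

The following code is meant to compute the slope of the trend in annual flu peaks, but the printed value does not match the line in flu_trend.png: its magnitude is off, and it is not in percentage points per year. (Swapping variables in a simple regression cannot flip the sign, so the direction still agrees with the chart.) Work with an LLM to find the error in the regression and fix it.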

import polars as pl

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False),
    pl.col("year").cast(pl.Int64, strict=False),
)
df = df.drop_nulls(subset=["ili_pct", "year"])

annual_peak = (
    df.group_by("year")
    .agg(pl.col("ili_pct").max().alias("peak_ili"))
    .sort("year")
)

x = annual_peak["peak_ili"]
y = annual_peak["year"].cast(pl.Float64)
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
print(f"Trend line slope: {slope:.4f}")
if slope > 0:
    print("Direction: upward (flu peaks becoming more severe on average)")
else:
    print("Direction: downward (flu peaks becoming less severe on average)")

How do you know the fix worked?

The sign of the slope should match the visual direction of the trend line in flu_trend.png. Also check that the units make sense: the slope should be in percentage points per year, so a value like +0.05 is plausible, but +200 is not.

Mean Instead of Peak

The following code is meant to plot the annual peak ILI percentage, but the values are noticeably lower than the peaks visible in the weekly line chart. Work with an LLM to identify the wrong aggregation and fix it.

import polars as pl
import altair as alt

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False),
    pl.col("year").cast(pl.Int64, strict=False),
)
df = df.drop_nulls(subset=["ili_pct", "year"])

annual_peak = (
    df.group_by("year")
    .agg(pl.col("ili_pct").mean().alias("peak_ili"))
    .sort("year")
)

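base = alt.Chart(annual_peak)
points = base.mark_point(filled=True, size=60).encode(
    x=alt.X("year:Q", title="Year", scale=alt.Scale(zero=False), axis=alt.Axis(format="d")),
    y=alt.Y("peak_ili:Q", title="Peak ILI (%)"),
)
# The regression layer needs its own x/y encoding to draw the fitted line
trend = (
    base.transform_regression("year", "peak_ili")
    .mark_line(color="red")
    .encode(x="year:Q", y="peak_ili:Q")
)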
(points + trend).properties(title="Annual Peak Flu Season Severity",
                             width=550, height=300).save("flu_trend.png")
print(f"Annual peak range: {annual_peak['peak_ili'].min():.2f} to {annual_peak['peak_ili'].max():.2f}")

How do you know the fix worked?

For each year, the peak value should be at least as large as any individual weekly value that year. Pick one year from the raw data, find the highest weekly ILI value yourself, and confirm the annual summary matches.

Trend on Raw Weekly Data

The following code fits a trend line and prints the number of data points used, but the count is far higher than the number of years in the dataset. Work with an LLM to explain why and rewrite the code to fit the trend to annual peaks instead.

import polars as pl

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False),
    pl.col("year").cast(pl.Int64, strict=False),
)
df = df.drop_nulls(subset=["ili_pct", "year"])

x = df["year"].cast(pl.Float64)
y = df["ili_pct"]
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

print(f"Data points used: {len(df)}")
print(f"Trend line slope: {slope:.4f} percentage points per year")
if slope > 0:
    print("Direction: upward (flu becoming more severe on average)")
else:
    print("Direction: downward (flu becoming less severe on average)")

How do you know the fix worked?

After fixing, the number of data points used should equal the number of distinct years in fluwatch.csv. Compare the slope from the corrected version to the slope from check_slope.py.

Before and After Removing the Pandemic Year

The following code computes the trend slope using all available years. Work with an LLM to extend it so it also computes and prints the slope after removing the 2009 pandemic year from the annual peak data.

import polars as pl

PANDEMIC_YEAR = 2009

df = pl.read_csv("fluwatch.csv", null_values=[""])
df = df.with_columns(
    pl.col("ili_pct").cast(pl.Float64, strict=False),
    pl.col("year").cast(pl.Int64, strict=False),
)
df = df.drop_nulls(subset=["ili_pct", "year"])

annual_peak = (
    df.group_by("year")
    .agg(pl.col("ili_pct").max().alias("peak_ili"))
    .sort("year")
)

x = annual_peak["year"].cast(pl.Float64)
y = annual_peak["peak_ili"]
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
print(f"Slope (all years): {slope:.4f} percentage points per year")
# TODO: refit the slope after removing PANDEMIC_YEAR from annual_peak,
# then print both slopes so the reader can compare them

How do you know the addition is correct?

Print the number of rows used in each regression to confirm the second fit uses one fewer year. If removing 2009 changes the slope noticeably, explain in one sentence why that year has such a large effect on the estimate.