First Charts

Goals

Why Charts Come First

What is Anscombe's quartet, and why does it matter before we compute any statistics?
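Anscombe's quartet is a set of four small datasets with nearly identical summary statistics that look completely different when plotted. As a rough illustration, the sketch below uses the values published in Anscombe's 1973 paper. The `pearson` helper is written here for the example and is not part of the lesson's code.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Sets I to III share the same x values; set IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of y and correlation with x for each of the four sets.
summary = {name: (mean(ys), pearson(xs, ys)) for name, (xs, ys) in quartet.items()}
for name, (mean_y, r) in summary.items():
    print(f"Set {name}: mean y = {mean_y:.2f}, r = {r:.3f}")
```

Every set reports a mean near 7.50 and a correlation near 0.816, yet a scatter plot of each shows a line, a curve, a line with one outlier, and a vertical stack with one outlier. The numbers alone cannot tell them apart.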

What dataset will we use, and what question will we ask?

Drawing the Scatter Plot

Make a scatter plot of earthquake depth on the x axis and magnitude on the y axis, coloured by region, and save it as a PNG.

import polars as pl
import altair as alt

# Treat empty strings as missing, then keep only rows that can be plotted.
df = pl.read_csv("earthquakes.csv", null_values=[""])
df = df.drop_nulls(subset=["depth", "magnitude", "region"])

chart = (
    alt.Chart(df)
    # Semi-transparent points let overlapping earthquakes show as darker areas.
    .mark_point(opacity=0.4, size=20)
    .encode(
        x=alt.X("depth:Q", title="Depth (km)"),
        y=alt.Y("magnitude:Q", title="Magnitude"),
        color=alt.Color("region:N", title="Region"),
        tooltip=["date", "depth", "magnitude", "region"],
    )
    .properties(title="Canadian Earthquakes: Depth vs. Magnitude",
                width=600, height=400)
)
# PNG export in Altair 5 relies on the vl-convert-python package being installed.
chart.save("scatter.png")
print(f"Points in chart: {len(df)}")

Validating the Chart

How do I check that the chart shows all the data I expected?

import polars as pl

df = pl.read_csv("earthquakes.csv", null_values=[""])
print(f"Total rows in file: {len(df)}")

df_plot = df.drop_nulls(subset=["depth", "magnitude"])
print(f"Rows with both columns present: {len(df_plot)}")
print(f"Rows dropped: {len(df) - len(df_plot)}")

Measuring Correlation

Compute the correlation between earthquake depth and magnitude. Drop rows where depth or magnitude is missing.

import polars as pl

df = pl.read_csv("earthquakes.csv", null_values=[""])
df = df.drop_nulls(subset=["depth", "magnitude"])

# pl.corr uses the Pearson method unless a different method= is given.
r = df.select(pl.corr("depth", "magnitude")).item()
print(f"Correlation between depth and magnitude: {r:.3f}")
print(f"(computed from {len(df)} earthquakes with both values present)")

What kind of correlation did you calculate?

What does the correlation coefficient actually tell you?

Iterating on Prompts

The chart is hard to read because the points overlap. Adjust the prompt to improve it.

Check Understanding

Your scatter plot shows 12,000 points but the dataframe has 18,000 rows. The LLM says some rows were dropped because they had missing values. Is this a problem? How would you decide?

It depends on whether the missing values are random or systematic. If earthquakes in one region, or those detected by only a subset of seismograph networks, are more likely to have missing depth data, the chart underrepresents them and any correlation you compute is biased toward the earthquakes with complete data. Ask the LLM to show how many rows are dropped per region and per year, then decide whether the pattern is random or whether it affects your conclusions.

You compute a correlation of 0.12 between depth and magnitude. A classmate says this proves deeper earthquakes are slightly more powerful. What is the correct interpretation?

A correlation of 0.12 is a weak positive relationship. It means deeper earthquakes in this dataset tend very slightly toward larger magnitudes on average, but the association is so weak that depth explains almost none of the variation in magnitude. "Proves" is too strong a word: the correlation is small and may reflect chance variation rather than a physical mechanism.
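One way to put a number on "almost none" is to square the coefficient: for a straight-line fit, r squared is the fraction of the variance in one variable accounted for by the other. A quick check of the arithmetic:

```python
r = 0.12
r_squared = r ** 2

# Share of the variance in magnitude that a straight-line fit on depth accounts for.
print(f"r = {r:.2f}, r squared = {r_squared:.4f} ({r_squared:.1%} of the variance)")
```

A correlation of 0.12 corresponds to about 1.4% of the variance, leaving more than 98% unexplained by depth.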

You ask the LLM to plot depth vs. magnitude, and all the points are compressed into the bottom-left corner. What should you add to the next prompt, and why?

Ask for a log scale, starting with the depth axis. Earthquake depth ranges from near-surface to over 600 km, so a linear scale compresses most events into a small region, while a log scale stretches the smaller values apart so the distribution of common earthquakes is visible alongside the rare deep ones. Magnitude is already a logarithmic measure of energy release, so a log scale on that axis usually adds less.

You ask the LLM to compute the correlation between two variables and it returns -0.03. You expected a strong relationship from the scatter plot. What is a likely explanation?

A correlation near zero with a visible pattern in the scatter plot usually means the relationship is not linear. The scatter plot may show a curved or clustered pattern (for example, two distinct populations of earthquakes at very different depths) that a linear correlation coefficient cannot capture. Ask the LLM to colour the points by a third variable such as region or fault type, or consider fitting a non-linear model.

Exercises

Magnitude Distribution

Plot a histogram of earthquake magnitudes. Describe the shape. Are large earthquakes (magnitude 5 or above) common or rare in the catalog?

BC vs. Quebec

Filter the dataset to BC earthquakes and Quebec earthquakes separately. Compute the mean magnitude and mean depth for each. Do the two regions have different seismic characteristics?

Remove Micro-Earthquakes

The catalog includes many micro-earthquakes (magnitude below 2.0) detected only by sensitive instruments. Filter them out and remake the scatter plot. Does the correlation change meaningfully?

Recent Decade

Filter to earthquakes recorded in the last ten years. Is the density of points different from the full catalog? What might explain the difference (hint: think about improvements to monitoring networks)?

Axes in the Wrong Order

The following code draws a scatter plot of depth and magnitude, but the chart looks odd: the x axis runs from 0 to 9 while the label says "Depth (km)." Work with an LLM to find the mismatch and fix it.

import polars as pl
import altair as alt

df = pl.read_csv("earthquakes.csv", null_values=[""])
df = df.drop_nulls(subset=["depth", "magnitude", "region"])

chart = (
    alt.Chart(df)
    .mark_point(opacity=0.4, size=20)
    .encode(
        x=alt.X("magnitude:Q", title="Depth (km)"),
        y=alt.Y("depth:Q", title="Magnitude"),
        color=alt.Color("region:N", title="Region"),
    )
    .properties(title="Canadian Earthquakes: Depth vs. Magnitude",
                width=600, height=400)
)
chart.save("scatter.png")
print(f"Points in chart: {len(df)}")

How do you know the fix worked?

Depth values should reach hundreds of kilometres along the x axis, while magnitude values along the y axis should stay between 0 and 9. Confirm the axis ranges match the column ranges in the dataframe.

Chart With No Points on One Axis

The following code draws a scatter plot, but the y axis is empty: all points sit at the bottom of the chart with no visible spread. Work with an LLM to explain why and fix the code.

import polars as pl
import altair as alt

df = pl.read_csv("earthquakes.csv", null_values=[""])
df = df.drop_nulls(subset=["depth", "magnitude"])

chart = (
    alt.Chart(df)
    .mark_point(opacity=0.4, size=20)
    .encode(
        x=alt.X("depth:Q", title="Depth (km)"),
        y=alt.Y("Magnitude:Q", title="Magnitude"),
    )
    .properties(title="Canadian Earthquakes: Depth vs. Magnitude",
                width=600, height=400)
)
chart.save("scatter.png")
print(f"Points in chart: {len(df)}")

How do you know the fix worked?

Open the saved PNG and confirm that points are spread vertically. Print df.columns to verify the column name the code uses exactly matches what is in the file.

Counting Instead of Binning

The following code is meant to plot a histogram of earthquake magnitudes, but every bar is exactly the same height. Work with an LLM to explain why and fix the encoding.

import polars as pl
import altair as alt

df = pl.read_csv("earthquakes.csv", null_values=[""])
df = df.drop_nulls(subset=["magnitude"])

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("magnitude", title="Magnitude"),
        y=alt.Y("count()", title="Number of Earthquakes"),
    )
    .properties(title="Earthquake Magnitude Distribution",
                width=400, height=300)
)
chart.save("magnitude_hist.png")
print(f"Rows in data: {len(df)}")

How do you know the fix worked?

Open the chart: bars should be tallest at low magnitudes and shortest at high magnitudes, reflecting the fact that small earthquakes far outnumber large ones.

Making Dense Points Visible

The following code draws the scatter plot without any transparency, so overlapping points form a solid mass in regions with many earthquakes. Work with an LLM to add opacity so individual points are visible through dense clusters.

import polars as pl
import altair as alt

df = pl.read_csv("earthquakes.csv", null_values=[""])
df = df.drop_nulls(subset=["depth", "magnitude", "region"])

chart = (
    alt.Chart(df)
    .mark_point(size=20)  # TODO: add opacity=0.4 so overlapping points are visible
    .encode(
        x=alt.X("depth:Q", title="Depth (km)"),
        y=alt.Y("magnitude:Q", title="Magnitude"),
        color=alt.Color("region:N", title="Region"),
    )
    .properties(title="Canadian Earthquakes: Depth vs. Magnitude",
                width=600, height=400)
)
chart.save("scatter.png")
print(f"Points in chart: {len(df)}")

How do you know the addition worked?

Open the saved PNG and verify that regions with many overlapping points now show lighter areas rather than a solid block of colour. Compare the point count in the chart to len(df) to confirm no points were dropped.