Telling Stories

Goals

Matching Chart Type to Question

How do I decide which type of chart to make?

What question will we ask?

Building a Faceted Chart

Make a bar chart of mean percentage at Level 3 or 4 in Grade 3 reading by board type, with one panel per school language.

or

Using Polars and Altair, read eqao_school_results.csv. Make a bar chart of mean Grade 3 reading percentage (level 3 or 4) by board type, with a separate panel for English and French schools. Save it as reading_chart.png.

import polars as pl
import altair as alt

df = pl.read_csv("eqao_school_results.csv", null_values=[""])

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("board_type:N", title="Board Type"),
        y=alt.Y("mean(grade3_reading_pct):Q",
                title="Mean % at Level 3 or 4 (Grade 3 Reading)"),
        column=alt.Column("school_language:N", title="School Language"),
        color=alt.Color("board_type:N", legend=None),
        tooltip=["board_type", "school_language", "mean(grade3_reading_pct)"],
    )
    .properties(title="Mean Grade 3 Reading Score by Board Type and Language",
                width=200, height=250)
)
# PNG export relies on an image backend such as the vl-convert-python package.
chart.save("reading_chart.png")
print("Saved reading_chart.png")

Reading the Chart

What does the chart show, and what does it not show?

Why Correlation Is Not Causation

The chart shows Catholic schools score slightly higher. Does that prove board type causes better scores?

Fixing a Misleading Axis

The y axis starts at 55 instead of 0. Fix the chart so the differences are not visually exaggerated.

or

Remake the reading chart but set the y axis to start at 0 so the bar heights honestly show the magnitude of the differences.

import polars as pl
import altair as alt

df = pl.read_csv("eqao_school_results.csv", null_values=[""])

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("board_type:N", title="Board Type"),
        y=alt.Y("mean(grade3_reading_pct):Q",
                title="Mean % at Level 3 or 4 (Grade 3 Reading)",
                scale=alt.Scale(domainMin=0)),
        column=alt.Column("school_language:N", title="School Language"),
        color=alt.Color("board_type:N", legend=None),
        tooltip=["board_type", "school_language", "mean(grade3_reading_pct)"],
    )
    .properties(title="Mean Grade 3 Reading Score by Board Type and Language (y-axis from 0)",
                width=200, height=250)
)
chart.save("reading_chart_fixed.png")
print("Saved reading_chart_fixed.png")

Validating Each Panel

How do I confirm that each panel in the faceted chart contains the data I expect?

or

For each combination of board type and school language, print the number of schools and the mean reading percentage.

import polars as pl

df = pl.read_csv("eqao_school_results.csv", null_values=[""])

panel_counts = (
    df.group_by(["board_type", "school_language"])
    .agg([
        pl.len().alias("n_schools"),
        pl.col("grade3_reading_pct").mean().round(1).alias("mean_reading_pct"),
    ])
    .sort(["school_language", "board_type"])
)
print(panel_counts)

Check Understanding

You want to show how Ontario's mean Grade 3 reading score has changed over the years for which EQAO results have been published. Which chart type should you use, and why?

A line chart with year on the x axis and mean score on the y axis. A line chart is appropriate for showing change over time because the connected line signals continuity and directionality. A bar chart could also work but would not convey the same sense of trajectory. A scatter plot would work but lacks the visual connection between years.

A classmate makes a bar chart where the y axis runs from 60 to 68. The Catholic bar reaches 67 and the Public bar reaches 64. They say "Catholic schools perform 5% better." What is wrong?

Two errors. First, 67 minus 64 is 3 percentage points, not 5%. Second, on a scale from 0 to 100, a 3-point difference is 3% of the full scale, but the y axis starting at 60 makes the Catholic bar look nearly twice as tall as the Public bar. The correct description is "Catholic schools had a mean reading score 3 percentage points higher in this dataset," and the chart should start at 0 to make the difference look proportional.
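The arithmetic behind this answer can be checked directly. The sketch below uses only the numbers from the question (67, 64, and an axis starting at 60); the variable names are illustrative.

```python
catholic, public, axis_start = 67, 64, 60

# Difference in percentage points: the quantity the chart actually shows.
point_gap = catholic - public

# Relative difference: a different quantity, expressed as a percent of the Public value.
relative_gap_pct = point_gap / public * 100

# Ratio of the true values versus the ratio of the drawn bar heights.
true_ratio = catholic / public
drawn_ratio = (catholic - axis_start) / (public - axis_start)

print(f"gap: {point_gap} percentage points")
print(f"relative gap: {relative_gap_pct:.1f}%")
print(f"true ratio: {true_ratio:.2f}, drawn ratio: {drawn_ratio:.2f}")
```

The drawn bars differ by a factor of 1.75 even though the underlying values differ by a factor of about 1.05. The relative gap works out to roughly 4.7%, which is not the same thing as percentage points, so any claim should say which of the two it means.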

The data shows that schools in boards with higher per-pupil funding have higher mean reading scores. A classmate says "more money causes better scores." What alternative explanations should you consider?

Higher-income municipalities generate more local property tax revenue, which funds higher per-pupil spending and also tends to mean students come from families with more resources, more stability, and more access to books and enrichment activities. Both the funding and the scores may be driven by the same underlying factor (community wealth) rather than one causing the other. Establishing that funding itself causes higher scores would require comparing schools that received different funding levels for reasons unrelated to community wealth.
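A small simulation can make the confounding argument concrete. This is a toy model under stated assumptions, not an analysis of real data: `wealth`, `funding`, and `scores` are simulated, and scores are generated from wealth alone, so funding has no causal effect by construction.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Community wealth drives both funding and scores.
wealth = [rng.gauss(0, 1) for _ in range(2000)]
funding = [w + rng.gauss(0, 0.5) for w in wealth]
scores = [w + rng.gauss(0, 0.5) for w in wealth]  # funding is not an input here

r = pearson(funding, scores)
print(f"correlation between funding and scores: {r:.2f}")
```

Funding and scores come out strongly correlated even though, in this toy model, changing funding would change nothing about scores. A correlation in the real data is therefore compatible with both explanations.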

You check the panel validation table and find that one cell contains only 2 schools with a mean reading percentage of 91%. Should this bar appear in the published chart? What should you do?

A mean based on 2 schools is unreliable: a single school with an unusual student population can move the mean by many points. Either suppress bars that fall below a minimum threshold (say, 10 schools) and state the cutoff in a note on the chart, or keep them but flag them visibly as based on very few schools. EQAO itself suppresses results for small schools for exactly this reason; you should follow the same practice.

Exercises

Histogram of Reading Scores

Plot a histogram of the Grade 3 reading percentage across all schools. Describe the shape of the distribution. Are there any schools with unusually high or low scores?

Math vs. Reading

Make a scatter plot of Grade 3 reading percentage on the x axis and Grade 3 math percentage on the y axis, coloured by board type. Compute the correlation. Do schools that perform well in reading tend to perform well in math?

Urban vs. Rural

If the dataset includes a municipality or region column, compare mean reading scores between schools in large cities and schools in smaller communities. Does the pattern match what you would expect?

Chart Comparison

Make three versions of the mean reading score by board type chart: one starting the y axis at 0, one starting at 50, and one starting at the minimum mean value. Write one sentence describing how the visual impression changes in each version.

Exaggerated Differences

The following code draws the reading score chart with a y axis that starts at 55 instead of 0, making small differences look much larger than they are. Work with an LLM to fix the axis so the bars are drawn to an honest scale.

import polars as pl
import altair as alt

df = pl.read_csv("eqao_school_results.csv", null_values=[""])

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("board_type:N", title="Board Type"),
        y=alt.Y("mean(grade3_reading_pct):Q",
                title="Mean % at Level 3 or 4",
                scale=alt.Scale(domain=[55, 75])),
        column=alt.Column("school_language:N", title="School Language"),
        color=alt.Color("board_type:N", legend=None),
    )
    .properties(title="Mean Grade 3 Reading Score by Board Type and Language",
                width=200, height=250)
)
chart.save("reading_chart.png")
print("Saved reading_chart.png")

How do you know the fix worked?

Open both the original and fixed PNGs side by side. In the fixed version, the Catholic and Public bars should look similar in height if their mean scores differ by only a few percentage points on a 0-to-100 scale.

Facets on the Wrong Variable

The following code is meant to show one panel per school language, but the panels are labelled with board types instead. Work with an LLM to find the wrong column and fix it.

import polars as pl
import altair as alt

df = pl.read_csv("eqao_school_results.csv", null_values=[""])

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("board_type:N", title="Board Type"),
        y=alt.Y("mean(grade3_reading_pct):Q",
                title="Mean % at Level 3 or 4"),
        column=alt.Column("board_type:N", title="Board Type"),
        color=alt.Color("board_type:N", legend=None),
    )
    .properties(title="Mean Grade 3 Reading Score by Board Type and Language",
                width=200, height=250)
)
chart.save("reading_chart.png")
print("Saved reading_chart.png")

How do you know the fix worked?

The fixed chart should have panels labelled "English" and "French," not "Public" and "Catholic." Check that each panel contains bars for both board types.

Colour Scale for a Continuous Variable

The following code draws a scatter plot of reading vs. math scores coloured by reading percentage, but the legend shows dozens of discrete colour swatches instead of a smooth gradient. Work with an LLM to find the encoding error and fix it.

import polars as pl
import altair as alt

df = pl.read_csv("eqao_school_results.csv", null_values=[""])
df = df.drop_nulls(subset=["grade3_reading_pct", "grade3_math_pct"])

chart = (
    alt.Chart(df)
    .mark_point(size=30, opacity=0.5)
    .encode(
        x=alt.X("grade3_reading_pct:Q", title="Grade 3 Reading (%)"),
        y=alt.Y("grade3_math_pct:Q", title="Grade 3 Math (%)"),
        color=alt.Color("grade3_reading_pct:N", title="Reading Score"),
    )
    .properties(title="Reading vs. Math by School",
                width=400, height=400)
)
chart.save("reading_math.png")
print("Saved reading_math.png")

How do you know the fix worked?

After fixing, the legend should show a continuous colour gradient from low to high reading scores. Print df["grade3_reading_pct"].dtype and confirm it is a numeric type such as Float64, not String, so that the column can be treated as a quantitative variable.

Showing School Counts on the Bars

The following code draws a bar chart of mean reading scores per board type per language panel. Work with an LLM to extend it so the number of schools in each group appears as a text label on top of each bar.

import polars as pl
import altair as alt

df = pl.read_csv("eqao_school_results.csv", null_values=[""])

summary = (
    df.group_by(["board_type", "school_language"])
    .agg([
        pl.col("grade3_reading_pct").mean().alias("mean_reading"),
        pl.len().alias("n_schools"),
    ])
)

chart = (
    alt.Chart(summary)
    .mark_bar()
    .encode(
        x=alt.X("board_type:N", title="Board Type"),
        y=alt.Y("mean_reading:Q", title="Mean % at Level 3 or 4",
                scale=alt.Scale(domain=[0, 100])),
        column=alt.Column("school_language:N", title="School Language"),
        color=alt.Color("board_type:N", legend=None),
    )
    .properties(width=200, height=250)
)
chart.save("reading_chart.png")
print("Saved reading_chart.png")
# TODO: add a text mark showing n_schools on top of each bar

How do you know the addition is correct?

Compare the text labels to the n_schools values already in summary. For one group, count the matching rows in the raw CSV yourself and confirm the label is right.