Visualization with Altair

Learning Goals

Why Visualize Before You Calculate?

Altair's Grammar

Histograms

Scatter Plots

Box Plots

Log Scales

Jitter

Faceting

"""Altair visualization examples using Prechelt and function-count data."""

import altair as alt
import polars as pl

# Box plot of working hours by language (Prechelt)
prechelt = pl.read_csv("data/jccpprtTR.csv")
chart1 = (
    alt.Chart(prechelt.to_pandas())
    .mark_boxplot()
    .encode(
        x=alt.X("lang:N", title="Language"), y=alt.Y("whours:Q", title="Working Hours")
    )
    .properties(title="Development Time by Language")
)
chart1.save("figures/boxplot.html")

# Log-scale histogram of lines per file
funcs = pl.read_csv("data/py_func_counts.csv")
chart2 = (
    alt.Chart(funcs.to_pandas())
    .mark_bar()
    .encode(
        x=alt.X("lines:Q", bin=alt.Bin(maxbins=30), title="Lines per File"),
        y=alt.Y("count():Q", scale=alt.Scale(type="log"), title="Count (log)"),
    )
    .properties(title="Python File Sizes (log scale)")
)
chart2.save("figures/file_sizes.html")

Check Understanding

What does Anscombe's quartet demonstrate about the relationship between summary statistics and data distributions?

Anscombe's quartet is four datasets that share nearly identical means, variances, and correlations, yet look completely different when plotted. One is linear, one is curved, one has a single outlier that distorts the regression line, and one is nearly a vertical cluster with one distant point. The quartet demonstrates that a handful of numbers can never fully describe a distribution, and that visualization is not optional — it is the first step of any honest analysis.
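The claim about matching summary statistics can be checked directly. The sketch below uses the published values of the quartet and only the Python standard library; the helper name pearson is illustrative, not part of any library used in this lesson.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Per dataset: mean of x, mean of y, sample variance of y, correlation.
summary = {
    name: (mean(xs), mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, vy, r) in summary.items():
    print(f"{name:>3}: mean_x={mx:.2f} mean_y={my:.2f} var_y={vy:.2f} r={r:.3f}")
```

The four printed rows agree to about two decimal places, which is exactly why the numbers alone cannot distinguish the datasets.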

The code below is supposed to display a log Y axis. What is wrong with it, and how do you fix it?
chart = (alt.Chart(df.to_pandas())
         .mark_bar()
         .encode(x="language", y="count()"))
# the user wants a log Y axis
chart = chart.encode(y=alt.Y("count()", scale="log"))

There is one error and one pitfall. The error is that scale="log" is not a valid argument: Altair expects an alt.Scale object, not a bare string. The pitfall is that calling .encode() on an already-built chart does not modify it in place. It returns a new chart that keeps the existing channels and overrides only those explicitly listed. The snippet does reassign the result, so the X encoding survives, but spreading an encoding across two calls makes it easy to lose track of what the final chart contains. The clearer approach is to set the encoding in one place from the start:

chart = (alt.Chart(df.to_pandas())
         .mark_bar()
         .encode(x="language:N",
                 y=alt.Y("count():Q", scale=alt.Scale(type="log"))))

When should you use a log scale instead of a linear scale?

Use a log scale when the data spans several orders of magnitude or has a heavy right tail — situations where most values are small but a few are vastly larger. File sizes, response times, bug counts, and package release counts are typical candidates. A log scale compresses the long tail and spreads out the dense cluster near zero, making the shape of the distribution visible. If your data contains zero, you cannot use a log scale directly; add 1 first or filter out the zeros and explain why.
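The two workarounds for zeros can be sketched in plain Python. The line counts below are hypothetical and stand in for a column such as lines; the same idea carries over to a dataframe filter or a computed column.

```python
import math

# Hypothetical per-file line counts; the zeros stand in for empty files.
line_counts = [0, 0, 3, 12, 48, 150, 2200]

# Option 1: shift every value by one so that zero maps to log10(1) = 0.
shifted = [math.log10(n + 1) for n in line_counts]

# Option 2: drop the zeros, and record how many were dropped so the
# omission can be reported alongside the plot.
positive = [n for n in line_counts if n > 0]
dropped = len(line_counts) - len(positive)
filtered = [math.log10(n) for n in positive]

print(f"shifted:  {[round(v, 2) for v in shifted]}")
print(f"filtered: {[round(v, 2) for v in filtered]} ({dropped} zeros dropped)")
```

Shifting keeps every row but slightly distorts small values; filtering keeps the values exact but changes the sample, which is why the choice should be stated next to the figure.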

What is the difference between color encoding and faceting in Altair, and when would you use each?

Color encoding adds a third variable to a single plot by assigning different colors to different categories. Faceting splits the data into separate sub-plots, one per category, which share the same axes by default. Use color when you have a small number of groups (ideally no more than four or five) and the groups do not overlap too much. Use faceting when the groups overlap badly, when you have more categories than colors you can distinguish, or when you want readers to focus on the shape of each group rather than on comparisons between groups. Shared axes in a faceted plot make cross-group comparison honest; independent axes highlight within-group structure.

Exercises

Linear vs. Log Y Axis

Make two histograms of lines-per-file from the Python function-count dataset: one with a linear Y axis and one with a log Y axis, using the same bin width for both. Write two sentences comparing what each version reveals. The linear version likely shows a large spike at very small file sizes; the log version probably makes the shape of the rest of the distribution visible. Then write a Polars filter that selects only files with fewer than five lines and inspect their contents to check whether they are empty or contain only comments or docstrings.

Scatter Plot with Size Categories

Make a scatter plot of lines-per-file (X axis) versus functions-per-file (Y axis) for the Python dataset. Add a color encoding for a size category column that you compute in Polars: "small" for files with fewer than 50 lines, "medium" for 50 to 500 lines, and "large" for more than 500 lines. Use opacity=alt.value(0.4) to reduce overplotting. Write one sentence describing the pattern you see in the large-file region, and one sentence explaining whether the relationship between lines and functions looks linear across all size categories.

Jittered Strip Plot

Reproduce the Prechelt box plot of working hours by language using Altair. Then layer a jittered strip plot on top of it by adding a second mark_point layer that uses transform_calculate to add a small random horizontal offset to each point. The two layers should share the same data and the same Y axis. Write one sentence about what the individual points reveal that the box plot alone does not show — for example, whether any language has a suspicious cluster of identical values or a single extreme outlier that dominates the box.

Faceted Histogram

Load both the Python and JavaScript function-count datasets, add a language column to each, and concatenate them into a single dataframe. Create a faceted histogram of lines-per-function using 20-line bins, with one panel per language and a shared Y axis. Write one sentence about what the shared Y axis reveals that you would miss if each panel had an independent Y axis scaled to its own data — specifically, whether the two languages have similar absolute counts in each bin or whether one language simply has far more files.

Anscombe's Fifth Example

Find or construct two numerical variables where the Pearson correlation coefficient is between 0.9 and 1.0, but the scatter plot reveals an obvious non-linearity or a single dominant outlier that is driving the correlation. You might use a quadratic relationship sampled at evenly-spaced x values, or a dataset where removing one point drops the correlation below 0.5. Write the Polars code to compute the correlation, produce the scatter plot in Altair, and write two sentences explaining why a researcher who reported only the correlation value would be misleading their readers — even though the number itself is technically correct.