Data Science

Filter Boundary Condition

Run the script and count the rows returned. Then count by hand how many rows in the CSV should satisfy the condition. Do the two counts agree?

import polars as pl

# Keep only rows where value EXCEEDS (strictly greater than) the threshold.
THRESHOLD = 50.0

df = pl.read_csv("filterboundary.csv")

result = df.filter(pl.col("value") >= THRESHOLD)
print(result)
filterboundary.csv:
product,value
alpha,30.0
beta,50.0
gamma,70.0
delta,50.0
epsilon,90.0
Explanation:

The bug is using >= where the comment asks for strictly greater than, so the filter keeps the rows that sit exactly on the threshold (beta and delta, value 50.0) and should be dropped. Teaches how to verify filter logic by checking boundary values and writing comparison operators explicitly.

Multi-Line CSV Header

Run the script and look at the row count and the first few rows of the DataFrame. Do they match the data you expected to load?

import polars as pl

# Load survey results and report the number of respondents.

df = pl.read_csv("multilinecsv.csv")
print(f"Rows: {len(df)}")
print(df.head())
multilinecsv.csv:
Source: Annual Survey 2023
Units: thousands
name,count,value
Alice,10,100
Bob,20,200
Carol,15,150
Explanation:

The bug is not passing skip_rows to skip the extra header lines, so Polars reads the multi-line header as data and reports the wrong number of rows. Teaches how to inspect the first few rows of a DataFrame with .head() and how to use skip_rows and has_header to handle non-standard file layouts.

Case-Sensitive Column Name in Join

Run the script and examine the mean_amount column in the result. Are there any null values where you did not expect them?

import polars as pl

sales = pl.DataFrame({
    "Region": ["North", "South", "East", "West", "North", "South"],
    "amount": [100, 200, 150, 175, 120, 210],
})

# Compute per-region mean sales.
means = sales.group_by("Region").agg(
    pl.col("amount").mean().alias("mean_amount")
)

means = means.with_columns(pl.col("Region").str.to_lowercase())

result = sales.join(means, on="Region", how="left")
print(result)
Explanation:

The bug is that the join key's values differ in case: .str.to_lowercase() turns "North" into "north" in the means frame, and string matching in joins is case-sensitive, so no keys match and every row's mean_amount is null. Teaches that string comparisons in Polars joins are case-sensitive and how to diagnose null-filled join results.

Dates Read as Strings

Run the script and check the schema of the DataFrame. What type does Polars assign to the date column? How many rows does the filter return?

import polars as pl

CUTOFF = "2024-06-01"

df = pl.read_csv("datestring.csv")

result = df.filter(pl.col("date") > CUTOFF)
print(result)
datestring.csv:
event,date,count
launch,2024-03-01,42
review,2024-06-15,18
release,2024-09-30,75
followup,2024-12-01,31
Explanation:

The bug is that Polars read the date column as strings, so the comparison is lexicographic rather than chronological. Zero-padded ISO-8601 dates happen to sort correctly as text, so this script works by accident; any other format (e.g., 6/15/2024) would silently return the wrong rows. Teaches how to inspect inferred column types with .schema, and how to cast the column to pl.Date (e.g., with .str.to_date()) before filtering.

Aggregation Order Error

Run the script and compare the output totals to the values in the CSV file. Do the per-region totals make sense?

import polars as pl

df = pl.read_csv("aggorder.csv")

# Compute total sales per region.

result = (
    df.select(pl.col("sales").sum())
    .group_by(pl.lit("all"))
    .agg(pl.col("sales").sum())
)
print(result)
aggorder.csv:
region,product,sales
North,widget,100
North,gadget,200
South,widget,150
South,gadget,300
East,widget,120
East,gadget,180
Explanation:

The bug is calling .sum() before .group_by(): the select collapses the whole column to a single grand total (1050), and the group_by then operates on that one-row DataFrame, so the output reports one overall total instead of per-region totals. Teaches the importance of operation order in lazy and eager pipelines and how to verify intermediate results.

Lazy Evaluation Defers Errors

Run the script and read the error message and traceback. Which step in the pipeline does the error appear to come from? Is that where the mistake actually is?

import polars as pl

data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [88, 72, 95, 61],
    "label": ["A", "B", "A", "C"],
})

# Build a lazy pipeline: rename "score" to "points", filter, then select.
result = (
    data.lazy()
    .rename({"score": "points"})
    .filter(pl.col("points") > 70)
    .select(["id", "score", "label"])
    .collect()
)
print(result)
Explanation:

The bug is referencing a column that was renamed in an earlier step, so a ColumnNotFoundError is raised at .collect() time rather than when the transformation is written. Teaches how Polars lazy evaluation defers errors and how to use .collect() on intermediate steps to locate the failing transformation.

Wrong CSV Delimiter

Run the script and examine the column names and values in the combined DataFrame. Are the columns what you expected?

import polars as pl

df_a = pl.read_csv("wrongdelim_a.csv")

df_b = pl.read_csv("wrongdelim_b.csv")

combined = pl.concat([df_a, df_b], how="diagonal")
print(f"Columns: {combined.columns}")
print(combined)
wrongdelim_a.csv:
name,age,score
Alice,30,88
Bob,25,72
wrongdelim_b.csv:
name;age;score
Carol;28;95
Dave;35;61
Explanation:

The bug is that the second file uses semicolons as delimiters, so Polars reads each of its rows as a single column named "name;age;score". With how="diagonal", concat fills the non-overlapping columns with nulls, so the result has four columns instead of three and half the cells are null. Teaches how to check column names and counts before concatenating DataFrames.

Sentinel Values Mistaken for Data

Run the script and look at the mean. Then inspect the raw data. Are there any values in the column that seem unusually large?

import polars as pl

# 999 is a sentinel value meaning "no response recorded"; treat as missing.
SENTINEL = 999

df = pl.read_csv("sentinel.csv")

clean = df.with_columns(
    pl.col("response_time").fill_null(None)
)
print(clean["response_time"].mean())
sentinel.csv:
participant,age,response_time
P001,24,320
P002,31,999
P003,19,415
P004,45,999
P005,27,280
P006,38,999
P007,22,510
Explanation:

The bug is that the dataset uses 999 as a sentinel for missing data rather than a true null, so .fill_null() is a no-op (there are no nulls to fill) and the mean is skewed by what look like valid large values. Teaches how to identify domain-specific sentinel values and replace them with nulls (e.g., via pl.when(...).then(None)) before analysis.

Whitespace in Group Keys

Run the script and count the number of groups produced. Is it more than you expected? Call .unique() on the grouping column and examine what you see.

import polars as pl

df = pl.read_csv("whitespace.csv")

result = df.group_by("region").agg(pl.col("sales").sum())
print(result)
whitespace.csv:
region,sales
North ,100
North,120
South,200
 South,180
East,150
East ,160
Explanation:

The bug is that a string column has inconsistent whitespace (e.g., "North " and "North" are treated as different groups), so group_by followed by agg produces more groups than expected. Teaches how to inspect unique values with .unique(), use .str.strip_chars() to normalize strings before grouping, and verify group counts.

Rolling Window min_periods

Run the script and count the null values in the rolling_mean column. Is the number of null rows what you expected for a 7-day window?

import polars as pl

WINDOW = 7  # days

df = pl.read_csv("rolling.csv")

result = df.with_columns(
    pl.col("value")
    .rolling_mean(window_size=WINDOW)
    .alias("rolling_mean")
)
print(result)
rolling.csv:
day,value
1,10
2,12
3,9
4,14
5,11
6,13
7,15
8,12
9,10
10,16
Explanation:

The bug is passing window_size=7 without setting min_periods=1, so any window that cannot be fully filled returns null and the result has far more nulls than expected. Teaches how rolling aggregations handle incomplete windows and how to choose between strict and lenient behavior with min_periods.

Missing Quantitative Encoding Type

Run the script and open the saved chart in a browser. Do all the bars have heights that reflect the value column?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "category": ["A", "B", "C", "D"],
    "value": [10, 40, 25, 60],
}).to_pandas()

chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("category"),
    y=alt.Y("value"),
)
chart.save("quanttype.html")
Explanation:

The bug is encoding the y-axis with alt.Y("value") without an explicit type. When Altair cannot infer a quantitative type for the column, it falls back to a nominal encoding, and the bar heights no longer reflect the values. Teaches how Altair infers encoding types and why specifying the type explicitly (e.g., "value:Q") avoids silent misinterpretation.

Color Scale from String Column

Run the script and open the saved chart. Does the color scale appear as a continuous gradient, or as a discrete set of colors?

import altair as alt
import polars as pl

# Simulated dataset where "temperature" was read from a CSV as strings.
data = pl.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 20, 15, 30, 25],
    "temperature": ["1.2", "2.4", "3.5", "4.1", "5.0"],
}).to_pandas()

chart = alt.Chart(data).mark_point().encode(
    x="x:Q",
    y="y:Q",
    color="temperature",
)
chart.save("colorscale.html")
Explanation:

The bug is that the color column was read as a string (e.g., "3.5") rather than a float, so Altair applies a nominal color scale and the scatter plot shows a discrete legend with arbitrary colors instead of a continuous gradient. Teaches how data types in the source DataFrame determine Altair's default encoding choices.

Nominal vs. Temporal Encoding

Open the saved chart in a browser. Are the months arranged in chronological order along the x-axis, or in a different order?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"],
    "sales": [120, 95, 140, 110, 160],
}).to_pandas()

chart = alt.Chart(data).mark_line(point=True).encode(
    x=alt.X("month:N"),
    y=alt.Y("sales:Q"),
)
chart.save("temporal.html")
Explanation:

The bug is encoding the x-axis as type="nominal" instead of type="temporal". A nominal axis orders and spaces the points as categories rather than as moments in time; these zero-padded month strings happen to sort chronologically, but other date formats would not, and the even categorical spacing misrepresents any irregular gaps between dates. Teaches the difference between nominal and temporal encoding in Altair and how to verify axis ordering.

Altair Filter on Wrong Field

Open the saved chart in a browser. Does it show only the categories whose count is 100 or more, or does it show all of them?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "category": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"],
    "count": [50, 120, 30, 200, 80, 15, 175, 60, 90, 140, 25],
}).to_pandas()

# Show only categories with count >= 100.

chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("category:N"),
    y=alt.Y("count:Q"),
).transform_filter(
    alt.datum.Count >= 100
)
chart.save("filterfield.html")
Explanation:

The bug is a field-name mismatch in the filter expression: alt.datum.Count does not match the count column, and Vega-Lite raises no error for a reference to a missing field, so the filter does not select the intended rows and the chart fails to show only the categories with count >= 100. Teaches how to debug Altair transforms by inspecting the chart's JSON specification (chart.to_dict()) and checking that field names exactly match the data source.

Tooltip Field with Spaces

Open the saved chart in a browser and hover over a point. Does the Sales Region field in the tooltip show a value?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "Product Name": ["Widget", "Gadget", "Doohickey"],
    "Sales Region": ["North", "South", "East"],
    "revenue": [1200, 800, 950],
}).to_pandas()

chart = alt.Chart(data).mark_point().encode(
    x=alt.X("revenue:Q"),
    tooltip=["Product Name:N", "Sales Region", "revenue:Q"],
)
chart.save("tooltip.html")
Explanation:

The bug is that the tooltip field name has a space in it (e.g., "Sales Region") but is referenced without quoting in the Altair shorthand string, so the tooltip shows null for that field even though the data contains values. Teaches how Altair shorthand handles special characters and when to use alt.Tooltip(field=..., title=...) instead.

Polars DataFrame in Altair

Run the script and open the saved chart in a browser. Does the chart show any data points?

import altair as alt
import polars as pl

df = pl.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 25, 15, 30, 20],
})

chart = alt.Chart(df).mark_line().encode(
    x="x:Q",
    y="y:Q",
)
chart.save("polarsinaltair.html")
Explanation:

The bug is passing the Polars DataFrame directly to alt.Chart(), which older Altair versions do not accept, so the chart renders blank. (Altair 5 can consume Polars frames via the DataFrame interchange protocol.) Teaches which data formats Altair accepts natively and how to convert between Polars and the formats Altair supports.

Spurious Perfect Correlation

Run the script and note the correlation value. Then examine how metric_a and metric_b are constructed. Should they really be perfectly correlated?

import polars as pl

data = pl.DataFrame({
    "base": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

data = data.with_columns(
    (pl.col("base") * 2.5).alias("metric_a"),
    (pl.col("base") * 2.5 + 10.0).alias("metric_b"),
)

corr = data.select(pl.corr("metric_a", "metric_b")).item()
print(f"Correlation: {corr}")
Explanation:

The bug is that metric_b is a linear function of metric_a (both are scaled copies of the same base column), and Pearson correlation is invariant under linear transformations, so the correlation is exactly 1.0 for columns that should not be perfectly correlated. Teaches how to audit column provenance in a pipeline and use scatter plots to sanity-check correlation claims.

Memory from Chunk Accumulation

Run the script on a large file and watch how memory usage changes as the script runs. Does memory stay roughly constant, or does it grow?

import polars as pl

CHUNK_SIZE = 2  # rows per chunk

chunks = []
for chunk in pl.read_csv_batched("chunkaccum.csv", batch_size=CHUNK_SIZE):
    chunks.append(chunk)

result = pl.concat(chunks)
print(result["value"].sum())
chunkaccum.csv:
id,value
1,10
2,20
3,30
4,40
5,50
Explanation:

The bug is accumulating all chunks in memory before concatenating rather than processing each chunk and writing results incrementally, so the pipeline runs out of memory on large files. Teaches streaming versus batch processing patterns and how to use Polars' scan_csv with lazy evaluation to avoid loading the full file.

Float Year in Faceted Chart

Open the saved chart in a browser. How many facet panels does it show? Inspect the type of the year column in the DataFrame.

import altair as alt
import polars as pl

data = pl.DataFrame({
    "region": ["North", "South", "East", "North", "South", "East"],
    "year": [2022.0, 2022.0, 2022.0, 2023.0, 2023.0, 2023.0],
    "sales": [100, 150, 120, 130, 160, 140],
}).to_pandas()

chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("region:N"),
    y=alt.Y("sales:Q"),
).facet(
    facet="year:N",
    columns=2,
)
chart.save("floatyear.html")
Explanation:

The bug is that the year column contains floats (e.g., 2022.0) because Polars inferred it as Float64, so the facet treats each float as a nominal value and the panel labels read "2022.0" and "2023.0" instead of clean integer years; mixed representations of the same year would also split into separate panels. Teaches how to cast integer-like columns to pl.Int32 before charting and how to verify facet behavior with a small sample.

Out-of-Order Notebook Cells

Run this script from top to bottom. Does it raise an error? Which line causes the error?

import polars as pl


raw = pl.read_csv("filterboundary.csv")

summary = clean.group_by("product").agg(pl.col("value").mean())  # noqa: F821

clean = raw.filter(pl.col("value") > 40)

print(summary)
Explanation:

The bug is that the notebook's cells were executed out of order, leaving a clean DataFrame in memory that masked the ordering error. In a stale notebook session the code appears to work, but run top to bottom as a script it raises a NameError, because clean is used one line before it is defined. Teaches why notebooks must be tested by restarting the kernel and running all cells in order, and how to structure pipelines so each step depends only on its explicit inputs.