Data Science
Filter Boundary Condition
Run the script and count the rows returned. Then count by hand how many rows in the CSV should satisfy the condition. Do the two counts agree?
import polars as pl
# Keep only rows where value EXCEEDS (strictly greater than) the threshold.
THRESHOLD = 50.0
df = pl.read_csv("filterboundary.csv")
result = df.filter(pl.col("value") >= THRESHOLD)
print(result)
product,value
alpha,30.0
beta,50.0
gamma,70.0
delta,50.0
epsilon,90.0
Explanation
The bug is using >= instead of > (or vice versa) in the filter expression, so
the script keeps rows it should drop. Teaches how to verify filter logic by checking
boundary values and using .filter() with explicit comparison operators.
Multi-Line CSV Header
Run the script and look at the row count and the first few rows of the DataFrame. Do they match the data you expected to load?
import polars as pl
# Load survey results and report the number of respondents.
df = pl.read_csv("multilinecsv.csv")
print(f"Rows: {len(df)}")
print(df.head())
Source: Annual Survey 2023
Units: thousands
name,count,value
Alice,10,100
Bob,20,200
Carol,15,150
Explanation
The bug is not passing skip_rows to skip the extra header lines, so Polars reads
the multi-line header as data and reports the wrong number of rows. Teaches how to
inspect the first few rows of a DataFrame with .head() and how to use skip_rows
and has_header to handle non-standard file layouts.
Case-Sensitive Join Keys
Run the script and examine the mean_amount column in the result. Are there any
null values where you did not expect them?
import polars as pl
sales = pl.DataFrame({
"Region": ["North", "South", "East", "West", "North", "South"],
"amount": [100, 200, 150, 175, 120, 210],
})
# Compute per-region mean sales.
means = sales.group_by("Region").agg(
pl.col("amount").mean().alias("mean_amount")
)
means = means.with_columns(pl.col("Region").str.to_lowercase())
result = sales.join(means, on="Region", how="left")
print(result)
Show explanation
The bug is that the join key values differ in case after one side was
lowercased ("North" vs. "north"), so no rows match and every row in the joined
output has a null for the group mean. Teaches that Polars string joins are
case-sensitive and how to diagnose null-filled join results.
Dates Read as Strings
Run the script and check the schema of the DataFrame. What type does Polars assign
to the date column? How many rows does the filter return?
import polars as pl
CUTOFF = "2024-06-01"
df = pl.read_csv("datestring.csv")
result = df.filter(pl.col("date") > CUTOFF)
print(result)
event,date,count
launch,2024-03-01,42
review,2024-06-15,18
release,2024-09-30,75
followup,2024-12-01,31
Explanation
The bug is that Polars read the date column as strings, so the comparison is
lexicographic rather than chronological. With zero-padded ISO dates the two
orderings happen to coincide, but any other format (e.g., 2024-6-15 or
15/06/2024) silently breaks the filter. Teaches how to inspect inferred column
types with .schema, and how to cast a column to pl.Date before filtering.
Aggregation Order Error
Run the script and compare the output totals to the values in the CSV file. Do the per-region totals make sense?
import polars as pl
df = pl.read_csv("aggorder.csv")
# Compute total sales per region.
result = (
df.select(pl.col("sales").sum())
.group_by(pl.lit("all"))
.agg(pl.col("sales").sum())
)
print(result)
region,product,sales
North,widget,100
North,gadget,200
South,widget,150
South,gadget,300
East,widget,120
East,gadget,180
Explanation
The bug is calling .sum() before .group_by(), which collapses the whole column
to a single value and then groups a one-row DataFrame, producing a single grand
total instead of one total per region. Teaches the importance of operation
order in lazy and eager pipelines and how to verify intermediate results.
Lazy Evaluation Defers Errors
Run the script and read the error message and traceback. Which step in the pipeline does the error appear to come from? Is that where the mistake actually is?
import polars as pl
data = pl.DataFrame({
"id": [1, 2, 3, 4],
"score": [88, 72, 95, 61],
"label": ["A", "B", "A", "C"],
})
# Build a lazy pipeline: rename "score" to "points", filter, then select.
result = (
data.lazy()
.rename({"score": "points"})
.filter(pl.col("points") > 70)
.select(["id", "score", "label"])
.collect()
)
print(result)
Explanation
The bug is referencing a column that was renamed in an earlier step, so a
ColumnNotFoundError is raised at .collect() time rather than when the
transformation is written. Teaches how Polars lazy evaluation defers errors and how
to use .collect() on intermediate steps to locate the failing transformation.
Wrong CSV Delimiter
Run the script and examine the column names and values in the combined DataFrame. Are the columns what you expected?
import polars as pl
df_a = pl.read_csv("wrongdelim_a.csv")
df_b = pl.read_csv("wrongdelim_b.csv")
combined = pl.concat([df_a, df_b], how="diagonal")
print(f"Columns: {combined.columns}")
print(combined)
name,age,score
Alice,30,88
Bob,25,72
name;age;score
Carol;28;95
Dave;35;61
Explanation
The bug is that the second file uses semicolons as delimiters, so Polars reads
each of its rows as a single column named "name;age;score". When concat is
called with how="diagonal", missing columns are filled with nulls and the
result has four columns instead of the expected three. Teaches how to check
column names and counts before concatenating DataFrames.
Sentinel Values Mistaken for Data
Run the script and look at the mean. Then inspect the raw data. Are there any values in the column that seem unusually large?
import polars as pl
# 999 is a sentinel value meaning "no response recorded"; treat as missing.
SENTINEL = 999
df = pl.read_csv("sentinel.csv")
clean = df.with_columns(
pl.col("response_time").fill_null(0)
)
print(clean["response_time"].mean())
participant,age,response_time
P001,24,320
P002,31,999
P003,19,415
P004,45,999
P005,27,280
P006,38,999
P007,22,510
Explanation
The bug is that the dataset uses 999 as a sentinel for missing data rather than
a true null, so .fill_null() has no effect (there are no nulls to fill) and the
mean is skewed by what appear to be valid large numbers. Teaches how to
identify domain-specific sentinel values and replace them with real nulls
before analysis.
Whitespace in Group Keys
Run the script and count the number of groups produced. Is it more than you
expected? Call .unique() on the grouping column and examine what you see.
import polars as pl
df = pl.read_csv("whitespace.csv")
result = df.group_by("region").agg(pl.col("sales").sum())
print(result)
region,sales
North ,100
North,120
South,200
South,180
East,150
East ,160
Explanation
The bug is that a string column has inconsistent whitespace (e.g., "North " and
"North" are treated as different groups), so group_by followed by agg produces
more groups than expected. Teaches how to inspect unique values with .unique(),
use .str.strip_chars() to normalize strings before grouping, and verify group
counts.
Rolling Window min_periods
Run the script and count the null values in the rolling_mean column. Is the
number of null rows what you expected for a 7-day window?
import polars as pl
WINDOW = 7 # days
df = pl.read_csv("rolling.csv")
result = df.with_columns(
pl.col("value")
.rolling_mean(window_size=WINDOW)
.alias("rolling_mean")
)
print(result)
day,value
1,10
2,12
3,9
4,14
5,11
6,13
7,15
8,12
9,10
10,16
Explanation
The bug is passing window_size=7 without relaxing the minimum sample count
(min_samples in recent Polars; min_periods in older releases), so any window
that cannot be fully filled returns null and the result has more nulls than
expected. Teaches how rolling aggregations handle incomplete windows and how to
choose between strict and lenient behavior.
Missing Quantitative Encoding Type
Run the script and open the saved chart in a browser. Do all the bars have heights
that reflect the value column?
import altair as alt
import polars as pl
data = pl.DataFrame({
"category": ["A", "B", "C", "D"],
"value": [10, 40, 25, 60],
}).to_pandas()
chart = alt.Chart(data).mark_bar().encode(
x=alt.X("category"),
y=alt.Y("value"),
)
chart.save("quanttype.html")
Explanation
The bug is encoding the y-axis with alt.Y("value") without an explicit type.
When Altair cannot infer a quantitative type for the column (for example, when
the dtype is object or the data is not a pandas DataFrame), it falls back to a
nominal encoding and the bar heights no longer reflect the values. Teaches how
Altair infers encoding types and why specifying type explicitly avoids silent
misinterpretation.
Color Scale from String Column
Run the script and open the saved chart. Does the color scale appear as a continuous gradient, or as a discrete set of colors?
import altair as alt
import polars as pl
# Simulated dataset where "temperature" was read from a CSV as strings.
data = pl.DataFrame({
"x": [1, 2, 3, 4, 5],
"y": [10, 20, 15, 30, 25],
"temperature": ["1.2", "2.4", "3.5", "4.1", "5.0"],
}).to_pandas()
chart = alt.Chart(data).mark_point().encode(
x="x:Q",
y="y:Q",
color="temperature",
)
chart.save("colorscale.html")
Explanation
The bug is that the color column was read as a string (e.g., "3.5") rather than a
float, so Altair applies a nominal color scale and the scatter plot shows a discrete
legend with arbitrary colors instead of a continuous gradient. Teaches how data types
in the source DataFrame determine Altair's default encoding choices.
Nominal vs. Temporal Encoding
Open the saved chart in a browser. Are the months arranged in chronological order along the x-axis, or in a different order?
import altair as alt
import polars as pl
data = pl.DataFrame({
"month": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"],
"sales": [120, 95, 140, 110, 160],
}).to_pandas()
chart = alt.Chart(data).mark_line(point=True).encode(
x=alt.X("month:N"),
y=alt.Y("sales:Q"),
)
chart.save("temporal.html")
Explanation
The bug is encoding the x-axis date column as type="nominal" instead of
type="temporal", so Altair treats the months as unordered categories: the
zero-padded "YYYY-MM" strings here happen to sort chronologically, but the
axis spacing is categorical and any other date format would scramble the
order. Teaches the difference between nominal and temporal encoding in Altair
and how to verify axis ordering.
Altair Filter on Wrong Field
Open the saved chart in a browser. Does it show only the categories whose count is 100 or more, or does it show all of them?
import altair as alt
import polars as pl
data = pl.DataFrame({
"category": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"],
"count": [50, 120, 30, 200, 80, 15, 175, 60, 90, 140, 25],
}).to_pandas()
# Show only categories with count >= 100.
chart = alt.Chart(data).mark_bar().encode(
x=alt.X("category:N"),
y=alt.Y("count:Q"),
).transform_filter(
alt.datum.Count >= 100
)
chart.save("filterfield.html")
Explanation
The bug is a field name in the transform_filter predicate (alt.datum.Count)
that does not match any data column (the column is count). Vega-Lite silently
evaluates the unknown field as undefined rather than raising an error, so all
categories are shown instead of only those with count >= 100. Teaches how to
debug Altair transforms by inspecting the chart's JSON specification and
checking that field names match the data source.
Tooltip Field with Spaces
Open the saved chart in a browser and hover over a point. Does the Sales Region
field in the tooltip show a value?
import altair as alt
import polars as pl
data = pl.DataFrame({
"Product Name": ["Widget", "Gadget", "Doohickey"],
"Sales Region": ["North", "South", "East"],
"revenue": [1200, 800, 950],
}).to_pandas()
chart = alt.Chart(data).mark_point().encode(
x=alt.X("revenue:Q"),
tooltip=["Product Name:N", "Sales Region", "revenue:Q"],
)
chart.save("tooltip.html")
Explanation
The bug is that the tooltip field name has a space in it (e.g., "Sales Region")
but is referenced without quoting in the Altair shorthand string, so the tooltip
shows null for that field even though the data contains values. Teaches how Altair
shorthand handles special characters and when to use
alt.Tooltip(field=..., title=...) instead.
Polars DataFrame in Altair
Run the script and open the saved chart in a browser. Does the chart show any data points?
import altair as alt
import polars as pl
df = pl.DataFrame({
"x": [1, 2, 3, 4, 5],
"y": [10, 25, 15, 30, 20],
})
chart = alt.Chart(df).mark_line().encode(
x="x:Q",
y="y:Q",
)
chart.save("polarsinaltair.html")
Explanation
The bug is passing the Polars DataFrame directly to alt.Chart(); Altair
versions without dataframe-interchange support cannot serialize it, so the
chart is blank. Teaches which data formats Altair accepts natively and how to
convert between Polars and the formats Altair supports.
Spurious Perfect Correlation
Run the script and note the correlation value. Then examine how metric_a and
metric_b are constructed. Should they really be perfectly correlated?
import polars as pl
data = pl.DataFrame({
"base": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
data = data.with_columns(
(pl.col("base") * 2.5).alias("metric_a"),
(pl.col("base") * 2.5 + 10.0).alias("metric_b"),
)
corr = data.select(pl.corr("metric_a", "metric_b")).item()
print(f"Correlation: {corr}")
Explanation
The bug is that both columns were derived from the same source column in the same pipeline step (a copy rather than an independent transformation), so the correlation is exactly 1.0 for columns that should not be perfectly correlated. Teaches how to audit column provenance in a pipeline and use scatter plots to sanity-check correlation claims.
Memory from Chunk Accumulation
Run the script on a large file and watch how memory usage changes as the script runs. Does memory stay roughly constant, or does it grow?
import polars as pl
CHUNK_SIZE = 2 # rows per chunk
reader = pl.read_csv_batched("chunkaccum.csv", batch_size=CHUNK_SIZE)
chunks = []
while batches := reader.next_batches(1):
chunks.extend(batches)
result = pl.concat(chunks)
print(result["value"].sum())
id,value
1,10
2,20
3,30
4,40
5,50
Explanation
The bug is accumulating all chunks in memory before concatenating rather than
processing each chunk and writing results incrementally, so the pipeline runs out of
memory on large files. Teaches streaming versus batch processing patterns and how to
use Polars' scan_csv with lazy evaluation to avoid loading the full file.
Float Year in Faceted Chart
Open the saved chart in a browser. How many facet panels does it show? Inspect the
type of the year column in the DataFrame.
import altair as alt
import polars as pl
data = pl.DataFrame({
"region": ["North", "South", "East", "North", "South", "East"],
"year": [2022.0, 2022.0, 2022.0, 2023.0, 2023.0, 2023.0],
"sales": [100, 150, 120, 130, 160, 140],
}).to_pandas()
chart = alt.Chart(data).mark_bar().encode(
x=alt.X("region:N"),
y=alt.Y("sales:Q"),
).facet(
facet="year:N",
columns=2,
)
chart.save("floatyear.html")
Explanation
The bug is that the year column contains floats (e.g., 2022.0) because it was
created or inferred as Float64. Altair's facet treats each unique float as a
separate nominal value, so the panel labels show float-formatted years and any
rounding noise (e.g., 2022.0000001) would create spurious extra panels.
Teaches how to cast integer-like columns to pl.Int32 before charting and how
to verify facet behavior with a small sample.
Out-of-Order Notebook Cells
Run this script from top to bottom. Does it raise an error? Which line causes the error?
import polars as pl
raw = pl.read_csv("filterboundary.csv")
summary = clean.group_by("product").agg(pl.col("value").mean()) # noqa: F821
clean = raw.filter(pl.col("value") > 40)
print(summary)
Explanation
The bug is that the notebook's cells were executed out of order: in the live session, clean already existed in memory from an earlier run, so the aggregation cell appeared to work. Run top to bottom as a script, the same code raises a NameError because clean is used before it is defined. Teaches why notebooks must be tested by restarting the kernel and running all cells in order, and how to structure pipelines so each step depends only on its explicit inputs.