Data Science
Select rows above a threshold value
Run the script and count the rows returned. Then count by hand how many rows in the CSV should satisfy the condition. Do the two counts agree?
import polars as pl
# Keep only rows where value EXCEEDS (strictly greater than) the threshold.
THRESHOLD = 50.0
df = pl.read_csv("filterboundary.csv")
result = df.filter(pl.col("value") >= THRESHOLD)
print(result)
product,value
alpha,30.0
beta,50.0
gamma,70.0
delta,50.0
epsilon,90.0
Explanation
The bug is using >= instead of > (or vice versa) in the filter expression, so
the script keeps rows it should drop.
Shows: how to verify filter logic by checking boundary values and
using .filter() with explicit comparison operators.
To find it: count the matching rows by hand from the CSV, then compare to
print(len(df.filter(...))). If the counts differ by one, check whether the boundary
value itself should be included or excluded, and verify whether >= or > is correct.
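A minimal corrected version of the filter, if the strictly-greater-than behavior described in the comment is what you want:
import polars as pl

THRESHOLD = 50.0
df = pl.read_csv("filterboundary.csv")
result = df.filter(pl.col("value") > THRESHOLD)  # strict comparison excludes the 50.0 boundary rows
print(len(result))  # 2 rows: gamma and epsilon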
Read a spreadsheet with a two-row header
Run the script and look at the row count and the first few rows of the DataFrame. Do they match the data you expected to load?
import polars as pl
# Load survey results and report the number of respondents.
df = pl.read_csv("multilinecsv.csv")
print(f"Rows: {len(df)}")
print(df.head())
Source: Annual Survey 2023
Units: thousands
name,count,value
Alice,10,100
Bob,20,200
Carol,15,150
Explanation
The bug is not passing skip_rows to skip the extra header lines, so Polars reads
the multi-line header as data and reports the wrong number of rows.
Shows: how to inspect the first few rows of a DataFrame with .head()
and how to use skip_rows and has_header to handle non-standard
file layouts.
To find it: print df.head() and compare the first few rows to the raw CSV. If the
first "data" row contains what looks like a column header, the loader read one or
more header lines as data. Check df.shape[0] against the expected row count.
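A corrected load for this layout, as a sketch; set skip_rows to however many metadata lines precede the real header in your file:
import polars as pl

df = pl.read_csv("multilinecsv.csv", skip_rows=2)  # skip the "Source:" and "Units:" lines
print(f"Rows: {len(df)}")  # 3 respondents
print(df.head())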
Combine two datasets on a shared identifier
Run the script and examine the mean_amount column in the result. Are there any
null values where you did not expect them?
import polars as pl
sales = pl.DataFrame({
"Region": ["North", "South", "East", "West", "North", "South"],
"amount": [100, 200, 150, 175, 120, 210],
})
# Compute per-region mean sales.
means = sales.group_by("Region").agg(
pl.col("amount").mean().alias("mean_amount")
)
means = means.with_columns(pl.col("Region").str.to_lowercase())
result = sales.join(means, on="Region", how="left")
print(result)
Explanation
The bug is that an intermediate step lowercased the values of the join key ("North"
became "north"), so no key in means matches any key in sales and every row in the
joined output has a null mean_amount.
Shows: that Polars joins match key values exactly, including case, and how to diagnose null-filled join results.
To find it: print sales["Region"].unique() and means["Region"].unique() side by side.
If the same labels appear with different capitalization, the keys no longer match.
A join on mismatched key values produces a null-filled column for the unmatched side.
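A quick check and one possible fix, continuing from the script's sales and means variables:
# Compare the key values on each side before joining.
print(sales["Region"].unique().to_list())
print(means["Region"].unique().to_list())
# Normalize both sides the same way so the key values match.
result = sales.with_columns(pl.col("Region").str.to_lowercase()).join(
    means, on="Region", how="left"
)
print(result["mean_amount"].null_count())  # 0 once the keys agree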
Filter records by date of collection
Run the script and check the schema of the DataFrame. What type does Polars assign
to the date column? How many rows does the filter return?
import polars as pl
CUTOFF = "2024-06-01"
df = pl.read_csv("datestring.csv")
result = df.filter(pl.col("date") > CUTOFF)
print(result)
event,date,count
launch,2024-03-01,42
review,2024-06-15,18
release,2024-09-30,75
followup,2024-12-01,31
Explanation
The bug is that Polars read the date column as strings, so the comparison is lexicographic rather than chronological and the filter returns no rows even though matching rows exist.
Shows: how to inspect inferred column types with .schema, and how to
cast a column to pl.Date before filtering.
To find it: print df.schema and check the type of the date column. If it shows
String instead of Date, comparisons against a datetime value will return no
rows because string comparison is lexicographic, not chronological.
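A corrected version that parses the column before comparing, assuming ISO-formatted dates like those in the sample file:
import polars as pl
from datetime import date

df = pl.read_csv("datestring.csv").with_columns(
    pl.col("date").str.to_date()  # parse "YYYY-MM-DD" strings into pl.Date
)
result = df.filter(pl.col("date") > date(2024, 6, 1))
print(result)  # review, release, and followup rows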
Compute per-group statistics
Run the script and compare the output totals to the values in the CSV file. Do the per-region totals make sense?
import polars as pl
df = pl.read_csv("aggorder.csv")
# Compute total sales per region.
result = (
df.select(pl.col("sales").sum())
.group_by(pl.lit("all"))
.agg(pl.col("sales").sum())
)
print(result)
region,product,sales
North,widget,100
North,gadget,200
South,widget,150
South,gadget,300
East,widget,120
East,gadget,180
Explanation
The bug is calling .sum() before .group_by(): the sum collapses the whole column to a
single value, so the group_by operates on a one-row DataFrame and the output is a single
grand total instead of one total per region.
Shows: the importance of operation order in lazy and eager pipelines and how to verify intermediate results.
To find it: break the pipeline into two steps and print the DataFrame after each one.
After .sum() alone, you will see a single-row DataFrame — the sum already happened
before grouping.
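The corrected pipeline groups first, then aggregates within each group:
import polars as pl

df = pl.read_csv("aggorder.csv")
result = df.group_by("region").agg(pl.col("sales").sum().alias("total_sales"))
print(result)  # North 300, South 450, East 300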
Debug a pipeline that fails at the wrong step
Run the script and read the error message and traceback. Which step in the pipeline does the error appear to come from? Is that where the mistake actually is?
import polars as pl
data = pl.DataFrame({
"id": [1, 2, 3, 4],
"score": [88, 72, 95, 61],
"label": ["A", "B", "A", "C"],
})
# Build a lazy pipeline: rename "score" to "points", filter, then select.
result = (
data.lazy()
.rename({"score": "points"})
.filter(pl.col("points") > 70)
.select(["id", "score", "label"])
.collect()
)
print(result)
Explanation
The bug is referencing a column that was renamed in an earlier step, so a
ColumnNotFoundError is raised at .collect() time rather than when the
transformation is written.
Shows: how Polars lazy evaluation defers errors and how to use
.collect() on intermediate steps to locate the failing
transformation.
To find it: insert .collect() after each transformation step and run the script
again. The first step where .collect() raises a ColumnNotFoundError is where
the broken reference is — even though the original error appeared only at the final
.collect().
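Once the rename is accounted for, the final select must use the new column name; a corrected tail of the same pipeline:
result = (
    data.lazy()
    .rename({"score": "points"})
    .filter(pl.col("points") > 70)
    .select(["id", "points", "label"])  # refer to the renamed column
    .collect()
)
print(result)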
Read a semicolon-separated export from a database
Run the script and examine the column names and values in the combined DataFrame. Are the columns what you expected?
import polars as pl
df_a = pl.read_csv("wrongdelim_a.csv")
df_b = pl.read_csv("wrongdelim_b.csv")
combined = pl.concat([df_a, df_b], how="diagonal")
print(f"Columns: {combined.columns}")
print(combined)
name,age,score
Alice,30,88
Bob,25,72
name;age;score
Carol;28;95
Dave;35;61
Explanation
The bug is that the second file uses semicolons as delimiters, so Polars reads each
entire row as a single column named "name;age;score". When concat is called with
how="diagonal", the mismatched columns are filled with nulls and the result has an
extra, garbage-named column alongside the three you expected.
Shows: how to check column names and counts before concatenating DataFrames.
To find it: print df1.columns and df2.columns before the concat. If df2 has
one column whose name looks like an entire row — e.g., "id;name;value" — the file
uses a different delimiter than the one specified.
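A corrected load that tells Polars which delimiter the second file actually uses (separator is the keyword in current Polars releases):
import polars as pl

df_a = pl.read_csv("wrongdelim_a.csv")
df_b = pl.read_csv("wrongdelim_b.csv", separator=";")  # semicolon-delimited export
combined = pl.concat([df_a, df_b], how="vertical")
print(combined.columns)  # ['name', 'age', 'score']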
Average measurements that include missing values
Run the script and look at the mean. Then inspect the raw data. Are there any values in the column that seem unusually large?
import polars as pl
# 999 is a sentinel value meaning "no response recorded"; treat as missing.
SENTINEL = 999
df = pl.read_csv("sentinel.csv")
clean = df.with_columns(
    pl.col("response_time").fill_null(strategy="mean")
)
print(clean["response_time"].mean())
participant,age,response_time
P001,24,320
P002,31,999
P003,19,415
P004,45,999
P005,27,280
P006,38,999
P007,22,510
Explanation
The bug is that the dataset uses 999 as a sentinel for missing data rather than a
true null, so .fill_null() has no effect on them and the mean is skewed by what
appear to be valid large numbers.
Shows: how to identify domain-specific sentinel values and replace
them with proper nulls before analysis.
To find it: print df["response_time"].max(). A suspiciously round large value like
999 in a column of response times is a sentinel for "no response recorded," not
a real measurement. Replace it with null before computing the mean.
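One way to turn the sentinel into a real null before averaging, reusing the script's SENTINEL constant:
import polars as pl

SENTINEL = 999
df = pl.read_csv("sentinel.csv")
clean = df.with_columns(
    pl.when(pl.col("response_time") == SENTINEL)
    .then(None)
    .otherwise(pl.col("response_time"))
    .alias("response_time")
)
print(clean["response_time"].mean())  # 381.25, the mean of the four real readings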
Group sales figures by region
Run the script and count the number of groups produced. Is it more than you
expected? Call .unique() on the grouping column and examine what you see.
import polars as pl
df = pl.read_csv("whitespace.csv")
result = df.group_by("region").agg(pl.col("sales").sum())
print(result)
region,sales
North ,100
North,120
South,200
South,180
East,150
East ,160
Explanation
The bug is that a string column has inconsistent whitespace (e.g., "North " and
"North" are treated as different groups), so group_by followed by agg produces
more groups than expected.
Shows: how to inspect unique values with .unique(), use
.str.strip_chars() to normalize strings before grouping, and verify
group counts.
To find it: print df["category"].unique() and count the items. If you see
"North" and "North " as separate entries, trailing whitespace is causing the
split.
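A corrected grouping that normalizes the strings first:
import polars as pl

df = pl.read_csv("whitespace.csv")
result = (
    df.with_columns(pl.col("region").str.strip_chars())  # remove stray whitespace
    .group_by("region")
    .agg(pl.col("sales").sum())
)
print(result)  # three groups: North 220, South 380, East 310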
Smooth a noisy sensor signal
Run the script and count the null values in the rolling_mean column. Is the
number of null rows what you expected for a 7-day window?
import polars as pl
WINDOW = 7 # days
df = pl.read_csv("rolling.csv")
result = df.with_columns(
pl.col("value")
.rolling_mean(window_size=WINDOW)
.alias("rolling_mean")
)
print(result)
day,value
1,10
2,12
3,9
4,14
5,11
6,13
7,15
8,12
9,10
10,16
Explanation
The bug is passing window_size=7 without setting min_periods=1, so any window
that cannot be fully filled returns null and the result has far more nulls than
expected.
Shows: how rolling aggregations handle incomplete windows and how to
choose between strict and lenient behavior with min_periods.
To find it: print df["rolling_mean"].is_null().sum() to count null values. If the
count is much larger than window_size - 1, the window requires more data points
than are available for most positions. Setting min_periods=1 allows partial windows
to produce a result.
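A lenient version of the same call, continuing from the script's df and WINDOW (shown with min_periods, the keyword this explanation uses; the argument name has changed across Polars versions, so check your version's docs):
result = df.with_columns(
    pl.col("value")
    .rolling_mean(window_size=WINDOW, min_periods=1)  # partial windows produce a value instead of null
    .alias("rolling_mean")
)
print(result["rolling_mean"].null_count())  # 0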
Plot measurement values on a bar chart
Run the script and open the saved chart in a browser. Do all the bars have heights
that reflect the value column?
import altair as alt
import polars as pl
data = pl.DataFrame({
"category": ["A", "B", "C", "D"],
"value": [10, 40, 25, 60],
}).to_pandas()
chart = alt.Chart(data).mark_bar().encode(
x=alt.X("category"),
y=alt.Y("value"),
)
chart.save("quanttype.html")
Explanation
The bug is encoding the y-axis with alt.Y("value") without specifying
type="quantitative", so Altair treats the column as nominal and counts categories
instead of summing values, giving all bars the same height.
Shows: how Altair infers encoding types and why specifying type
explicitly avoids silent misinterpretation.
To find it: open the chart and count the unique bar heights. If every bar is the same
height, Altair is counting rows rather than summing values. Call
alt.Chart(data).mark_bar().encode(y="value").to_dict() and search for "type" in
the output to see what Altair inferred.
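A corrected encoding that pins the type explicitly so inference cannot silently change it:
chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("category:N"),
    y=alt.Y("value:Q"),  # explicitly quantitative
)
chart.save("quanttype.html")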
Color a chart by a numeric category
Run the script and open the saved chart. Does the color scale appear as a continuous gradient, or as a discrete set of colors?
import altair as alt
import polars as pl
# Simulated dataset where "temperature" was read from a CSV as strings.
data = pl.DataFrame({
"x": [1, 2, 3, 4, 5],
"y": [10, 20, 15, 30, 25],
"temperature": ["1.2", "2.4", "3.5", "4.1", "5.0"],
}).to_pandas()
chart = alt.Chart(data).mark_point().encode(
x="x:Q",
y="y:Q",
color="temperature",
)
chart.save("colorscale.html")
Explanation
The bug is that the color column was read as a string (e.g., "3.5") rather than a
float, so Altair applies a nominal color scale and the scatter plot shows a discrete
legend with arbitrary colors instead of a continuous gradient.
Shows: how data types in the source DataFrame determine Altair's default encoding choices.
To find it: print df.schema and check the type of the color column. If it shows
String instead of Float64, cast it with .cast(pl.Float64) before charting.
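A sketch of the fix, casting the column before the frame is handed to Altair:
data = pl.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 20, 15, 30, 25],
    "temperature": ["1.2", "2.4", "3.5", "4.1", "5.0"],
}).with_columns(pl.col("temperature").cast(pl.Float64)).to_pandas()

chart = alt.Chart(data).mark_point().encode(
    x="x:Q",
    y="y:Q",
    color="temperature:Q",  # continuous color scale
)
chart.save("colorscale.html")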
Plot measurements collected over time
Open the saved chart in a browser. Are the months arranged in chronological order along the x-axis, or in a different order?
import altair as alt
import polars as pl
data = pl.DataFrame({
"month": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"],
"sales": [120, 95, 140, 110, 160],
}).to_pandas()
chart = alt.Chart(data).mark_line(point=True).encode(
x=alt.X("month:N"),
y=alt.Y("sales:Q"),
)
chart.save("temporal.html")
Explanation
The bug is encoding the x-axis date column as type="nominal" instead of
type="temporal", so Altair treats the months as category labels, sorted as strings and
spaced evenly, rather than as points on a continuous time scale in chronological order.
Shows: the difference between nominal and temporal encoding in Altair and how to verify axis ordering.
To find it: open the chart and check whether the x-axis dates are in chronological
order. If months appear alphabetically — e.g., April before August before December —
rather than in calendar order, check the alt.X(...) call for type=.
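A corrected encoding that uses a temporal type; a sketch that assumes the "2024-01" strings parse as dates, otherwise convert them to full dates such as "2024-01-01" first:
chart = alt.Chart(data).mark_line(point=True).encode(
    x=alt.X("month:T"),  # temporal: chronological order on a time scale
    y=alt.Y("sales:Q"),
)
chart.save("temporal.html")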
Add an interactive filter to a chart
Open the saved chart in a browser. Does it show only the categories whose count is 100 or more, or does it show all of them?
import altair as alt
import polars as pl
data = pl.DataFrame({
"category": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"],
"count": [50, 120, 30, 200, 80, 15, 175, 60, 90, 140, 25],
}).to_pandas()
# Show only categories with count >= 100.
chart = alt.Chart(data).mark_bar().encode(
x=alt.X("category:N"),
y=alt.Y("count:Q"),
).transform_filter(
alt.datum.Count >= 100
)
chart.save("filterfield.html")
Explanation
The bug is that the field referenced in the transform_filter predicate (alt.datum.Count)
does not match any column: the data has "count", all lowercase. Altair and Vega-Lite do
not raise an error for an unknown field, so the filter silently misbehaves instead of
keeping only the categories whose count is 100 or more.
Shows: how to debug Altair transforms by inspecting the chart's JSON specification and checking field names match the data source.
To find it: call .to_dict() on the chart and search for the field name inside the
transform section. Compare it character by character to the actual column name in
the DataFrame — a one-character difference silently disables the filter.
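A corrected filter that references the column exactly as it is spelled in the data:
chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("category:N"),
    y=alt.Y("count:Q"),
).transform_filter(
    alt.datum.count >= 100  # lowercase "count" matches the DataFrame column
)
chart.save("filterfield.html")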
Show column values in a chart tooltip
Open the saved chart in a browser and hover over a point. Does the Sales Region
field in the tooltip show a value?
import altair as alt
import polars as pl
data = pl.DataFrame({
"Product Name": ["Widget", "Gadget", "Doohickey"],
"Sales Region": ["North", "South", "East"],
"revenue": [1200, 800, 950],
}).to_pandas()
chart = alt.Chart(data).mark_point().encode(
x=alt.X("revenue:Q"),
tooltip=["Product Name:N", "Sales Region", "revenue:Q"],
)
chart.save("tooltip.html")
Explanation
The bug is that the tooltip field name has a space in it (e.g., "Sales Region")
but is referenced without quoting in the Altair shorthand string, so the tooltip
shows null for that field even though the data contains values.
Shows: how Altair shorthand handles special characters and when to use
alt.Tooltip(field=…, title=…) instead.
To find it: hover over a point in the chart and note which tooltip field shows
null. Then print df.columns to find the exact column name. If the name contains
a space, the Altair shorthand parser stops at the space and the field is never
matched.
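Following the suggestion above, the long form skips shorthand parsing entirely:
chart = alt.Chart(data).mark_point().encode(
    x=alt.X("revenue:Q"),
    tooltip=[
        alt.Tooltip(field="Product Name", type="nominal"),
        alt.Tooltip(field="Sales Region", type="nominal"),
        alt.Tooltip(field="revenue", type="quantitative"),
    ],
)
chart.save("tooltip.html")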
Pass a Polars dataframe to a chart library
Run the script and open the saved chart in a browser. Does the chart show any data points?
import altair as alt
import polars as pl
df = pl.DataFrame({
"x": [1, 2, 3, 4, 5],
"y": [10, 25, 15, 30, 20],
})
chart = alt.Chart(df).mark_line().encode(
x="x:Q",
y="y:Q",
)
chart.save("polarsinaltair.html")
Explanation
The bug is passing the Polars DataFrame directly to alt.Chart() instead of
converting it to a pandas DataFrame or using alt.Data, so the chart is blank.
Shows: which data formats Altair accepts natively and how to convert between Polars and the formats Altair supports.
To find it: print type(df) to confirm it is a Polars DataFrame, then open the
chart — if it is blank, Altair did not receive a supported data format. Convert with
df.to_pandas() and open the chart again.
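The explicit conversion, spelled out (recent Altair releases can also accept Polars frames directly, so check which version you are running):
chart = alt.Chart(df.to_pandas()).mark_line().encode(
    x="x:Q",
    y="y:Q",
)
chart.save("polarsinaltair.html")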
Check whether two measurements are related
Run the script and note the correlation value. Then examine how metric_a and
metric_b are constructed. Should they really be perfectly correlated?
import polars as pl
data = pl.DataFrame({
"base": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
data = data.with_columns(
(pl.col("base") * 2.5).alias("metric_a"),
(pl.col("base") * 2.5 + 10.0).alias("metric_b"),
)
corr = data.select(pl.corr("metric_a", "metric_b")).item()
print(f"Correlation: {corr}")
Explanation
The bug is that both columns were derived from the same source column in the same pipeline step (a copy rather than an independent transformation), so the correlation is exactly 1.0 for columns that should not be perfectly correlated.
Shows: how to audit column provenance in a pipeline and use scatter plots to sanity-check correlation claims.
To find it: print the expression used to create each column side by side. If both
are derived from df["x"] in the same step, they are the same data. A scatter plot
of metric_a vs. metric_b that falls on a perfect diagonal confirms it.
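A quick scatter of the two metrics makes the problem visible; a sketch using the same chart library as the other examples here:
import altair as alt

chart = alt.Chart(data.to_pandas()).mark_point().encode(
    x="metric_a:Q",
    y="metric_b:Q",
)
chart.save("corrcheck.html")  # every point falls on a single straight line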
Process a large file in pieces
Run the script on a large file and watch how memory usage changes as the script runs. Does memory stay roughly constant, or does it grow?
import polars as pl
CHUNK_SIZE = 2 # rows per chunk
reader = pl.read_csv_batched("chunkaccum.csv", batch_size=CHUNK_SIZE)
chunks = []
while (batch := reader.next_batches(1)) is not None:
    chunks.extend(batch)
result = pl.concat(chunks)
print(result["value"].sum())
id,value
1,10
2,20
3,30
4,40
5,50
Explanation
The bug is accumulating all chunks in memory before concatenating rather than processing each chunk and writing results incrementally, so the pipeline runs out of memory on large files.
Shows: streaming versus batch processing patterns and how to use
Polars' scan_csv with lazy evaluation to avoid loading the full
file.
To find it: add import tracemalloc; tracemalloc.start() at the top and print
tracemalloc.get_traced_memory() after processing each chunk. If the peak figure
grows linearly with the number of chunks, the chunks are being accumulated in memory
rather than discarded after each step.
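A lazy alternative that avoids accumulating chunks by hand; your Polars version may also offer a streaming option on collect for very large files:
import polars as pl

total = (
    pl.scan_csv("chunkaccum.csv")
    .select(pl.col("value").sum())  # projection pushdown: only the value column is read
    .collect()
    .item()
)
print(total)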
Facet a chart by year of collection
Open the saved chart in a browser. How many facet panels does it show? Inspect the
type of the year column in the DataFrame.
import altair as alt
import polars as pl
data = pl.DataFrame({
"region": ["North", "South", "East", "North", "South", "East"],
"year": [2022.0, 2022.0, 2022.0, 2023.0, 2023.0, 2023.0],
"sales": [100, 150, 120, 130, 160, 140],
}).to_pandas()
chart = alt.Chart(data).mark_bar().encode(
x=alt.X("region:N"),
y=alt.Y("sales:Q"),
).facet(
facet="year:N",
columns=2,
)
chart.save("floatyear.html")
Explanation
The bug is that the year column contains floats (e.g., 2021.0) because Polars
inferred it as Float64. Altair's facet treats each unique float as a separate
nominal value but the layout collapses to one panel due to the unexpected type.
Shows: how to cast integer-like columns to pl.Int32 before charting
and how to verify facet behavior with a small sample.
To find it: print df["year"].dtype and df["year"].unique(). If the type is
Float64 and the values show .0 suffixes, cast the column with .cast(pl.Int32)
before passing to Altair.
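The cast the explanation describes, applied before the conversion to pandas:
data = pl.DataFrame({
    "region": ["North", "South", "East", "North", "South", "East"],
    "year": [2022.0, 2022.0, 2022.0, 2023.0, 2023.0, 2023.0],
    "sales": [100, 150, 120, 130, 160, 140],
}).with_columns(pl.col("year").cast(pl.Int32)).to_pandas()  # years become 2022 and 2023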
Run a notebook after restarting the kernel
Run this script from top to bottom. Does it raise an error? Which line causes the error?
import polars as pl
raw = pl.read_csv("filterboundary.csv")
summary = clean.group_by("product").agg(pl.col("value").mean()) # noqa: F821
clean = raw.filter(pl.col("value") > 40)
print(summary)
Explanation
The bug is that the notebook's cells were executed out of order, leaving a previously
defined DataFrame (clean) in memory that hid the ordering mistake. The code appears to
work when run interactively, but running it top to bottom as a script raises a NameError
because clean is used before it is defined.
Shows: why notebooks must be tested by restarting the kernel and running all cells in order, and how to structure pipelines so each step depends only on its explicit inputs.
To find it: run the script as a plain Python file from a fresh process (not inside a
notebook). If it raises an error that never appeared during interactive execution,
a cell was run out of order. Add print(df.columns) at the start of each step to
confirm each step's input matches what the previous step produced.
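The same script, re-ordered so each step depends only on what comes before it:
import polars as pl

raw = pl.read_csv("filterboundary.csv")
clean = raw.filter(pl.col("value") > 40)
summary = clean.group_by("product").agg(pl.col("value").mean())
print(summary)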