Data Science

Select rows above a threshold

Run the script and count the rows returned. Then count by hand how many rows in the CSV should satisfy the condition. Do the two counts agree?

import polars as pl

# Keep only rows where value EXCEEDS (strictly greater than) the threshold.
THRESHOLD = 50.0

df = pl.read_csv("filterboundary.csv")

result = df.filter(pl.col("value") >= THRESHOLD)
print(result)

filterboundary.csv:
product,value
alpha,30.0
beta,50.0
gamma,70.0
delta,50.0
epsilon,90.0
Explanation

The bug is using >= instead of > (or vice versa) in the filter expression, so the script keeps rows it should drop.

Shows: how to verify filter logic by checking boundary values and using .filter() with explicit comparison operators.

To find it: count the matching rows by hand from the CSV, then compare to print(len(df.filter(...))). If the counts differ by one, check whether the boundary value itself should be included or excluded, and verify whether >= or > is correct.
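
A minimal sketch of that check, reusing the script's file and threshold:

import polars as pl

THRESHOLD = 50.0
df = pl.read_csv("filterboundary.csv")

# Compare the strict and the inclusive count against your hand count.
print(df.filter(pl.col("value") > THRESHOLD).height)   # strictly greater
print(df.filter(pl.col("value") >= THRESHOLD).height)  # greater or equal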

Read a spreadsheet with a two-row header

Run the script. Does it load the file cleanly? If it does, do the row count and the first few rows of the DataFrame match the data you expected to load?

import polars as pl

# Load survey results and report the number of respondents.

df = pl.read_csv("multilinecsv.csv")
print(f"Rows: {len(df)}")
print(df.head())

multilinecsv.csv:
Source: Annual Survey 2023
Units: thousands
name,count,value
Alice,10,100
Bob,20,200
Carol,15,150
Explanation

The bug is not passing skip_rows to skip the two metadata lines, so Polars takes the first metadata line as the header. Depending on the reader settings it then either fails because the real data rows have more fields than that one-column header, or reads the metadata and header lines as data and reports the wrong number of rows.

Shows: how to inspect the first few rows of a DataFrame with .head() and how to use skip_rows and has_header to handle non-standard file layouts.

To find it: print df.head() and compare the first few rows to the raw CSV, and check df.shape[0] against the expected row count. If the load fails complaining that rows have more fields than the header, or the first "data" row contains what looks like a column header, the reader treated a metadata or header line as something it is not.
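
A sketch of the corrected load, assuming the file really has exactly two metadata lines before the header:

import polars as pl

# Skip the two metadata lines so the real header row is used.
df = pl.read_csv("multilinecsv.csv", skip_rows=2, has_header=True)
print(f"Rows: {len(df)}")
print(df.head())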

Combine two datasets on a shared identifier

Run the script and examine the mean_amount column in the result. Are there any null values where you did not expect them?

import polars as pl

sales = pl.DataFrame({
    "Region": ["North", "South", "East", "West", "North", "South"],
    "amount": [100, 200, 150, 175, 120, 210],
})

# Compute per-region mean sales.
means = sales.group_by("Region").agg(
    pl.col("amount").mean().alias("mean_amount")
)

means = means.with_columns(pl.col("Region").str.to_lowercase())

result = sales.join(means, on="Region", how="left")
print(result)
Explanation

The bug is that the lowercasing step changed the join key's values: means now holds "north", "south", and so on, while sales still holds "North", "South", and so on. Polars compares join keys case-sensitively, so no keys match and every row of the left join gets a null mean_amount.

Shows: that Polars joins match key values exactly (including case) and how to diagnose a join whose output column is all nulls.

To find it: print sales["Region"].unique() and means["Region"].unique() side by side. Look for values that differ only by capitalization. A join whose keys never match produces a null-filled column for the joined side.
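
A sketch of the diagnosis and one possible repair, continuing from the variables in the script above:

# Inspect the join keys on both sides.
print(sales["Region"].unique().to_list())
print(means["Region"].unique().to_list())

# Either drop the lowercasing step, or normalize the key on both sides.
fixed = sales.with_columns(pl.col("Region").str.to_lowercase()).join(
    means, on="Region", how="left"
)
print(fixed)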

Filter records by date of collection

Run the script and check the schema of the DataFrame. What type does Polars assign to the date column? How many rows does the filter return?

import polars as pl

CUTOFF = "2024-6-1"  # keep events strictly after June 1, 2024

df = pl.read_csv("datestring.csv")

result = df.filter(pl.col("date") > CUTOFF)
print(result)

datestring.csv:
event,date,count
launch,2024-03-01,42
review,2024-06-15,18
release,2024-09-30,75
followup,2024-12-01,31
Explanation

The bug is that Polars read the date column as strings, so pl.col("date") > "2024-6-1" is a lexicographic text comparison, not a date comparison. Every zero-padded date in the file starts "2024-0" or "2024-1", which sorts before "2024-6", so the filter returns no rows even though three events fall after June 1.

Shows: how to inspect inferred column types with .schema, and how to parse a string column into pl.Date (for example with .str.to_date()) before filtering.

To find it: print df.schema and check the type of the date column. If it shows String instead of Date, the comparison is plain text ordering, not chronological order; parse the column to dates and compare against a real date value instead.
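
A sketch of the type-aware version, assuming the dates in the file are in year-month-day form:

import datetime

import polars as pl

df = pl.read_csv("datestring.csv")
print(df.schema)  # the date column shows up as String

# Parse the strings into a real Date column, then compare against a date value.
result = df.with_columns(pl.col("date").str.to_date("%Y-%m-%d")).filter(
    pl.col("date") > datetime.date(2024, 6, 1)
)
print(result)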

Compute total sales per region

Run the script and compare the output totals to the values in the CSV file. Do the per-region totals make sense?

import polars as pl

df = pl.read_csv("aggorder.csv")

# Compute total sales per region.

result = (
    df.select(pl.col("sales").sum())
    .group_by(pl.lit("all"))
    .agg(pl.col("sales").sum())
)
print(result)

aggorder.csv:
region,product,sales
North,widget,100
North,gadget,200
South,widget,150
South,gadget,300
East,widget,120
East,gadget,180
Explanation

The bug is calling .sum() before .group_by(): the select sums the whole column first, so the region information is gone and the "grouped" result is a single overall total instead of one total per region.

Shows: the importance of operation order in lazy and eager pipelines and how to verify intermediate results.

To find it: break the pipeline into two steps and print the DataFrame after each one. After .sum() alone, you will see a single-row DataFrame — the sum already happened before grouping.
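
For comparison, a sketch with the grouping done first and the sum applied per group:

import polars as pl

df = pl.read_csv("aggorder.csv")

# Group first, then aggregate within each group.
result = df.group_by("region").agg(pl.col("sales").sum().alias("total_sales"))
print(result)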

Debug a pipeline that fails at the wrong step

Run the script and read the error message and traceback. Which step in the pipeline does the error appear to come from? Is that where the mistake actually is?

import polars as pl

data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [88, 72, 95, 61],
    "label": ["A", "B", "A", "C"],
})

# Build a lazy pipeline: rename "score" to "points", filter, then select.
result = (
    data.lazy()
    .rename({"score": "points"})
    .filter(pl.col("points") > 70)
    .select(["id", "score", "label"])
    .collect()
)
print(result)
Explanation

The bug is referencing a column that was renamed in an earlier step, so a ColumnNotFoundError is raised at .collect() time rather than when the transformation is written.

Shows: how Polars lazy evaluation defers errors and how to use .collect() on intermediate steps to locate the failing transformation.

To find it: collect the pipeline one step at a time: run .rename(...).collect(), then .rename(...).filter(...).collect(), and so on. The first prefix whose .collect() raises a ColumnNotFoundError contains the broken reference, even though the original error appeared only at the final .collect().
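
A sketch of that bisection, reusing the DataFrame from the script above and collecting progressively longer prefixes of the pipeline:

lf = data.lazy().rename({"score": "points"})
print(lf.collect())  # fine: the rename itself works

lf = lf.filter(pl.col("points") > 70)
print(lf.collect())  # fine: the filter uses the new name

lf = lf.select(["id", "score", "label"])
print(lf.collect())  # raises ColumnNotFoundError: "score" no longer exists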

Read a semicolon-delimited export from a database

Run the script and examine the column names and values in the combined DataFrame. Are the columns what you expected?

import polars as pl

df_a = pl.read_csv("wrongdelim_a.csv")

df_b = pl.read_csv("wrongdelim_b.csv")

combined = pl.concat([df_a, df_b], how="diagonal")
print(f"Columns: {combined.columns}")
print(combined)

wrongdelim_a.csv:
name,age,score
Alice,30,88
Bob,25,72

wrongdelim_b.csv:
name;age;score
Carol;28;95
Dave;35;61
Explanation

The bug is that the second file uses semicolons as delimiters while read_csv defaults to commas, so Polars reads each of its rows as a single column named "name;age;score". The diagonal concat then fills the mismatched columns with nulls, so the result has four columns instead of three and the second file's values sit in one string column.

Shows: how to check column names and counts before concatenating DataFrames, and how to pass separator=";" to pl.read_csv.

To find it: print df_a.columns and df_b.columns before the concat. If df_b has a single column whose name looks like an entire row, e.g. "name;age;score", the file uses a different delimiter than the default comma.
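
A sketch of the fix, assuming the second file really is semicolon-delimited:

import polars as pl

df_a = pl.read_csv("wrongdelim_a.csv")
df_b = pl.read_csv("wrongdelim_b.csv", separator=";")

print(df_a.columns, df_b.columns)  # both should now be ['name', 'age', 'score']
combined = pl.concat([df_a, df_b])
print(combined)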

Average measurements that include missing values

Run the script and look at the mean. Then inspect the raw data. Are there any values in the column that seem unusually large?

import polars as pl

# 999 is a sentinel value meaning "no response recorded"; treat as missing.
SENTINEL = 999

df = pl.read_csv("sentinel.csv")

clean = df.with_columns(
    pl.col("response_time").fill_null(None)
)
print(clean["response_time"].mean())

sentinel.csv:
participant,age,response_time
P001,24,320
P002,31,999
P003,19,415
P004,45,999
P005,27,280
P006,38,999
P007,22,510
Explanation

The bug is that the dataset uses 999 as a sentinel for missing data rather than a true null, so .fill_null() has no effect on them and the mean is skewed by what appear to be valid large numbers.

Shows: how to identify domain-specific sentinel values and replace them with true nulls before analysis.

To find it: print df["measurement"].max(). A suspiciously round large value like 999 in a column of biological measurements is a sentinel for "not recorded," not a real reading. Replace it with pl.Null before computing the mean.

Group sales records by region

Run the script and count the number of groups produced. Is it more than you expected? Call .unique() on the grouping column and examine what you see.

import polars as pl

df = pl.read_csv("whitespace.csv")

result = df.group_by("region").agg(pl.col("sales").sum())
print(result)

whitespace.csv:
region,sales
North ,100
North,120
South,200
 South,180
East,150
East ,160
Explanation

The bug is that a string column has inconsistent whitespace (e.g., "North " and "North" are treated as different groups), so group_by followed by agg produces more groups than expected.

Shows: how to inspect unique values with .unique(), use .str.strip_chars() to normalize strings before grouping, and verify group counts.

To find it: print df["category"].unique() and count the items. If you see "North" and "North " as separate entries, trailing whitespace is causing the split.

Smooth a noisy sensor signal

Run the script and count the null values in the rolling_mean column. Is the number of null rows what you expected for a 7-day window?

import polars as pl

WINDOW = 7  # days

df = pl.read_csv("rolling.csv")

result = df.with_columns(
    pl.col("value")
    .rolling_mean(window_size=WINDOW)
    .alias("rolling_mean")
)
print(result)

rolling.csv:
day,value
1,10
2,12
3,9
4,14
5,11
6,13
7,15
8,12
9,10
10,16
Explanation

The bug is passing window_size=7 without considering min_periods: by default a window must be completely filled before it produces a value, so the first six of the ten rows come back null.

Shows: how rolling aggregations handle incomplete windows and how to choose between strict and lenient behavior with min_periods.

To find it: print df["rolling_mean"].is_null().sum() to count null values. If the count is much larger than window_size - 1, the window requires more data points than are available for most positions. Setting min_periods=1 allows partial windows to produce a result.

Plot values on a bar chart

Run the script and open the saved chart in a browser. Do all the bars have heights that reflect the value column?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "category": ["A", "B", "C", "D"],
    "value": [10, 40, 25, 60],
}).to_pandas()

chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("category"),
    y=alt.Y("value"),
)
chart.save("quanttype.html")
Explanation

The bug is that the value column contains strings and the shorthand alt.Y("value") leaves the type to be inferred, so Altair infers a nominal y-axis: the axis becomes a list of distinct text values and the bar heights no longer encode magnitude.

Shows: how Altair infers encoding types and why specifying type explicitly avoids silent misinterpretation.

To find it: open the chart. If the y-axis shows the individual values as discrete labels rather than a continuous numeric scale, the encoding is nominal. Call chart.to_dict() and search for "type" in the y encoding to see what Altair inferred, then fix the column's dtype or pass type="quantitative" explicitly.
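
A sketch of one fix: convert the column to numbers in Polars before handing it to Altair, so a quantitative axis is inferred on its own (the output file name is arbitrary):

import altair as alt
import polars as pl

data = pl.DataFrame({
    "category": ["A", "B", "C", "D"],
    "value": ["10", "40", "25", "60"],
}).with_columns(pl.col("value").cast(pl.Int64)).to_pandas()

chart = alt.Chart(data).mark_bar().encode(
    x="category:N",
    y="value:Q",  # real numbers, quantitative axis
)
chart.save("quanttype_fixed.html")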

Color a chart by a numeric category

Run the script and open the saved chart. Does the color scale appear as a continuous gradient, or as a discrete set of colors?

import altair as alt
import polars as pl

# Simulated dataset where "temperature" was read from a CSV as strings.
data = pl.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 20, 15, 30, 25],
    "temperature": ["1.2", "2.4", "3.5", "4.1", "5.0"],
}).to_pandas()

chart = alt.Chart(data).mark_point().encode(
    x="x:Q",
    y="y:Q",
    color="temperature",
)
chart.save("colorscale.html")
Explanation

The bug is that the color column was read as a string (e.g., "3.5") rather than a float, so Altair applies a nominal color scale and the scatter plot shows a discrete legend with arbitrary colors instead of a continuous gradient.

Shows: how data types in the source DataFrame determine Altair's default encoding choices.

To find it: check the dtype of the temperature column (data["temperature"].dtype after the conversion, or the Polars schema before it). If it is a string type rather than a float, cast it with .cast(pl.Float64) before charting.
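
A sketch of the cast, done in Polars before the conversion to pandas (output file name arbitrary):

import altair as alt
import polars as pl

data = pl.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 20, 15, 30, 25],
    "temperature": ["1.2", "2.4", "3.5", "4.1", "5.0"],
}).with_columns(pl.col("temperature").cast(pl.Float64)).to_pandas()

chart = alt.Chart(data).mark_point().encode(
    x="x:Q",
    y="y:Q",
    color="temperature:Q",  # numeric column, continuous gradient
)
chart.save("colorscale_fixed.html")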

Plot measurements collected over time

Open the saved chart in a browser. Are the months arranged in chronological order along the x-axis, or in a different order?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"],
    "sales": [120, 95, 140, 110, 160],
}).to_pandas()

chart = alt.Chart(data).mark_line(point=True).encode(
    x=alt.X("month:N"),
    y=alt.Y("sales:Q"),
)
chart.save("temporal.html")
Explanation

The bug is encoding the x-axis date column as type="nominal" instead of type="temporal". A nominal axis sorts the values as text and spaces them evenly, so the early-2024 months appear before the late-2023 months and the line zigzags across the axis instead of progressing chronologically.

Shows: the difference between nominal and temporal encoding in Altair and how to verify axis ordering.

To find it: open the chart and check whether the x-axis dates are in chronological order. If the months appear in string-sorted order, for example January 2024 before November 2023, check the alt.X(...) call for type=.
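
A sketch of the temporal encoding, reusing the data from the script above (output file name arbitrary):

chart = alt.Chart(data).mark_line(point=True).encode(
    x=alt.X("month:T"),  # parse and order the values as dates
    y=alt.Y("sales:Q"),
)
chart.save("temporal_fixed.html")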

Filter the data shown in a chart

Open the saved chart in a browser. Does it show the categories whose count is 100 or more, all of them, or none at all?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "category": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"],
    "count": [50, 120, 30, 200, 80, 15, 175, 60, 90, 140, 25],
}).to_pandas()

# Show only categories with count >= 100.

chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("category:N"),
    y=alt.Y("count:Q"),
).transform_filter(
    alt.datum.Count >= 100
)
chart.save("filterfield.html")
Explanation

The bug is that the transform_filter predicate references alt.datum.Count, but the column is named count. Altair and Vega-Lite do not validate field names in transforms, so no error is raised; the predicate compares an undefined field against 100, which is false for every row, and the chart renders with no bars at all.

Shows: how to debug Altair transforms by inspecting the chart's JSON specification and checking field names match the data source.

To find it: call .to_dict() on the chart and look at the filter expression inside the transform section. Compare the field name character by character to the actual column name in the DataFrame; a one-character (or one-case) difference silently breaks the filter.
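
A sketch of the check and of the corrected predicate, continuing from the script above (output file name arbitrary):

print(chart.to_dict()["transform"])  # shows the field name the filter actually uses

fixed = alt.Chart(data).mark_bar().encode(
    x=alt.X("category:N"),
    y=alt.Y("count:Q"),
).transform_filter(
    alt.datum.count >= 100  # field name now matches the column exactly
)
fixed.save("filterfield_fixed.html")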

Show column values in a chart tooltip

Open the saved chart in a browser and hover over a point. Does the Sales.Region field in the tooltip show a value?

import altair as alt
import polars as pl

data = pl.DataFrame({
    "Product Name": ["Widget", "Gadget", "Doohickey"],
    "Sales Region": ["North", "South", "East"],
    "revenue": [1200, 800, 950],
}).to_pandas()

chart = alt.Chart(data).mark_point().encode(
    x=alt.X("revenue:Q"),
    tooltip=["Product Name:N", "Sales Region", "revenue:Q"],
)
chart.save("tooltip.html")
Explanation

The bug is that the field name "Sales.Region" contains a dot. Vega-Lite interprets an unescaped dot as nested field access (it looks for datum.Sales.Region), so the tooltip shows no value for that field even though the data contains values. A space, as in "Product Name", is harmless; a dot is not.

Shows: how Vega-Lite treats dots and brackets in field names as nested lookups, and why renaming the column (or escaping the dot in the field reference) is the reliable fix.

To find it: hover over a point in the chart and note which tooltip field is empty. Then print data.columns to find the exact column name. If the name contains a dot, Vega-Lite looks for a nested field and never finds the flat column.
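
A sketch of the rename fix, continuing from the script above; renaming is simpler than escaping the dot (the new column name and output file name are arbitrary):

plain = data.rename(columns={"Sales.Region": "sales_region"})

chart = alt.Chart(plain).mark_point().encode(
    x=alt.X("revenue:Q"),
    tooltip=[
        "Product Name:N",
        alt.Tooltip("sales_region:N", title="Sales Region"),
        "revenue:Q",
    ],
)
chart.save("tooltip_fixed.html")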

Pass a Polars dataframe to a chart library

Run the script and open the saved chart in a browser. Does the chart show any data points?

import altair as alt
import polars as pl

df = pl.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 25, 15, 30, 20],
})

chart = alt.Chart(df).mark_line().encode(
    x="x:Q",
    y="y:Q",
)
chart.save("polarsinaltair.html")
Explanation

The bug is passing the Polars DataFrame directly to alt.Chart(). Recent Altair releases accept Polars DataFrames natively, but older versions only handle pandas DataFrames, alt.Data, or URLs, so on those versions the chart comes out empty or the save step fails.

Shows: which data formats Altair accepts natively and how to convert between Polars and the formats Altair supports.

To find it: print type(df) to confirm it is a Polars DataFrame. If the chart is blank or saving raises an error, your Altair version did not receive a data format it supports; convert with df.to_pandas() and try again.
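
A sketch of the portable conversion, continuing from the script above (output file name arbitrary):

chart = alt.Chart(df.to_pandas()).mark_line().encode(
    x="x:Q",
    y="y:Q",
)
chart.save("polarsinaltair_fixed.html")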

Check whether two measurements are related

Run the script and note the correlation value. Then examine how metric_a and metric_b are constructed. Should they really be perfectly correlated?

import polars as pl

data = pl.DataFrame({
    "base": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

data = data.with_columns(
    (pl.col("base") * 2.5).alias("metric_a"),
    (pl.col("base") * 2.5 + 10.0).alias("metric_b"),
)

corr = data.select(pl.corr("metric_a", "metric_b")).item()
print(f"Correlation: {corr}")
Explanation

The bug is that both columns are deterministic linear transformations of the same base column, so the correlation is exactly 1.0 even though the two "metrics" carry no independent information.

Shows: how to audit column provenance in a pipeline and use scatter plots to sanity-check correlation claims.

To find it: print the expression used to create each column side by side. If both are derived from the same base column in the same step, they are the same data up to scale and shift. A scatter plot of metric_a vs. metric_b that falls on a perfect straight line confirms it.
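
A quick way to see it, continuing from the script above and using Altair for the scatter plot (output file name arbitrary):

import altair as alt

chart = alt.Chart(data.to_pandas()).mark_point().encode(
    x="metric_a:Q",
    y="metric_b:Q",
)
chart.save("corrcheck.html")  # every point falls on one straight line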

Process a large file in pieces

Run the script on a large file and watch how memory usage changes as the script runs. Does memory stay roughly constant, or does it grow?

import polars as pl

CHUNK_SIZE = 2  # rows per chunk

reader = pl.read_csv_batched("chunkaccum.csv", batch_size=CHUNK_SIZE)

chunks = []
while (batch := reader.next_batches(1)) is not None:
    chunks.extend(batch)

result = pl.concat(chunks)
print(result["value"].sum())

chunkaccum.csv:
id,value
1,10
2,20
3,30
4,40
5,50
Explanation

The bug is accumulating all chunks in memory before concatenating rather than processing each chunk and writing results incrementally, so the pipeline runs out of memory on large files.

Shows: streaming versus batch processing patterns and how to use Polars' scan_csv with lazy evaluation to avoid loading the full file.

To find it: add import tracemalloc; tracemalloc.start() at the top and print tracemalloc.get_traced_memory() after processing each chunk. If the peak figure grows linearly with the number of chunks, the chunks are being accumulated in memory rather than discarded after each step.
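
A sketch of the lazy alternative, assuming only the column sum is needed; scan_csv defers the read, and Polars' streaming engine (selected at .collect() time, with the exact option depending on your version) can process the file in batches instead of holding it all in memory:

import polars as pl

total = (
    pl.scan_csv("chunkaccum.csv")
    .select(pl.col("value").sum())
    .collect()
    .item()
)
print(total)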

Facet a chart by year of collection

Open the saved chart in a browser. How many facet panels does it show? Inspect the type of the year column in the DataFrame.

import altair as alt
import polars as pl

data = pl.DataFrame({
    "region": ["North", "South", "East", "North", "South", "East"],
    "year": [2022.0, 2022.0, 2022.0, 2023.0, 2023.0, 2023.0],
    "sales": [100, 150, 120, 130, 160, 140],
}).to_pandas()

chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("region:N"),
    y=alt.Y("sales:Q"),
).facet(
    facet="year:N",
    columns=2,
)
chart.save("floatyear.html")
Explanation

The bug is that the year column contains floats (2022.0, 2023.0000001, and so on) because it was computed upstream and inferred as Float64. Altair's facet treats each distinct float as its own nominal value, so the tiny floating-point difference in one 2023 entry produces an extra, oddly labelled panel.

Shows: how to cast integer-like columns to pl.Int32 before charting and how to verify facet behavior with a small sample.

To find it: print df["year"].dtype and df["year"].unique(). If the type is Float64 and the values show .0 suffixes, cast the column with .cast(pl.Int32) before passing to Altair.

Run a notebook after restarting the kernel

Run this script from top to bottom. Does it raise an error? Which line causes the error?

import polars as pl


raw = pl.read_csv("filterboundary.csv")

summary = clean.group_by("product").agg(pl.col("value").mean())  # noqa: F821

clean = raw.filter(pl.col("value") > 40)

print(summary)
Explanation

The bug is that the notebook's cells were executed out of order: clean is used before the line that defines it. Interactively this appeared to work because a stale clean was still sitting in memory from an earlier run, but running the cells top to bottom as a script raises a NameError.

Shows: why notebooks must be tested by restarting the kernel and running all cells in order, and how to structure pipelines so each step depends only on its explicit inputs.

To find it: run the script as a plain Python file from a fresh process (not inside a notebook). If it raises an error that never appeared during interactive execution, a cell was run out of order. Add print(df.columns) at the start of each step to confirm each step's input matches what the previous step produced.
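
For reference, a minimal reordering that runs cleanly from top to bottom:

import polars as pl

raw = pl.read_csv("filterboundary.csv")
clean = raw.filter(pl.col("value") > 40)  # define clean before it is used
summary = clean.group_by("product").agg(pl.col("value").mean())
print(summary)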