Polars
Add a computed column to a DataFrame
Run the script and look at the columns in the result. How many columns does it have? How many columns did the original DataFrame have?
import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "score": [88, 72, 95],
})
result = df.select([
    (pl.col("score") * 1.1).alias("adjusted"),
])
print(result)
Show explanation
The bug is using .select() to add a new column. .select() returns only the
columns listed in the call and drops all others, so id and name are lost.
Shows: the difference between .select() (choose columns) and .with_columns()
(add or replace columns while keeping the rest).
To find it: print result.columns and compare to df.columns. If the result
has fewer columns than the original, .select() dropped them. Find the .select()
call in the pipeline and replace it with .with_columns().
Filter and select rows from a dataset
Run the script and look at what is printed. Is the output a table of data, or
something else? What type does type(result) report?
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [88, 72, 95, 61],
})
result = (
    df.lazy()
    .filter(pl.col("score") >= 80)
    .select(["name", "score"])
)
print(type(result))
print(result)
Show explanation
The bug is calling .lazy() to start a lazy pipeline but never calling .collect()
at the end, so the result is a LazyFrame (a query plan) rather than a DataFrame.
The filter and select have not executed.
Shows: the difference between eager and lazy evaluation in Polars and
when .collect() is required.
To find it: add print(type(result)) after the pipeline. Seeing LazyFrame in the
printed type instead of DataFrame confirms the pipeline never executed. Add
.collect() at the end of the chain.
Find rows with missing scores
Run the script. How many rows does it print? How many rows contain a missing score?
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [88, None, 95, None],
})
result = df.filter(pl.col("score") == None) # noqa: E711
print(f"rows with missing score: {len(result)}")
print(result)
Show explanation
The bug is using == None to test for missing values. In Polars, any comparison
involving null yields null rather than True or False, so the filter mask is
null for every row; null mask entries are treated as not matching, and the
filter keeps nothing.
Shows: null semantics in Polars and how to use .is_null() to
correctly select rows where a value is missing.
To find it: print df["score"].is_null().sum() to count actual nulls, then
compare to len(result) after the filter. If the filter returns 0 rows but the
null count is positive, the comparison == None is the problem — replace it with
.is_null().
Convert float scores to integers
Run the script and compare the values in the score column to the expected rounded
values printed below them. What happened to 88.7 and 95.6?
import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [88.7, 72.3, 95.6, 61.9],
})
result = df.with_columns(pl.col("score").cast(pl.Int64))
print(result)
print("expected if rounded:", [round(x) for x in [88.7, 72.3, 95.6, 61.9]])
Show explanation
The bug is that cast(pl.Int64) truncates toward zero rather than rounding, so
88.7 becomes 88 and 95.6 becomes 95 instead of 89 and 96. No error is raised.
Shows: that integer casting in Polars is a truncation operation, and
how to use .round(0).cast(pl.Int64) when rounding behavior is
intended.
To find it: print a side-by-side comparison:
df.select(
    pl.col("score"),
    pl.col("score").round(0).cast(pl.Int64).alias("rounded"),
    pl.col("score").cast(pl.Int64).alias("truncated"),
)
For a value like 88.7, the rounded column shows 89 and the truncated column
shows 88. The mismatch reveals that .cast() alone is discarding the fractional
part rather than rounding.
Add a department average salary to each employee row
Run the script. How many rows does the output have? How many rows did you expect?
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "dept": ["eng", "eng", "hr", "hr", "eng"],
    "salary": [90000, 85000, 70000, 75000, 92000],
})
result = df.group_by("dept").agg(pl.col("salary").mean().alias("dept_mean"))
print(f"input rows: {len(df)}, output rows: {len(result)}")
print(result)
Show explanation
The bug is using group_by().agg() when the goal is to add a per-row column showing
each employee's department mean. group_by().agg() collapses the DataFrame to one row
per group.
Shows: the difference between aggregation (which reduces rows) and
window functions (which compute a value per row), and how to use
.over() inside .with_columns() to attach group statistics to every
row.
To find it: print len(result) and compare to len(df). If the result has one
row per department instead of one per employee, group_by().agg() collapsed the
rows. Replace the group_by().agg() pipeline with
.with_columns(pl.col("salary").mean().over("dept").alias("dept_mean")).
Join orders to customer records
Run the script and compare the number of input orders to the number of rows in the result. Which order is missing, and why?
import polars as pl
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 20, 30, 99],
    "amount": [50.0, 75.0, 30.0, 20.0],
})
customers = pl.DataFrame({
    "customer_id": [10, 20, 30],
    "name": ["Alice", "Bob", "Carol"],
})
result = orders.join(customers, on="customer_id")
print(f"orders in input : {len(orders)}")
print(f"rows after join : {len(result)}")
print(result)
Show explanation
The bug is using the default join, which is inner. Order 4 has a customer_id of 99
that does not appear in the customers table, so it is silently dropped.
Shows: the difference between inner and left joins, how to specify
how="left" to retain all rows from the left table, and how to verify
row counts before and after a join.
To find it: print len(orders) before the join and len(result) after. If they
differ, rows were dropped. To identify which, print
orders["customer_id"].is_in(customers["customer_id"]).value_counts(). Any
False entries name the customer IDs that have no match and will be lost in an
inner join.
Add a total price column to an order table
Run the script and compare the column names and values to the original DataFrame. Which column was overwritten, and which column was supposed to be added?
import polars as pl
df = pl.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [3, 1, 2],
})
result = df.with_columns(pl.col("price") * pl.col("qty"))
print(result)
Show explanation
The bug is omitting .alias() on the expression. Without an explicit name, Polars
assigns the column the name of the left operand ("price"), which silently replaces
the original price column with the product values instead of adding a new total
column.
Shows: how Polars names unnamed expressions and why .alias() is
needed whenever the result should have a different name from its
inputs.
To find it: print result.columns and result.head(). If the price column now
contains the product price * qty rather than unit prices, the expression was
assigned back to price instead of creating a new column. Add .alias("total")
to the expression inside .with_columns().
Parse dates from a CSV file
Run the script and look at the parsed date column. Is "03/04/2024" shown as
April 3rd or March 4th? What date was intended?
import polars as pl
df = pl.DataFrame({
    "event": ["conference", "deadline", "review"],
    "date_str": ["03/04/2024", "07/08/2024", "11/12/2024"],
})
result = df.with_columns(
    pl.col("date_str").str.to_date(format="%m/%d/%Y").alias("date")
)
print(result)
Show explanation
The bug is a mismatch between the data order (day/month/year) and the format string
(%m/%d/%Y, month/day/year). Because all day and month values are 12 or below,
every date parses without error, but each one is wrong.
Shows: how ambiguous numeric date formats cause silent data corruption, and why checking a few parsed values against known inputs is necessary to confirm the format string is correct.
To find it: print the parsed date column alongside the original string values,
and add a test row whose first field is greater than 12, such as "15/04/2024".
Only one reading of that value is valid: with format %m/%d/%Y it fails to parse
(there is no month 15), while %d/%m/%Y parses it as 15 April. The failure
immediately reveals that %d and %m are swapped in the format string.
Split a tags column into one row per tag
Run the script and read the error message. What type does Polars report for the
tags column?
import polars as pl
df = pl.DataFrame({
    "id": [1, 2],
    "tags": ["python,data,science", "web,api"],
})
result = df.explode("tags")
print(result)
Show explanation
The bug is calling .explode() on a column that contains plain strings rather than
lists. Polars raises an InvalidOperationError because .explode() requires a
list-type column.
Shows: how to convert a delimited string column into a list column
with .str.split() before calling .explode(), and how to check
column types with .schema before applying list operations.
To find it: print df.schema and check the type of the tags column. Seeing
String instead of List(String) explains the InvalidOperationError:
.explode() requires a list-type column. Add
.with_columns(pl.col("tags").str.split(",")) before the .explode() call.
Compute a discounted price and order total in one step
Run the script and read the error message. Which column is reported as not found? Is that column present in the original DataFrame?
import polars as pl
df = pl.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "tax_rate": [0.1, 0.2, 0.1],
})
result = df.with_columns([
    (pl.col("price") * 0.9).alias("discounted_price"),
    (pl.col("discounted_price") * (1 + pl.col("tax_rate"))).alias("total"),
])
print(result)
Show explanation
The bug is referencing discounted_price in the same .with_columns() call where
it is first computed. All expressions in a single .with_columns() call are
evaluated against the original DataFrame, so discounted_price does not yet exist
when total is computed, and Polars raises a ColumnNotFoundError.
Shows: how Polars evaluates expressions in parallel within one call
and how to chain two separate .with_columns() calls when one result
depends on another.
To find it: read the ColumnNotFoundError, which names discounted_price as
missing. Search the code for where discounted_price is created and where it is
referenced. If both appear in the same .with_columns() call, split them: put
the expression that creates discounted_price in one .with_columns() call and
the expression that uses it in a second chained call.