Summarizing

Goals

What "Group By" Means

What does it mean to group data, and what kinds of questions does it answer?

Why would you want both the total and the mean for the same group?

Computing Totals and Means

Show me total and mean contract value by department for each year, sorted by department name.

or

Using Polars, read contracts.csv. Extract the year from the contract_date column. Compute the total and mean contract_value by department_en and year. Sort by department name, then year. Print the first twenty rows.

i
import polars as pl

df = pl.read_csv("contracts.csv")
df = df.with_columns(
    pl.col("contract_date").str.slice(0, 4).cast(pl.Int64).alias("year")
)

summary = (
    df.group_by(["department_en", "year"])
    .agg([
        pl.col("contract_value").sum().alias("total"),
        pl.col("contract_value").mean().alias("mean"),
        pl.col("contract_value").count().alias("n"),
    ])
    .sort(["department_en", "year"])
)
print(summary.head(20))

Adding Confidence Intervals

Add a 95% confidence interval to each group mean.

or

Extend the previous code to also compute a 95% confidence interval for the mean contract value in each department-year group. Add columns ci_low and ci_high to the output.

i
import polars as pl

# T_CRIT = 1.96 approximates the 97.5th percentile of the t-distribution
# for large samples (n > 30). Suitable for the many-contract groups in this dataset.
T_CRIT = 1.96

df = pl.read_csv("contracts.csv")
df = df.with_columns(
    pl.col("contract_date").str.slice(0, 4).cast(pl.Int64).alias("year")
)

summary = (
    df.group_by(["department_en", "year"])
    .agg([
        pl.col("contract_value").sum().alias("total"),
        pl.col("contract_value").mean().alias("mean"),
        pl.col("contract_value").count().alias("n"),
        pl.col("contract_value").std().alias("std"),
    ])
    .sort(["department_en", "year"])
    .with_columns([
        (pl.col("mean") - T_CRIT * pl.col("std") / pl.col("n").sqrt()).alias("ci_low"),
        (pl.col("mean") + T_CRIT * pl.col("std") / pl.col("n").sqrt()).alias("ci_high"),
    ])
)
print(summary.head(20))

Reading a Confidence Interval

What does a 95% confidence interval actually mean?

Checking Against a Published Figure

The government publishes the total value of contracts. Let us compare our sum to that published number.

or

Compute the total contract_value across all departments for the most recent year in the dataset.

Check Understanding

You compute a 95% confidence interval of ($480K, $620K) for mean contract size in one department. A classmate says "there is a 95% chance the true mean is between $480K and $620K." What is wrong with this statement?

The true mean is a fixed value: it either is or is not in the interval, with probability 1 or 0. The 95% refers to the procedure: if you built many intervals from different samples, 95% of them would contain the true mean. This particular interval either contains the true mean or does not; we just do not know which.

Two departments have mean contract values of $350K and $510K, with 95% confidence intervals of ($280K, $420K) and ($450K, $570K). Do the intervals overlap? What can you conclude?

The intervals do not overlap: the first ends at $420K and the second starts at $450K. This suggests the difference is real: large-sample departments with non-overlapping intervals are probably genuinely different in typical contract size. A formal hypothesis test would confirm this more rigorously, but non-overlapping intervals are good evidence that the departments procure at different scales.

Your computed total for one department is $2.8 billion, but the Public Accounts show $1.4 billion. List two things you would check before concluding the data is wrong.

First, check whether the years match: the dataset might include a different fiscal year than the Public Accounts reference, or might include amendments and modifications not in the original totals. Second, check whether the LLM summed all rows or only a subset: it may have failed to filter by department correctly and included other departments. Ask the LLM to print the number of rows it summed and the range of years included.

A department has only two contracts in one year, with values of $15K and $4.8M. The confidence interval for the mean is ($-3.2M, $8.0M). What does this tell you?

The interval is enormous and includes negative values, which is meaningless for a contract value. Two observations are nowhere near enough to estimate a mean reliably. This department-year combination should either be excluded from comparisons or clearly flagged as having too few observations to interpret.

Exercises

Largest Vendors

Find the five vendors with the highest total contract value across all departments and years. Are any surprising?

Year-Over-Year Change

For one department of your choice, compute the total contract value for each year in the dataset. Ask the LLM to compute the percentage change from one year to the next. Plot the result as a bar chart.

Wide Intervals

Find all department-year combinations where the 95% confidence interval for the mean is wider than $500 000. What do those groups have in common?

Plot the Intervals

Ask the LLM to create a chart showing the mean contract value and 95% confidence interval for each department in the most recent year. Sort the departments from highest to lowest mean.

Sum Before Group

The following code is meant to compute the mean contract value per department per year, but the output has only one row instead of one row per department. Work with an LLM to explain what went wrong and fix the pipeline.

i
import polars as pl

df = pl.read_csv("contracts.csv")
df = df.with_columns(
    pl.col("contract_date").str.slice(0, 4).cast(pl.Int64).alias("year")
)

summary = (
    df.select(pl.col("contract_value").sum())
    .group_by("department_en")
    .agg(pl.col("contract_value").mean().alias("mean"))
    .sort("department_en")
)
print(summary)
How do you know the fix worked?

The corrected output should have one row per unique combination of department and year. Count the distinct department names in the result and compare to the number of departments in the original file.

Month Instead of Year

The following code groups contracts by year, but the year column in the output contains values between 1 and 12 instead of four-digit years. Work with an LLM to find the error and fix it.

i
import polars as pl

df = pl.read_csv("contracts.csv")
df = df.with_columns(
    pl.col("contract_date").str.to_date("%Y-%m-%d").dt.month().alias("year")
)

summary = (
    df.group_by(["department_en", "year"])
    .agg([
        pl.col("contract_value").sum().alias("total"),
        pl.col("contract_value").mean().alias("mean"),
    ])
    .sort(["department_en", "year"])
)
print(summary.head(20))
How do you know the fix worked?

Print the distinct values in the year column of the summary. They should be four-digit years matching the range of dates in contracts.csv.

Flag Unreliable Intervals

The following code computes 95% confidence intervals for the mean contract value per department per year. Work with an LLM to extend it so it also adds a column ci_reliable that is True when the group has five or more contracts and False otherwise, then prints only the unreliable rows.

i
import polars as pl

T_CRIT = 1.96

df = pl.read_csv("contracts.csv")
df = df.with_columns(
    pl.col("contract_date").str.slice(0, 4).cast(pl.Int64).alias("year")
)

summary = (
    df.group_by(["department_en", "year"])
    .agg([
        pl.col("contract_value").mean().alias("mean"),
        pl.col("contract_value").count().alias("n"),
        pl.col("contract_value").std().alias("std"),
    ])
    .sort(["department_en", "year"])
    .with_columns([
        (pl.col("mean") - T_CRIT * pl.col("std") / pl.col("n").sqrt()).alias("ci_low"),
        (pl.col("mean") + T_CRIT * pl.col("std") / pl.col("n").sqrt()).alias("ci_high"),
    ])
)
print(summary.head(20))
# TODO: add a column "ci_reliable" that is True when n >= 5 and False otherwise,
# then print only the rows where ci_reliable is False
How do you know the addition is correct?

Find one department-year group with a very small n in the output of add_ci.py and confirm it appears in the ci_reliable == False rows. Also check that no group with n >= 5 appears there.

Wrong Grouping Column

The following code is meant to find the top 10 vendors by total contract value, but the output shows department names instead of vendor names. Work with an LLM to identify the wrong column and fix it.

i
import polars as pl

df = pl.read_csv("contracts.csv")

top_vendors = (
    df.group_by("department_en")
    .agg(pl.col("contract_value").sum().alias("total_value"))
    .sort("total_value", descending=True)
    .head(10)
)
print("Top 10 vendors by total contract value:")
print(top_vendors)
How do you know the fix worked?

The output should contain company and supplier names, not the names of government departments. Spot-check one of the top vendors by searching for their name in the contracts CSV and summing their rows by hand.