Checking Your Work
Goals
- Verify LLM output by checking it against known values and common sense.
- Interpret a hypothesis test result without misreading it as certainty.
- Recognize that a confident-sounding LLM answer is not necessarily a correct one.
The Unit Error
What happens when the LLM silently uses the wrong units?
- The dataset for this session comes from
the Canada Energy Regulator and Natural Resources Canada [cer2025]
- Download two tables from Canada Energy Regulator:
- Provincial natural gas production reported in thousands of cubic metres (10³ m³)
- Provincial natural gas consumption reported in millions of cubic metres (10⁶ m³)
- Both tables cover the same years and provinces but use different prefixes
- When you ask the LLM to compare production to consumption, it reads both columns and divides them
- The code runs without error and returns a number
- That number is off by a factor of 1 000, because 1 million m³ = 1 000 thousand m³
- Nothing in the code signals that anything is wrong
import polars as pl
production = pl.read_csv("gas_production_tcm.csv") # units: 10³ m³ (thousands)
consumption = pl.read_csv("gas_consumption_mcm.csv") # units: 10⁶ m³ (millions)
# BUG: treats 10³ m³ and 10⁶ m³ as the same unit.
# The ratio will be off by a factor of 1 000.
ratio = production["production"].mean() / consumption["consumption"].mean()
print(f"Production-to-consumption ratio: {ratio:.4f}")
print("(This number is wrong if the units have not been converted.)")
- Run the cell and look at the ratio
- Alberta produces far more gas than it consumes: a ratio well above 1 for a major exporting province is plausible
- If the ratio is in the thousands or in the thousandths, there is a unit error
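One lightweight defense is to encode the expectation as a plausibility check. A minimal sketch with invented single-row numbers (a production value in 10³ m³ and a consumption value in 10⁶ m³ whose true ratio is about 8):

```python
# Hypothetical values: production in 10³ m³, consumption in 10⁶ m³.
production_tcm = 163_000.0   # = 163 × 10⁶ m³
consumption_mcm = 20.0       # = 20 × 10⁶ m³

def plausible(ratio, lo=1.0, hi=100.0):
    """An exporting province's ratio should be well above 1 but not absurdly high."""
    return lo < ratio < hi

raw = production_tcm / consumption_mcm            # mixes 10³ m³ with 10⁶ m³
fixed = (production_tcm / 1_000) / consumption_mcm  # both in 10⁶ m³

print(f"raw ratio:   {raw:.1f}, plausible: {plausible(raw)}")    # 8150.0, False
print(f"fixed ratio: {fixed:.2f}, plausible: {plausible(fixed)}") # 8.15, True
```

The unconverted ratio fails the check by a factor of about 1 000, which is exactly the kind of silent error the code above would otherwise let through.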
Checking Without Reading Code
What strategies let me catch this error without reading the code?
- Check a single known value by hand
- The Canada Energy Regulator publishes summary statistics; find Alberta's production for one recent year
- If the CER says Alberta produced approximately 163 billion m³ in 2022, and your file shows 163 000 for that row, the column is in millions of m³ (163 000 × 10⁶ = 163 × 10⁹ m³)
- Confirm the units before computing any ratio
- Ask the LLM what units it assumed
- Paste the code into a new prompt: "What units did you assume for each column in this code?"
- A well-prompted LLM will state the units it assumed for each column; a poorly-prompted one will guess
- Compare to a published figure
- If your computed ratio is roughly 1 000 times larger or smaller than the published figure suggests, the units do not match
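The hand check can itself be written as a few lines of code. This sketch uses the hypothetical numbers from above (CER reports about 163 × 10⁹ m³ for Alberta; the file shows 163 000 for that row) and asks which unit prefix makes the two consistent:

```python
# Which unit makes the file value consistent with the published figure?
published_m3 = 163e9      # CER's stated total, in plain m³
file_value = 163_000.0    # the number in your downloaded file

candidates = {
    "10³ m³ (thousands)": 1e3,
    "10⁶ m³ (millions)": 1e6,
    "10⁹ m³ (billions)": 1e9,
}
matches = []
for label, factor in candidates.items():
    implied = file_value * factor  # what the row would mean in plain m³
    if abs(implied - published_m3) / published_m3 < 0.05:  # within 5%
        matches.append(label)
    print(f"{label}: implied {implied:.2e} m³")
print(f"Unit consistent with the published figure: {matches}")
```

Only one candidate should survive; if none does, the column may not contain what you think it contains.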
The Correct Comparison
Fix the code so both columns use the same unit before comparing.
or
The production column is in 10³ m³ (thousands of cubic metres) and the consumption column is in 10⁶ m³ (millions of cubic metres). Convert the production column to 10⁶ m³ by dividing by 1000 before computing the ratio.
- The LLM will produce something like:
import polars as pl
# 1 × 10⁶ m³ = 1 000 × 10³ m³
# Divide production (10³ m³) by 1 000 to convert to 10⁶ m³ before comparing.
TCM_TO_MCM = 1 / 1_000.0
production = pl.read_csv("gas_production_tcm.csv")
consumption = pl.read_csv("gas_consumption_mcm.csv")
production_mcm = production.with_columns(
(pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)
ratio = production_mcm["production_mcm"].mean() / consumption["consumption"].mean()
print(f"Production (mean, 10⁶ m³): {production_mcm['production_mcm'].mean():.1f}")
print(f"Consumption (mean, 10⁶ m³): {consumption['consumption'].mean():.1f}")
print(f"Ratio (production / consumption, same units): {ratio:.2f}")
- Run the cell and check: does the result match the by-hand estimate from the previous section?
Running a Hypothesis Test
Run a t-test to check whether natural gas production differs significantly between Alberta and British Columbia.
or
Using Polars and scipy.stats, read gas_production.csv, extract the production values for Alberta and BC, and run a Welch's t-test. Print the t-statistic and the p-value.
- The LLM will produce something like:
import polars as pl
from scipy import stats
# Convert to consistent units before comparing (10³ m³ → 10⁶ m³)
TCM_TO_MCM = 1 / 1_000.0
production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
(pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)
alberta = production.filter(pl.col("province") == "AB")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "BC")["production_mcm"].to_numpy()
result = stats.ttest_ind(alberta, bc, equal_var=False)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
if result.pvalue < 0.05:
print("Result: statistically significant at the 0.05 level.")
else:
print("Result: not statistically significant at the 0.05 level.")
- Run the cell and look at the p-value
What a P-Value Means
What does it actually mean when the p-value is 0.003?
- A p-value is the probability of seeing a difference at least this large if there were truly no difference between the two groups
- p = 0.003 means that if Alberta and BC truly had identical production levels, only 0.3% of random samples would show a difference this large or larger
- It does not mean "there is a 99.7% chance the difference is real"
- Rejecting the null hypothesis (the assumption of no difference) at p < 0.05
means "this result is surprising enough that we should take it seriously"
- It does not mean the difference is large enough to matter practically
- It does not mean the measurement was correct
- A unit error that inflates one column by 1 000× will produce a very significant p-value
- The difference is real, but you measured the wrong thing
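The definition above can be made concrete with a small permutation simulation on invented numbers: if the two groups were truly exchangeable, how often would a random relabelling produce a gap at least as large as the observed one?

```python
import random
import statistics

random.seed(1)

# Invented yearly values for two groups, in the same unit.
group_a = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8]
group_b = [7.2, 7.5, 7.0, 7.4, 7.1, 7.3]
observed = statistics.mean(group_a) - statistics.mean(group_b)

# Permutation view of the p-value: shuffle the labels many times and count
# how often the relabelled gap is at least as large as the observed gap.
pooled = group_a + group_b
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:6]) - statistics.mean(pooled[6:])
    if abs(diff) >= abs(observed):
        extreme += 1

print(f"Observed gap: {observed:.3f}")
print(f"Fraction of relabellings at least as extreme: {extreme / trials:.4f}")
```

That fraction is the simulated analogue of the p-value: a small number means the observed gap rarely arises by relabelling alone, not that the gap is certainly real or correctly measured.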
When Significance Misleads
Why can a statistically significant result be meaningless?
- Statistical significance tells you the signal is large relative to the noise in your sample
- It says nothing about whether you measured the right quantity
- A perfectly measured wrong quantity can be highly significant
- The unit error in this session is an example
- After the error, production appears 1 000 times larger relative to consumption than it should
- That difference is highly statistically significant, and entirely artifactual
- Before interpreting a significant p-value, confirm the inputs are correct
- Do the units match? Do the columns contain what you think they contain?
- Has someone checked a known row by hand?
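The point can be demonstrated directly. With invented values for two provinces that are genuinely similar, introducing a 1 000× unit mismatch into one group turns a clearly non-significant comparison into an overwhelmingly "significant" one:

```python
from scipy import stats

# Invented values: two provinces with very similar production, same true unit.
ab = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8]
bc = [8.2, 7.7, 8.1, 8.0, 7.9, 8.3]

# Correct comparison: both groups in the same unit.
p_same = stats.ttest_ind(ab, bc, equal_var=False).pvalue

# Simulated unit error: bc accidentally 1 000 times smaller, as if its rows
# were reported in 10⁶ m³ while ab stayed in 10³ m³.
bc_wrong = [x / 1_000 for x in bc]
p_wrong = stats.ttest_ind(ab, bc_wrong, equal_var=False).pvalue

print(f"Same units:       p = {p_same:.3f}")
print(f"Mismatched units: p = {p_wrong:.2e}")
```

The second p-value is tiny and entirely artifactual: the test faithfully reports that the numbers differ, but the numbers measure the wrong thing.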
Check Understanding
The LLM produces a production-to-consumption ratio of 0.001 for Alberta. You expected a value around 5 (Alberta exports far more than it consumes). What unit conversion did the LLM get backwards, and what is the correct ratio?
1 million m³ = 1 000 thousand m³, so each unit of the consumption column represents 1 000 times more gas than each unit of the production column. Left completely unconverted, the ratio would come out roughly 1 000 times too large; a ratio of 0.001, far too small, means the LLM applied the conversion in the wrong direction or to both columns at once (for example, dividing production by 1 000 and also multiplying consumption by 1 000). The correct procedure is to convert production from 10³ m³ to 10⁶ m³ by dividing by 1 000 (or consumption the other way, but not both), then divide production by consumption. If Alberta produced roughly 163 000 in 10³ m³ units (163 million m³) and consumed 20 million m³, the correct ratio is approximately 8.
A t-test returns p = 0.0004. A classmate says "this proves Alberta and BC have different production levels." Correct their interpretation in one or two sentences.
p = 0.0004 means that if Alberta and BC truly had identical production levels, only 0.04% of random samples would show a difference at least this large. It does not prove they are different: it says the observed difference is very unlikely under the null hypothesis, and it says nothing about whether the units were correct or whether the difference is large enough to be practically meaningful.
You check a single row by hand and find the LLM's computed ratio for 2020 is 0.0081, but your manual calculation gives 8.1. List the steps you would take before re-running the analysis.
First, confirm your manual calculation by re-deriving the 2020 values from the published CER tables and recomputing the ratio. Second, ask the LLM which columns it used and what units it assumed for each. Third, check whether the code reads the correct rows for 2020, as there may be summary rows or header rows the LLM included. Fourth, re-run the corrected code on only the 2020 row to confirm the output matches your manual result before running on the full dataset.
After fixing the unit error, you re-run the t-test comparing Alberta and BC production and get p = 0.21. The original (incorrect) analysis gave p < 0.001. What does this change tell you?
The original highly significant result was driven entirely by the unit error. Note that a t-test is scale-invariant: multiplying both provinces' values by the same constant leaves the t-statistic unchanged, so for the p-value to change, the error must have scaled the two provinces differently (for example, one province's rows in 10³ m³ and the other's in 10⁶ m³), making them appear up to 1 000 times more different than they actually are. After correcting the units, the year-to-year variation in production within each province is large enough that the difference between provinces is not statistically distinguishable from chance. This shows why significance alone is not enough: the first analysis produced a convincingly significant result that was measuring the wrong thing.
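A subtlety worth verifying here: Welch's t-test is scale-invariant, so a unit error only changes the p-value if it scales the two groups differently. A quick check on invented numbers:

```python
from scipy import stats

a = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8]
b = [7.2, 7.5, 7.0, 7.4, 7.1, 7.3]

p_base = stats.ttest_ind(a, b, equal_var=False).pvalue

# Rescale BOTH groups by 1 000: the t-statistic and p-value are unchanged.
p_both = stats.ttest_ind(
    [x * 1_000 for x in a], [x * 1_000 for x in b], equal_var=False
).pvalue

# Rescale only ONE group: the apparent difference explodes and p collapses.
p_one = stats.ttest_ind([x * 1_000 for x in a], b, equal_var=False).pvalue

print(f"both rescaled: p = {p_both:.6f} (same as {p_base:.6f})")
print(f"one rescaled:  p = {p_one:.2e}")
```

So a uniform conversion applied to the whole column cannot move the p-value; only a mismatch between the groups can.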
Exercises
Three-Province Comparison
Add Saskatchewan to the comparison. Ask the LLM to compute mean natural gas production for Alberta, BC, and Saskatchewan in the same unit. Verify your result against a published CER summary table.
Ask About Assumptions
Take any code from an earlier session and paste it into a new prompt: "What assumptions did you make about the data types and units of each column?" Compare the LLM's answer to the actual column types in the data.
Sensitivity Analysis
For the Alberta vs. BC comparison, vary the significance threshold from 0.01 to 0.10 in steps of 0.01. At which threshold does the corrected analysis first become significant? What does this tell you about using a single fixed threshold?
Find the Published Figure
The CER publishes annual energy market reports. Find their stated value for Alberta natural gas production in the most recent year and compare it to your computed total. Do they match?
Units Not Converted
The following code computes Alberta's production-to-consumption ratio, but the result is off by a factor of 1 000 because the two columns use different units. Work with an LLM to identify which column needs to be converted and fix the calculation.
import polars as pl
production = pl.read_csv("gas_production_tcm.csv") # units: 10³ m³ (thousands)
consumption = pl.read_csv("gas_consumption_mcm.csv") # units: 10⁶ m³ (millions)
ab_prod = production.filter(pl.col("province") == "AB")["production"].mean()
ab_cons = consumption.filter(pl.col("province") == "AB")["consumption"].mean()
ratio = ab_prod / ab_cons
print(f"Alberta mean annual production: {ab_prod:.1f}")
print(f"Alberta mean annual consumption: {ab_cons:.1f}")
print(f"Production-to-consumption ratio: {ratio:.4f}")
How do you know the fix worked?
Alberta is a major natural gas exporter, so its ratio should be well above 1, around 5 to 10. If the result is in the thousandths or in the thousands, the unit conversion is still wrong. Check one year's production value against a published CER figure to confirm the units.
One-Tailed Test When Two-Tailed Is Needed
The following code runs a t-test to compare Alberta and BC production,
but uses alternative="greater" rather than testing for any difference.
Work with an LLM to explain what this assumption means and correct the test.
import polars as pl
from scipy import stats
TCM_TO_MCM = 1 / 1_000.0
production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
(pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)
alberta = production.filter(pl.col("province") == "AB")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "BC")["production_mcm"].to_numpy()
result = stats.ttest_ind(alberta, bc, equal_var=False, alternative="greater")
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
if result.pvalue < 0.05:
print("Result: statistically significant at the 0.05 level.")
else:
print("Result: not statistically significant at the 0.05 level.")
How do you know the fix worked?
When the t-statistic is positive, the two-sided p-value is exactly twice the one-sided (alternative="greater") p-value.
Confirm this relationship holds by comparing the outputs of both versions.
Also describe in one sentence what assumption alternative="greater" makes that may not be justified.
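The two-sided/one-sided relationship can be checked directly on invented samples where the t-statistic is positive:

```python
from scipy import stats

# Invented samples where group a clearly exceeds group b (t > 0).
a = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8]
b = [7.2, 7.5, 7.0, 7.4, 7.1, 7.3]

two_sided = stats.ttest_ind(a, b, equal_var=False)
one_sided = stats.ttest_ind(a, b, equal_var=False, alternative="greater")

print(f"t-statistic: {two_sided.statistic:.3f}")
print(f"two-sided p: {two_sided.pvalue:.6f}")
print(f"one-sided p: {one_sided.pvalue:.6f}")
print(f"ratio:       {two_sided.pvalue / one_sided.pvalue:.3f}")
```

If the t-statistic were negative, the relationship would break down: the "greater" one-sided p-value would be close to 1 rather than half the two-sided value.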
Province Names Do Not Match
The following code runs a t-test but prints zero rows for both provinces,
producing a NaN t-statistic.
Work with an LLM to find why the filter returns no data and fix the province names.
import polars as pl
from scipy import stats
TCM_TO_MCM = 1 / 1_000.0
production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
(pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)
print(f"Distinct province values: {sorted(production['province'].unique().to_list())}")
alberta = production.filter(pl.col("province") == "Alberta")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "British Columbia")["production_mcm"].to_numpy()
print(f"Alberta rows: {len(alberta)}, BC rows: {len(bc)}")
result = stats.ttest_ind(alberta, bc, equal_var=False)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
How do you know the fix worked?
The code already prints the distinct province values in the data. After fixing, the Alberta and BC arrays should each have one value per year, and the t-statistic should be a real number.
Adding Effect Size
The following code runs a t-test and reports whether the result is statistically significant. Work with an LLM to extend it to also compute and print Cohen's d, along with a label indicating whether the effect size is small, medium, or large.
import polars as pl
import numpy as np
from scipy import stats
TCM_TO_MCM = 1 / 1_000.0
production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
(pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)
alberta = production.filter(pl.col("province") == "AB")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "BC")["production_mcm"].to_numpy()
result = stats.ttest_ind(alberta, bc, equal_var=False)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
# TODO: compute Cohen's d = (mean_alberta - mean_bc) / pooled_std,
# where pooled_std = sqrt((std_alberta**2 + std_bc**2) / 2),
# and print it alongside an interpretation
# (negligible < 0.2, small 0.2–0.5, medium 0.5–0.8, large > 0.8)
How do you know the addition is correct?
Compute Cohen's d by hand using the means and standard deviations already available and confirm it matches the printed value. A very large or very small d that contradicts the t-test result suggests a calculation error.
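A by-hand check of the formula on invented numbers, using the conventional Cohen cutoffs (0.2 / 0.5 / 0.8):

```python
import math
import statistics

# Invented samples; checks the exercise's formula by hand.
ab = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8]
bc = [7.2, 7.5, 7.0, 7.4, 7.1, 7.3]

mean_ab, mean_bc = statistics.mean(ab), statistics.mean(bc)
sd_ab, sd_bc = statistics.stdev(ab), statistics.stdev(bc)

# Pooled SD as defined in the exercise (average-variance form, equal n).
pooled_sd = math.sqrt((sd_ab**2 + sd_bc**2) / 2)
d = (mean_ab - mean_bc) / pooled_sd

if abs(d) < 0.2:
    label = "negligible"
elif abs(d) < 0.5:
    label = "small"
elif abs(d) < 0.8:
    label = "medium"
else:
    label = "large"
print(f"Cohen's d = {d:.2f} ({label})")
```

Because these invented groups barely overlap, d comes out far above 0.8; a tiny d paired with a highly significant t-test (or the reverse) on the real data would be a sign of a calculation error.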