Checking Your Work

Goals

The Unit Error

What happens when the LLM silently uses the wrong units?

import polars as pl

production = pl.read_csv("gas_production_tcm.csv")   # units: 10³ m³ (thousands)
consumption = pl.read_csv("gas_consumption_mcm.csv")  # units: 10⁶ m³ (millions)

# BUG: treats 10³ m³ and 10⁶ m³ as the same unit.
# The ratio will be off by a factor of 1 000.
ratio = production["production"].mean() / consumption["consumption"].mean()
print(f"Production-to-consumption ratio: {ratio:.4f}")
print("(This number is wrong if the units have not been converted.)")

Checking Without Reading Code

What strategies let me catch this error without reading the code?
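One useful strategy is an order-of-magnitude check against a fact you can verify independently, without reading a line of the analysis code. A minimal sketch; the plausible range of 1 to 20 is an assumption based on Alberta being a net gas exporter, not a figure from the data:

```python
# Sanity check: compare the computed ratio to an independently known
# order of magnitude. The bounds are assumptions you should justify
# from an external source (e.g., a published CER summary).

def plausible_ratio(ratio, low=1.0, high=20.0):
    """Return True if the production-to-consumption ratio is in a plausible range."""
    return low <= ratio <= high

print(plausible_ratio(8.1))     # → True: near the expected magnitude
print(plausible_ratio(0.0081))  # → False: off by a factor of 1000
print(plausible_ratio(8100.0))  # → False: off by 1000 the other way
```

A check like this catches unit errors regardless of how the code computed the number, because it relies only on knowledge the code does not have.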

The Correct Comparison

Fix the code so both columns use the same unit before comparing.

or

The production column is in 10³ m³ (thousands of cubic metres) and the consumption column is in 10⁶ m³ (millions of cubic metres). Convert the production column to 10⁶ m³ by dividing by 1000 before computing the ratio.

import polars as pl

# 1 × 10⁶ m³ = 1 000 × 10³ m³
# Divide production (10³ m³) by 1 000 to convert to 10⁶ m³ before comparing.
TCM_TO_MCM = 1 / 1_000.0

production = pl.read_csv("gas_production_tcm.csv")
consumption = pl.read_csv("gas_consumption_mcm.csv")

production_mcm = production.with_columns(
    (pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)

ratio = production_mcm["production_mcm"].mean() / consumption["consumption"].mean()
print(f"Production (mean, 10⁶ m³): {production_mcm['production_mcm'].mean():.1f}")
print(f"Consumption (mean, 10⁶ m³): {consumption['consumption'].mean():.1f}")
print(f"Ratio (production / consumption, same units): {ratio:.2f}")

Running a Hypothesis Test

Run a t-test to check whether natural gas production differs significantly between Alberta and British Columbia.

or

Using Polars and scipy.stats, read gas_production.csv, extract the production values for Alberta and BC, and run a Welch's t-test. Print the t-statistic and the p-value.

import polars as pl
from scipy import stats

# Convert to consistent units before comparing (10³ m³ → 10⁶ m³)
TCM_TO_MCM = 1 / 1_000.0

production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
    (pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)

alberta = production.filter(pl.col("province") == "AB")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "BC")["production_mcm"].to_numpy()

result = stats.ttest_ind(alberta, bc, equal_var=False)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("Result: statistically significant at the 0.05 level.")
else:
    print("Result: not statistically significant at the 0.05 level.")

What a P-Value Means

What does it actually mean when the p-value is 0.003?
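A p-value of 0.003 means: if the null hypothesis were true (no real difference between the groups), only 0.3% of random samples would produce a difference at least this extreme. A simulation makes this concrete; the normal distributions and sample sizes below are arbitrary stand-ins, not the gas data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Make the null hypothesis true by construction: both "provinces" are drawn
# from the same distribution, so any difference is pure sampling noise.
n_sims, hits = 2000, 0
for _ in range(n_sims):
    a = rng.normal(100, 15, size=25)
    b = rng.normal(100, 15, size=25)
    if stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05:
        hits += 1

# Roughly 5% of null simulations land below 0.05 -- that is what
# "significant at the 0.05 level" means when nothing is going on.
print(f"Fraction of null simulations with p < 0.05: {hits / n_sims:.3f}")
```

Seeing p = 0.003 says the observed difference would be rare under the null; it does not, by itself, say the difference is large or important.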

When Significance Misleads

Why can a statistically significant result be meaningless?
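One common reason: statistical significance measures detectability, not size. With a large enough sample, a trivially small difference produces a tiny p-value. A sketch with invented numbers (the means, spread, and sample sizes are chosen purely to make the point):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups that differ by 1.5% of a standard deviation -- a difference
# with no practical meaning for almost any real question.
a = rng.normal(100.00, 10, size=500_000)
b = rng.normal(100.15, 10, size=500_000)

result = stats.ttest_ind(a, b, equal_var=False)
d = (b.mean() - a.mean()) / np.sqrt((a.std(ddof=1)**2 + b.std(ddof=1)**2) / 2)

print(f"p-value: {result.pvalue:.2e}")  # far below 0.05 despite the negligible effect
print(f"Cohen's d: {d:.3f}")            # tiny: practically meaningless
```

The p-value rewards sample size; an effect size like Cohen's d does not, which is why the two should be reported together.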

Check Understanding

The LLM produces a production-to-consumption ratio of 0.001 for Alberta. You expected a value around 5 (Alberta exports far more than it consumes). What unit conversion did the LLM get backwards, and what is the correct ratio?

A ratio near 0.001 instead of the expected ~5 means the conversion was applied in the wrong direction: rather than dividing the production column by 1000 once to express it in 10⁶ m³, the LLM effectively scaled production down (or consumption up) by an extra factor of 1000. Remember that 1 million m³ = 1000 thousand m³, so the same physical quantity is a 1000-times-smaller number when expressed in millions. To get the correct ratio, convert production from 10³ m³ to 10⁶ m³ by dividing by 1000, then divide production by consumption. If Alberta produced roughly 163 000 in 10³ m³ units (163 million m³) and consumed 20 million m³, the correct ratio is approximately 8.

A t-test returns p = 0.0004. A classmate says "this proves Alberta and BC have different production levels." Correct their interpretation in one or two sentences.

p = 0.0004 means that if Alberta and BC truly had identical production levels, only 0.04% of random samples would show a difference at least this large. It does not prove they are different. It says the observed difference is very unlikely under the null hypothesis, and it says nothing about whether the units were correct or whether the difference is large enough to be practically meaningful.

You check a single row by hand and find the LLM's computed ratio for 2020 is 0.0081, but your manual calculation gives 8.1. List the steps you would take before re-running the analysis.

First, confirm your manual calculation by re-deriving the 2020 values from the published CER tables and recomputing the ratio. Second, ask the LLM which columns it used and what units it assumed for each. Third, check whether the code reads the correct rows for 2020, as there may be summary rows or header rows the LLM included. Fourth, re-run the corrected code on only the 2020 row to confirm the output matches your manual result before running on the full dataset.

After fixing the unit error, you re-run the t-test comparing Alberta and BC production and get p = 0.21. The original (incorrect) analysis gave p < 0.001. What does this change tell you?

The original highly significant result was an artifact of the unit error. Note that rescaling both provinces by the same factor would leave the t-statistic unchanged, so for the error to inflate significance it must have scaled the two groups differently (for example, reading one province's values in 10³ m³ and the other's in 10⁶ m³), making them appear 1000 times more different than they actually are. After correcting the units, the year-to-year variation in production within each province is large enough that the difference between provinces is not statistically distinguishable from chance. This shows why significance alone is not enough: the first analysis produced a convincingly significant result while measuring the wrong thing.

Exercises

Three-Province Comparison

Add Saskatchewan to the comparison. Ask the LLM to compute mean natural gas production for Alberta, BC, and Saskatchewan in the same unit. Verify your result against a published CER summary table.

Ask About Assumptions

Take any code from an earlier session and paste it into a new prompt: "What assumptions did you make about the data types and units of each column?" Compare the LLM's answer to the actual column types in the data.

Sensitivity Analysis

For the Alberta vs. BC comparison, vary the significance threshold from 0.01 to 0.10 in steps of 0.01. At which threshold, if any, does the corrected analysis become significant? What does this tell you about using a single fixed threshold?

Find the Published Figure

The CER publishes annual energy market reports. Find their stated value for Alberta natural gas production in the most recent year and compare it to your computed total. Do they match?

Units Not Converted

The following code computes Alberta's production-to-consumption ratio, but the result is off by a factor of 1,000 because the two columns use different units. Work with an LLM to identify which column needs to be converted and fix the calculation.

import polars as pl

production = pl.read_csv("gas_production_tcm.csv")   # units: 10³ m³ (thousands)
consumption = pl.read_csv("gas_consumption_mcm.csv")  # units: 10⁶ m³ (millions)

ab_prod = production.filter(pl.col("province") == "AB")["production"].mean()
ab_cons = consumption.filter(pl.col("province") == "AB")["consumption"].mean()

ratio = ab_prod / ab_cons
print(f"Alberta mean annual production: {ab_prod:.1f}")
print(f"Alberta mean annual consumption: {ab_cons:.1f}")
print(f"Production-to-consumption ratio: {ratio:.4f}")

How do you know the fix worked?

Alberta is a major natural gas exporter, so its ratio should be well above 1, around 5 to 10. If the result is in the thousandths or in the thousands, the unit conversion is still wrong. Check one year's production value against a published CER figure to confirm the units.

One-Tailed Test When Two-Tailed Is Needed

The following code runs a t-test to compare Alberta and BC production, but uses alternative="greater" rather than testing for any difference. Work with an LLM to explain what this assumption means and correct the test.

import polars as pl
from scipy import stats

TCM_TO_MCM = 1 / 1_000.0

production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
    (pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)

alberta = production.filter(pl.col("province") == "AB")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "BC")["production_mcm"].to_numpy()

result = stats.ttest_ind(alberta, bc, equal_var=False, alternative="greater")
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("Result: statistically significant at the 0.05 level.")
else:
    print("Result: not statistically significant at the 0.05 level.")

How do you know the fix worked?

The two-sided p-value is approximately twice the one-sided p-value when the test statistic is positive. Confirm this relationship holds by comparing the outputs of both versions. Also describe in one sentence what assumption alternative="greater" makes that may not be justified.
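The doubling relationship is easy to confirm on stand-in data. The arrays below are invented for illustration; in the exercise the real arrays come from the CSV:

```python
import numpy as np
from scipy import stats

# Stand-in samples with an obvious positive difference ("Alberta" > "BC").
a = np.array([112.0, 108.0, 115.0, 109.0, 111.0, 107.0])
b = np.array([101.0, 99.0, 103.0, 98.0, 100.0, 102.0])

two_sided = stats.ttest_ind(a, b, equal_var=False).pvalue
one_sided = stats.ttest_ind(a, b, equal_var=False, alternative="greater").pvalue

print(f"one-sided p: {one_sided:.6f}")
print(f"two-sided p: {two_sided:.6f}")
# When the t-statistic is positive, two_sided == 2 * one_sided exactly.
```

If the t-statistic were negative, the one-sided "greater" p-value would instead be close to 1, which is another way to see the assumption the one-sided test builds in.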

Province Names Do Not Match

The following code runs a t-test but prints zero rows for both provinces, producing a NaN t-statistic. Work with an LLM to find why the filter returns no data and fix the province names.

import polars as pl
from scipy import stats

TCM_TO_MCM = 1 / 1_000.0

production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
    (pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)

print(f"Distinct province values: {sorted(production['province'].unique().to_list())}")

alberta = production.filter(pl.col("province") == "Alberta")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "British Columbia")["production_mcm"].to_numpy()

print(f"Alberta rows: {len(alberta)}, BC rows: {len(bc)}")

result = stats.ttest_ind(alberta, bc, equal_var=False)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

How do you know the fix worked?

The code already prints the distinct province values in the data. After fixing, the Alberta and BC arrays should each have one value per year, and the t-statistic should be a real number.

Adding Effect Size

The following code runs a t-test and reports whether the result is statistically significant. Work with an LLM to extend it to also compute and print Cohen's d, along with a label indicating whether the effect size is small, medium, or large.

import polars as pl
import numpy as np
from scipy import stats

TCM_TO_MCM = 1 / 1_000.0

production = pl.read_csv("gas_production_tcm.csv")
production = production.with_columns(
    (pl.col("production") * TCM_TO_MCM).alias("production_mcm")
)

alberta = production.filter(pl.col("province") == "AB")["production_mcm"].to_numpy()
bc = production.filter(pl.col("province") == "BC")["production_mcm"].to_numpy()

result = stats.ttest_ind(alberta, bc, equal_var=False)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")
# TODO: compute Cohen's d = (mean_alberta - mean_bc) / pooled_std,
# where pooled_std = sqrt((std_alberta**2 + std_bc**2) / 2),
# and print it with an interpretation using Cohen's benchmarks:
# |d| ≈ 0.2 small, ≈ 0.5 medium, ≈ 0.8 large

How do you know the addition is correct?

Compute Cohen's d by hand using the means and standard deviations already available and confirm it matches the printed value. A very large or very small d that contradicts the t-test result suggests a calculation error.
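The hand check can be sketched with stand-in arrays (invented numbers, not the gas data). For equal group sizes n, this definition of d satisfies t = d·√(n/2) under Welch's test, which gives a second cross-check:

```python
import numpy as np
from scipy import stats

# Stand-in arrays; the real ones come from the CSV. n = 5 per group.
alberta = np.array([160.0, 155.0, 170.0, 158.0, 163.0])
bc = np.array([40.0, 42.0, 38.0, 41.0, 39.0])

pooled_std = np.sqrt((alberta.std(ddof=1)**2 + bc.std(ddof=1)**2) / 2)
d = (alberta.mean() - bc.mean()) / pooled_std
t = stats.ttest_ind(alberta, bc, equal_var=False).statistic

print(f"Cohen's d:   {d:.2f}")
print(f"t-statistic: {t:.2f}")
print(f"d * sqrt(n/2): {d * np.sqrt(5 / 2):.2f}")  # should equal the t-statistic
```

If d and t disagree in sign, or d·√(n/2) does not reproduce t for equal-sized groups, one of the two calculations has a bug.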