Better Prompts

Goals

Why Vague Prompts Fail

What goes wrong when I give an LLM a vague prompt?

i
import polars as pl

df = pl.read_csv("npri_data.csv", null_values=[""])

# The LLM invented "substance" and "lead_threshold" -- neither exists.
# This code will crash with KeyError: 'substance'
dangerous = df.filter(
    (pl.col("substance") == "lead") & (pl.col("lead_threshold") > 10)
)
print(f"Facilities releasing dangerous amounts of lead: {len(dangerous)}")

Anatomy of a Good Prompt

What information does a prompt need to include to get useful code?

i
import polars as pl

df = pl.read_csv("npri_data.csv", null_values=[""])
print("Column names:")
print(df.columns)
print("\nFirst row:")
print(df.head(1))
print("\nDistinct substance names (first 20):")
print(df["Substance_Name_English"].unique().sort().head(20))

Computing a Proportion

What proportion of facilities released more than one tonne of lead compounds in the most recent year?

i
import polars as pl

LEAD_THRESHOLD = 1.0  # tonnes; NPRI reporting threshold for lead (and its compounds)
LEAD_SUBSTANCE = "Lead (and its compounds)"

df = pl.read_csv("npri_data.csv", null_values=[""])

most_recent_year = df["Reporting_Year"].max()
df_year = df.filter(pl.col("Reporting_Year") == most_recent_year)
df_lead = df_year.filter(pl.col("Substance_Name_English") == LEAD_SUBSTANCE)
df_lead = df_lead.drop_nulls(subset=["Total_Released_Tonnes"])

total = len(df_lead)
exceeds = df_lead.filter(pl.col("Total_Released_Tonnes") > LEAD_THRESHOLD).height
proportion = exceeds / total

print(f"Year: {most_recent_year}")
print(f"Facilities reporting lead with release data: {total}")
print(f"Facilities releasing more than {LEAD_THRESHOLD} tonne: {exceeds}")
print(f"Proportion: {proportion:.3f} (about 1 in {round(1 / proportion)})")

Probability as a Proportion

What does it mean to say the probability of exceeding the threshold is 0.059?

Verifying by Hand

How do I check that the code produced the right answer?

i
import polars as pl

LEAD_THRESHOLD = 1.0
LEAD_SUBSTANCE = "Lead (and its compounds)"

df = pl.read_csv("npri_data.csv", null_values=[""])
most_recent_year = df["Reporting_Year"].max()

sample = (
    df.filter(
        (pl.col("Reporting_Year") == most_recent_year)
        & (pl.col("Substance_Name_English") == LEAD_SUBSTANCE)
    )
    .drop_nulls(subset=["Total_Released_Tonnes"])
    .select(["Facility_Name", "Province_Territory", "Total_Released_Tonnes"])
    .sort("Total_Released_Tonnes", descending=True)
    .head(20)
)
print(sample)
print(f"\nRows in this sample above {LEAD_THRESHOLD} tonne: "
      f"{sample.filter(pl.col('Total_Released_Tonnes') > LEAD_THRESHOLD).height}")

Check Understanding

You paste the prompt: "Compute the average lead released." The LLM produces code that reads df["lead_tonnes"].mean() but your column is named Total_Released_Tonnes. The code crashes. Rewrite the prompt so this error cannot happen.

Include the exact column names and a sample row: "The dataset has columns NPRI_ID, Facility_Name, Province_Territory, Substance_Name_English, Total_Released_Tonnes. Using Polars, filter to rows where Substance_Name_English is 'Lead (and its compounds)' and compute the mean of Total_Released_Tonnes, excluding nulls." With the exact column name in the prompt, the LLM cannot invent an alternative.

The code filters to Substance_Name_English == "Lead (and its compounds)" and returns 0 rows. You can see lead entries in the first few rows of the file. What are two likely causes?

First, the substance name in the file might use different capitalisation or punctuation (for example, "Lead and its compounds" without parentheses, or "lead (and its compounds)" in lowercase). Print df["Substance_Name_English"].unique() and search for rows containing "Lead" to find the exact string. Second, the filter may be applied before the year filter, so the correct rows exist in the full data but are excluded by an earlier step. Print the dataframe at each step to find where the rows disappear.

The code returns a proportion of 0.031. You check the first 20 lead-reporting rows and find 2 above 1 tonne. Is the code consistent with your check? Show your reasoning.

2 out of 20 is 0.10, which is much higher than 0.031. This discrepancy is suspicious. Possible causes: the first 20 rows may be sorted by quantity descending, so they overrepresent high-release facilities; the LLM may have filtered to a different set of rows for the proportion than for the sample; or the denominator may be different (perhaps it counted all facilities, not just lead-reporting ones). Expand the hand-check to a random sample of rows rather than the first 20.

A report says "6% of facilities reporting to NPRI released more than 1 tonne of lead compounds." Someone reads this and says "my city's factory has a 6% chance of being a lead emitter." What is wrong with this interpretation?

The 6% is a proportion across all NPRI-reporting facilities, which are a specific, non-random group (facilities large enough to meet reporting thresholds). A factory that does not meet any NPRI reporting threshold is not in the denominator at all. The proportion from a non-random, self-selected reporting group cannot be applied as a probability to a specific facility that may or may not be in the group.

Exercises

Provincial Breakdown

Ask the LLM to compute the proportion of facilities exceeding 1 tonne separately for each province. Which provinces have the highest proportions? Is there a pattern you would expect given Canada's industrial geography?

Multiple Substances

The NPRI tracks many pollutants. Ask the LLM to compute the proportion of facilities exceeding reporting thresholds for mercury (Hg, threshold 5 kg) and arsenic (As, threshold 1 tonne) in addition to lead. Compare the three proportions.

Trend Over Time

Ask the LLM to compute the proportion of facilities reporting lead releases above 1 tonne for each year in the dataset. Has the proportion changed over time?

Re-prompt After Failure

Give the LLM the vague prompt "How many facilities released dangerous amounts of lead?" and record what it produces. Then give it the well-structured prompt from this session and compare the two results. Write one sentence describing what changed between the two outputs.

Threshold Set Too High

The following code computes the proportion of lead-reporting facilities that exceeded a threshold, but the proportion is much smaller than the figure reported in published NPRI summaries. Work with an LLM to find the wrong constant and correct it.

i
import polars as pl

LEAD_THRESHOLD = 10.0  # tonnes
LEAD_SUBSTANCE = "Lead (and its compounds)"

df = pl.read_csv("npri_data.csv", null_values=[""])

most_recent_year = df["Reporting_Year"].max()
df_year = df.filter(pl.col("Reporting_Year") == most_recent_year)
df_lead = df_year.filter(pl.col("Substance_Name_English") == LEAD_SUBSTANCE)
df_lead = df_lead.drop_nulls(subset=["Total_Released_Tonnes"])

total = len(df_lead)
exceeds = df_lead.filter(pl.col("Total_Released_Tonnes") > LEAD_THRESHOLD).height
proportion = exceeds / total

print(f"Year: {most_recent_year}")
print(f"Facilities reporting lead: {total}")
print(f"Facilities exceeding threshold: {exceeds}")
print(f"Proportion: {proportion:.3f}")
How do you know the fix worked?

The corrected proportion should be close to the figure in the NPRI annual summary for the same year. Also check that the constant name LEAD_THRESHOLD matches the value described in the NPRI documentation.

Substance Name Does Not Match

The following code filters to lead-reporting facilities but returns zero rows. Work with an LLM to explain why the filter finds nothing and fix the string.

i
import polars as pl

LEAD_THRESHOLD = 1.0  # tonnes
LEAD_SUBSTANCE = "lead (and its compounds)"

df = pl.read_csv("npri_data.csv", null_values=[""])

most_recent_year = df["Reporting_Year"].max()
df_year = df.filter(pl.col("Reporting_Year") == most_recent_year)
df_lead = df_year.filter(pl.col("Substance_Name_English") == LEAD_SUBSTANCE)

print(f"Rows matching substance filter: {len(df_lead)}")
print(f"Distinct substances in full dataset (sample):")
print(df["Substance_Name_English"].unique().sort().head(10))
How do you know the fix worked?

After fixing, the row count should be greater than zero. Print df["Substance_Name_English"].unique() and search for "lead" to find the exact string as it appears in the file, then confirm the constant matches it character for character.

Wrong Denominator

The following code computes the proportion of facilities exceeding the lead threshold, but uses the total number of NPRI-reporting facilities as the denominator rather than just the facilities that reported lead. Work with an LLM to identify which denominator is correct and fix the calculation.

i
import polars as pl

LEAD_THRESHOLD = 1.0  # tonnes
LEAD_SUBSTANCE = "Lead (and its compounds)"

df = pl.read_csv("npri_data.csv", null_values=[""])

most_recent_year = df["Reporting_Year"].max()
df_year = df.filter(pl.col("Reporting_Year") == most_recent_year)
df_lead = df_year.filter(pl.col("Substance_Name_English") == LEAD_SUBSTANCE)
df_lead = df_lead.drop_nulls(subset=["Total_Released_Tonnes"])

all_facilities = df_year["NPRI_ID"].n_unique()
exceeds = df_lead.filter(pl.col("Total_Released_Tonnes") > LEAD_THRESHOLD).height
proportion = exceeds / all_facilities

print(f"Year: {most_recent_year}")
print(f"All facilities this year: {all_facilities}")
print(f"Lead-reporting facilities: {len(df_lead)}")
print(f"Facilities exceeding {LEAD_THRESHOLD} t: {exceeds}")
print(f"Proportion (of all facilities): {proportion:.4f}")
How do you know the fix worked?

The code already prints both all_facilities and len(df_lead). Compute both proportions by hand and decide which one answers the question "of facilities that reported lead, what fraction exceeded the threshold?"

Print the Evidence

The following code prints a proportion but not the numerator or denominator, so there is no way to verify the arithmetic by eye. Work with an LLM to extend it to also print the count of facilities exceeding the threshold and the total count of lead-reporting facilities.

i
import polars as pl

LEAD_THRESHOLD = 1.0  # tonnes
LEAD_SUBSTANCE = "Lead (and its compounds)"

df = pl.read_csv("npri_data.csv", null_values=[""])

most_recent_year = df["Reporting_Year"].max()
df_lead = (
    df.filter(pl.col("Reporting_Year") == most_recent_year)
    .filter(pl.col("Substance_Name_English") == LEAD_SUBSTANCE)
    .drop_nulls(subset=["Total_Released_Tonnes"])
)

total = len(df_lead)
exceeds = df_lead.filter(pl.col("Total_Released_Tonnes") > LEAD_THRESHOLD).height
proportion = exceeds / total

print(f"Proportion exceeding {LEAD_THRESHOLD} t: {proportion:.3f}")
# TODO: also print exceeds and total so the reader can verify
# that proportion = exceeds / total
How do you know the addition is correct?

Divide the two new numbers yourself and confirm the result matches the printed proportion. If exceeds / total does not equal the proportion shown, there is a bug in the original computation.