Reproducibility

Goals

The Replication Crisis

Why does replication matter for data science?

What Reproducibility Requires

What does an analysis need in order to be reproducible?

import sys
import polars as pl
import altair as alt

print(f"Python:  {sys.version}")
print(f"Polars:  {pl.__version__}")
print(f"Altair:  {alt.__version__}")

Recreating the Alert CO2 Curve

Prompt the LLM to write a complete analysis that recreates the Alert Station CO2 curve.

or

The file alert_co2_monthly.csv has no column headers. The columns in order are year, month, decimal date, monthly average CO2 in ppm, deseasonalized CO2, number of days, standard deviation, and uncertainty. Missing values are -999.99. Using Polars and Altair, read the file, drop missing values, plot monthly average CO2 over decimal date as a line chart, and save it as alert_co2.png.

import polars as pl
import altair as alt

# Column order from ECCC Alert Station readme:
# year, month, decimal_date, average, deseasonalized,
# ndays, std_dev, uncertainty. Missing values are -999.99.
COLUMN_NAMES = [
    "year", "month", "decimal_date", "average",
    "deseasonalized", "ndays", "std_dev", "uncertainty",
]

df = pl.read_csv(
    "alert_co2_monthly.csv",
    comment_prefix="#",
    has_header=False,
    new_columns=COLUMN_NAMES,
    null_values=["-999.99", "-1"],
)
df = df.drop_nulls(subset=["average"])

chart = (
    alt.Chart(df)
    .mark_line(color="steelblue", strokeWidth=1)
    .encode(
        x=alt.X("decimal_date:Q", title="Year"),
        y=alt.Y("average:Q", title="CO\u2082 Concentration (ppm)"),
        tooltip=["year", "month", "average"],
    )
    .properties(
        title="Atmospheric CO\u2082 at Alert Station, Nunavut",
        width=700,
        height=350,
    )
)
# PNG export in Altair 5 and later relies on the vl-convert-python package.
chart.save("alert_co2.png")
print("Saved alert_co2.png")
print(f"Rows: {len(df)}, Date range: {df['year'].min()}\u2013{df['year'].max()}")
print(f"CO\u2082 range: {df['average'].min():.1f}\u2013{df['average'].max():.1f} ppm")

Comparing to the Published Figure

How close does our chart come to the ECCC published figure?

Running the Notebook Again

How do I confirm the analysis is truly reproducible?

import polars as pl

COLUMN_NAMES = [
    "year", "month", "decimal_date", "average",
    "deseasonalized", "ndays", "std_dev", "uncertainty",
]

df = pl.read_csv(
    "alert_co2_monthly.csv",
    comment_prefix="#",
    has_header=False,
    new_columns=COLUMN_NAMES,
    null_values=["-999.99", "-1"],
)
df = df.drop_nulls(subset=["average"])

print(f"Rows after dropping missing: {len(df)}")
print(f"First year: {df['year'].min()}, Last year: {df['year'].max()}")
print(f"Most recent monthly average: {df['average'].tail(1).item():.2f} ppm")

Sharing the Notebook

What does a collaborator need to re-run this analysis?

Check Understanding

You share a notebook with a colleague. They install the packages fresh and run all cells, but the Altair chart looks slightly different from yours: the axis labels are in a different font and the line is a slightly different shade of blue. Is the analysis reproducible?

Yes, if the underlying numbers are the same. Visual rendering details like fonts and colour shades can vary between operating systems, browser versions, and Altair versions without affecting the data or computations. The relevant test is whether the plotted CO2 values at each date are identical. If the colleague's chart shows the same curve with the same x and y values, the analysis is reproducible even if the appearance differs slightly.

You download the Alert CO2 file in January 2025 and run the analysis. A colleague downloads the same file in March 2025 and gets a chart that extends two months further. Is either analysis wrong?

Neither is wrong. ECCC updates the file periodically with new measurements. The two analyses are each reproducible from their respective data files, but they are not identical because the input data changed. To ensure your colleague gets the same chart, share your copy of the data file alongside the notebook, not just a link to the ECCC download page.
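
One way to confirm that two people are working from the same input is to compare a checksum of the data file. The sketch below uses Python's standard hashlib module; the function name file_sha256 is hypothetical, and the file name is the one used throughout this lesson. If the two digests match, the files are byte-for-byte identical.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    data_file = Path("alert_co2_monthly.csv")
    if data_file.exists():
        print(f"{data_file}: {file_sha256(data_file)}")
    else:
        print(f"{data_file} not found; place it next to this script first.")
```

Recording the digest in the notebook alongside the download date lets a collaborator check their copy of the file before comparing charts.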

You restart the kernel and run all cells, but cell 5 crashes with "NameError: name 'df' is not defined." Cell 3 defines df. What is the most likely cause?

Cell 5 depends on df, which cell 3 should have defined first. The most likely explanation is that cell 3 was skipped or crashed silently, and that in earlier sessions a df left in memory by a previous run hid the problem, so the notebook appeared to work until the kernel was restarted. The fix is to ensure cell 3 runs without error before cell 5. The lesson is that notebooks that only work when run in a particular partial order are not truly reproducible.

A paper claims its results are reproducible because the code is available on GitHub. A reviewer finds that the code uses a hardcoded path like /Users/alice/Desktop/data.csv. Is the paper's claim correct?

No. The hardcoded path works only on the original author's computer. Anyone else who runs the code will get a FileNotFoundError unless they happen to have a file at exactly that path. Reproducibility requires that the path either be relative to the notebook location (so the data file can be placed next to the notebook) or configurable by the user. The reviewer is right to flag this as a reproducibility failure.
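
The "configurable by the user" option can be sketched as follows. This is only an illustration, not the method any particular paper used: the environment variable name ALERT_CO2_DATA and the function resolve_data_path are hypothetical. The idea is to let the user point at their own copy of the data, and otherwise fall back to a file sitting next to the code.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; not part of the original lesson.
ENV_VAR = "ALERT_CO2_DATA"
DEFAULT_NAME = "alert_co2_monthly.csv"


def resolve_data_path(base_dir=None):
    """Return the data file path: user override first, else relative to base_dir."""
    override = os.environ.get(ENV_VAR)
    if override:
        return Path(override)
    # In a script, pass Path(__file__).parent. In a notebook, the current
    # working directory is usually the folder that contains the notebook.
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / DEFAULT_NAME


print(resolve_data_path())
```

Either way, no path that exists on only one person's computer is written into the code itself.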

Exercises

Record the Environment

Add a cell at the top of your Alert CO2 notebook that prints the version of Python, Polars, and Altair being used. Share the notebook with a classmate. Ask them to run it and compare the version numbers to yours.

Add a Trend Line

Ask the LLM to add a linear trend line to the Alert CO2 chart. What is the slope in ppm per year? Compare your slope to the rate of increase reported in the ECCC documentation.

Seasonal Amplitude

The seasonal oscillation in the Alert CO2 record gets slightly larger over time. Ask the LLM to compute the amplitude (maximum minus minimum CO2 within each year) for each year. Plot the amplitude over time and describe the trend.

Two-Run Comparison

Run the notebook twice, saving the output chart each time as alert_co2_run1.png and alert_co2_run2.png. Ask the LLM to write code that checks whether the two files are identical pixel-for-pixel. Are they?

Hardcoded Path

The following code reads the Alert CO2 data from an absolute path that only works on one computer. Work with an LLM to replace it with a relative path so the script runs on any machine where the data file sits next to the script.

import polars as pl
import altair as alt

COLUMN_NAMES = [
    "year", "month", "decimal_date", "average",
    "deseasonalized", "ndays", "std_dev", "uncertainty",
]

df = pl.read_csv(
    "/Users/alice/Desktop/alert_co2_monthly.csv",
    comment_prefix="#",
    has_header=False,
    new_columns=COLUMN_NAMES,
    null_values=["-999.99", "-1"],
)
df = df.drop_nulls(subset=["average"])

print(f"Rows: {len(df)}, CO\u2082 range: {df['average'].min():.1f}\u2013{df['average'].max():.1f} ppm")

How do you know the fix worked?

Move the data file and script to a different folder and run the script from there. If it runs without a FileNotFoundError, the path is now relative and portable. A classmate should be able to run the script on their own machine without editing it.

Sentinel Values Not Removed

The following code reads the CO2 data but forgets to treat -999.99 as missing, so the chart shows a dramatic downward spike that is not in the published figure. Work with an LLM to add the missing filter and fix the chart.

import polars as pl
import altair as alt

COLUMN_NAMES = [
    "year", "month", "decimal_date", "average",
    "deseasonalized", "ndays", "std_dev", "uncertainty",
]

df = pl.read_csv(
    "alert_co2_monthly.csv",
    comment_prefix="#",
    has_header=False,
    new_columns=COLUMN_NAMES,
    null_values=[""],
)

print(f"Rows: {len(df)}")
print(f"CO\u2082 min: {df['average'].min():.2f}, max: {df['average'].max():.2f}")

chart = (
    alt.Chart(df.drop_nulls(subset=["average"]))
    .mark_line(color="steelblue", strokeWidth=1)
    .encode(
        x=alt.X("decimal_date:Q", title="Year"),
        y=alt.Y("average:Q", title="CO\u2082 Concentration (ppm)"),
    )
    .properties(title="Atmospheric CO\u2082 at Alert Station", width=700, height=350)
)
chart.save("alert_co2.png")
print("Saved alert_co2.png")

How do you know the fix worked?

After fixing, the printed CO2 minimum should be around 330 ppm (the 1975 value), not -999.99. The chart should show a smooth rising curve with a seasonal oscillation and no downward spikes.

Using a Variable Before It Is Defined

The following script crashes with a NameError on its first print statement. Work with an LLM to explain why the error happens and reorder the lines to fix it.

import polars as pl

COLUMN_NAMES = [
    "year", "month", "decimal_date", "average",
    "deseasonalized", "ndays", "std_dev", "uncertainty",
]

# This line uses df_clean before it is defined below.
print(f"CO\u2082 range: {df_clean['average'].min():.1f}\u2013{df_clean['average'].max():.1f} ppm")

df = pl.read_csv(
    "alert_co2_monthly.csv",
    comment_prefix="#",
    has_header=False,
    new_columns=COLUMN_NAMES,
    null_values=["-999.99", "-1"],
)
df_clean = df.drop_nulls(subset=["average"])
print(f"Rows after removing missing values: {len(df_clean)}")

How do you know the fix worked?

The script should run from top to bottom without any errors, and the CO2 range should be printed only after df_clean has been defined. This error is the script-level equivalent of running notebook cells out of order.

Recording the Environment

The following code reads and summarises the CO2 data but does not record which versions of Python, Polars, and Altair were used. Work with an LLM to extend it to print those version numbers.

import polars as pl
import altair as alt

COLUMN_NAMES = [
    "year", "month", "decimal_date", "average",
    "deseasonalized", "ndays", "std_dev", "uncertainty",
]

df = pl.read_csv(
    "alert_co2_monthly.csv",
    comment_prefix="#",
    has_header=False,
    new_columns=COLUMN_NAMES,
    null_values=["-999.99", "-1"],
)
df = df.drop_nulls(subset=["average"])

print(f"Rows: {len(df)}, Date range: {df['year'].min()}\u2013{df['year'].max()}")
print(f"CO\u2082 range: {df['average'].min():.1f}\u2013{df['average'].max():.1f} ppm")
# TODO: print the versions of Python, Polars, and Altair being used,
# so that anyone re-running this analysis can confirm they have the same environment

How do you know the addition is correct?

Run pip show polars altair in the terminal and compare the versions it reports to the ones your script prints. They should match exactly. pip show does not report the Python version, so run python --version to check that one.
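
The same cross-check can also be done from inside Python. The sketch below uses the standard library's importlib.metadata, which reads the metadata of installed packages; the helper name installed_versions is hypothetical. A package that is not installed is reported as None rather than raising an error.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    # sys.version starts with the version number, e.g. "3.12.1 (main, ...)".
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["polars", "altair"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If these values differ from pl.__version__ or alt.__version__ as printed by the script, the script and the terminal are probably using different Python environments.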