Correlation and Prediction

Learning Goals

Lesson

i
"""Correlation and regression for lines vs. functions per file."""

import polars as pl
from scipy import stats

py = pl.read_csv("data/py_func_counts.csv")
js = pl.read_csv("data/js_func_counts.csv")

for lang, df in [("Python", py), ("JavaScript", js)]:
    clean = df.drop_nulls(["lines", "functions"]).filter(pl.col("functions") > 0)
    r, p = stats.pearsonr(clean["lines"].to_numpy(), clean["functions"].to_numpy())
    print(f"{lang}: Pearson r = {r:.3f}, p = {p:.2e}")

# Linear regression on Python data
clean_py = py.drop_nulls(["lines", "functions"]).filter(pl.col("functions") > 0)
x = clean_py["lines"].to_numpy()
y = clean_py["functions"].to_numpy()
slope, intercept, r, p, se = stats.linregress(x, y)
print(
    f"\nPython regression: slope = {slope:.4f}, intercept = {intercept:.2f}, R² = {r**2:.3f}"
)
residuals = y - (slope * x + intercept)
print(f"Residual std: {residuals.std():.2f}")

Check Understanding

If Pearson r = 0.8 between lines-per-file and functions-per-file, can you conclude that longer files cause more functions to be defined?

No. Correlation does not imply causation. Both variables could be driven by a third factor, such as the age of the codebase or the programming style of a particular team. Longer files might accumulate more functions simply because the same developer who writes long files also writes many functions; the file length itself might not cause anything. To make a causal claim you would need to control for confounds or run an experiment.

The following code computes R² incorrectly. What is the bug and how do you fix it?
slope, intercept, r, p, se = stats.linregress(x, y)
print(f"R² = {r:.3f}")

stats.linregress returns the Pearson correlation coefficient r, not R². To get R² you must square it:

slope, intercept, r, p, se = stats.linregress(x, y)
print(f"R² = {r**2:.3f}")

Printing r directly gives a value between -1 and 1 that looks plausible but is wrong. For r = 0.806, the code prints 0.806 instead of the correct R² ≈ 0.650.

A residual plot shows a fan shape: residuals are small for files with few predicted functions and large for files with many predicted functions. What does this indicate about the model?

The fan shape indicates heteroscedasticity: the variance of the prediction error is not constant but grows with the predicted value. A plain linear regression assumes constant variance, so its confidence intervals and standard errors for large files will be too narrow. One common remedy is to log-transform the outcome variable, which often stabilizes variance when the data spans several orders of magnitude.

You are comparing lines-per-file to functions-per-file in a dataset where 90% of files have fewer than 100 lines but a handful have more than 10,000 lines. When would Spearman rank correlation give a more reliable answer than Pearson r?

Pearson r is sensitive to outliers because it is calculated from the raw values. A handful of extremely long files will pull the regression line toward them and inflate r, giving the impression of a stronger or weaker linear relationship than exists for the majority of files. Spearman r replaces raw values with ranks, so those giant files simply become "the largest" rather than values 100 times bigger than everything else. Spearman is more reliable here because it describes the relationship across the full distribution rather than being dominated by the extreme tail.

Exercises

Spearman vs. Pearson

Compute the Spearman rank correlation between lines-per-file and functions-per-file for both the Python and JavaScript datasets using scipy.stats.spearmanr. Compare each Spearman value to the corresponding Pearson r. For which language does the gap between the two measures suggest the relationship is driven by outliers? Describe the characteristics of a file for which Spearman would be substantially more reliable than Pearson, and explain why in terms of what each measure actually computes.

Regression interpretation

Fit a linear regression of functions-per-file on lines-per-file for the Python dataset using scipy.stats.linregress. Report the slope, intercept, and R². Use the model to predict the number of functions in a 500-line file. Then explain in two sentences whether you trust that prediction, considering both the R² value and the range of file sizes in the training data.

In-sample vs. out-of-sample R²

Split the Python dataset in half randomly (use a fixed random seed so your results are reproducible). Fit the linear regression on the first half and compute R² for both halves: once on the data you trained on (in-sample) and once on the data you held out (out-of-sample). Report both values. Explain in two sentences what the difference between them reveals about whether the model is overfitting to the training data.

Large-file sensitivity

Compute Pearson r between lines-per-file and functions-per-file for the Python dataset. Then remove all files with more than 1,000 lines and compute it again on the remaining data. Report both values. Write two sentences explaining whether the correlation is driven by large files and what this implies about using a single r to summarize the relationship across the full dataset.

Heteroscedasticity and transformation

After fitting the linear regression on the Python data, plot the residuals against the predicted values using Altair. Describe in one sentence whether the plot shows a fan shape. If it does, re-fit the model after applying a log transformation to both variables (use pl.col("lines").log() and pl.col("functions").log()) and compare the residual plots before and after. Explain in one sentence why log-transforming both variables can reduce heteroscedasticity in file-size data.