Grouping, Aggregating, and Joining

Learning Goals

Use group_by and agg to compute per-group statistics
Join two dataframes on a key column and check for row count changes
Reshape dataframes with pivot and unpivot

Lesson

group_by splits a dataframe into groups sharing the same value in one or more columns, then agg computes a summary for each group
- Aggregation functions include mean, median, std, min, max, count, and n_unique
- The result has one row per unique combination of the grouping columns
- Order of rows in the output is not guaranteed; use sort if you need a specific order
You can group by more than one column at a time
- Example: df.group_by(["language", "year"]).agg(pl.col("lines").median()) gives median line count per language per year
- Each additional grouping column multiplies the number of output rows
Common aggregation mistake: forgetting that mean is sensitive to outliers
- A few very long functions can pull the mean far above the median
- When in doubt, report both; if they differ substantially, explain why
A join combines two dataframes by matching rows on one or more key columns
- An inner join keeps only rows where the key appears in both dataframes
- A left join keeps all rows from the left dataframe and fills with null where the right has no match
- Outer joins keep all rows from both sides
- The danger of joining without checking: row counts can explode (many-to-many join) or silently shrink (inner join on a key that many rows do not share)
Always check shape before and after a join
- If the output has more rows than either input, you have a many-to-many relationship that you probably did not intend
- If the output has far fewer rows than either input, many keys did not match, which is worth investigating before proceeding
A pivot reshapes a dataframe from long format (one row per observation) to wide format (one row per entity with multiple value columns)
- unpivot goes the other direction: from wide to long
- Long format is usually better for analysis; wide format is often more readable for humans
- Polars uses df.pivot(on="variable", index="id", values="value") and df.unpivot(on=["col_a", "col_b"], index="id")
El Haji et al. studied tests generated by GitHub Copilot and found that the tests are syntactically correct but often lack meaningful assertions [ElHaji2024]
- This is partly a data problem: if test outcomes are stored as unstructured text or log files, you cannot group by assertion type or aggregate pass rates across test suites
- Structuring test result data as a proper dataframe with columns for test name, assertion type, pass/fail, and file name makes exactly this kind of group-and-aggregate analysis straightforward
- Well-structured data enables the analysis; poorly structured data makes it impossible

"""Compare Python and JavaScript function sizes using groupby."""

import polars as pl

py = pl.read_csv("data/py_func_counts.csv").with_columns(
    pl.lit("Python").alias("language")
)
js = pl.read_csv("data/js_func_counts.csv").with_columns(
    pl.lit("JavaScript").alias("language")
)
combined = pl.concat([py, js])
combined = combined.with_columns(
    (pl.col("lines") / pl.col("functions")).alias("lines_per_func")
)
stats = combined.group_by("language").agg(
    pl.col("lines_per_func").median().alias("median"),
    pl.col("lines_per_func").mean().alias("mean"),
    pl.col("lines_per_func").std().alias("std"),
)
print(stats)

Check Understanding

What is the difference between a left join and an inner join? Give an example where they produce different results.

An inner join keeps only rows where the key value appears in both dataframes. A left join keeps all rows from the left dataframe and fills in null for any columns from the right dataframe when no matching key exists. For example, if you have a dataframe of all Python files and a dataframe of only those files that have test coverage data, an inner join would drop Python files with no coverage data, while a left join would keep all Python files and show null coverage for the ones not in the coverage dataframe. Which you want depends on whether you care about the files with no coverage data.

The following code is supposed to extract the mean lines-per-function for Python from a grouped result, but it contains a bug. What is wrong and how do you fix it?

result = (df
          .group_by("language")
          .agg(pl.col("lines").mean()))
# then later
py_mean = result["language" == "Python"]["lines"]

The expression result["language" == "Python"] is not a Polars filter: it evaluates "language" == "Python" as a Python boolean comparison between two strings, which is False, then tries to use False as an index into the dataframe. This will either raise an error or return an empty result. The fix is to use filter properly:

py_mean = result.filter(pl.col("language") == "Python")["lines"][0]

Why should you always check shape before and after a join?

A join can silently change the number of rows in ways that are hard to detect later. If the key column has duplicate values in both dataframes (a many-to-many relationship), the output will have more rows than either input, because every matching pair of rows produces a row in the output. If many rows in one dataframe have keys that do not appear in the other, an inner join will drop them without warning. Either situation means your subsequent analysis is working on data with a different shape than you expected. Checking shape before and after catches these problems immediately, before they propagate into aggregations or visualizations.

What would unpivot do to a dataframe with columns file, python_lines, and js_lines, and when would you want that shape?

unpivot would transform the dataframe from wide format (one row per file, two value columns) to long format (two rows per file, one column for the language name and one column for the line count). The result would have columns file, variable (containing the strings python_lines and js_lines), and value (containing the actual counts). You would want this shape when you need to do a group-by on language, pass the data to a plotting library that expects a single value column, or apply the same filter to both languages at once without writing the condition twice.

Exercises

Histogram of Function Sizes

Using the combined Python and JavaScript function-count data from func_counts.py, create a histogram of lines-per-function for each language using 20-line bins from 0 to 200 lines. Plot both histograms on the same chart using Altair, with language mapped to color. Write two sentences comparing the shapes of the two distributions: which language has longer functions on average, and which has a longer tail?

Files in Common

Load the Python and JavaScript function-count datasets and join them on the file name column using an outer join. After the join, count how many files appear in both datasets, how many appear only in the Python dataset, and how many appear only in the JavaScript dataset. Report these three counts and write a sentence explaining what a file appearing in both datasets would imply about the repository being analyzed.

Percentile Table by Language

Compute the 10th, 25th, 75th, and 90th percentiles of lines-per-function for each language as a single Polars dataframe with one row per language and four value columns. Sort by the 50th percentile (median). Identify which language has the largest difference between its 90th and 10th percentiles and write a sentence explaining what that difference tells you about the predictability of function size in that language.

Pivot to Wide Format

Start with the combined long-format dataframe from func_counts.py that has one row per file per language. Use pivot to produce a wide-format dataframe with one row per file and two columns: python_lines_per_func and js_lines_per_func. Check the shape of the result before and after the pivot. Write a sentence explaining what happened to files that exist in only one language's dataset, and whether that behavior is appropriate for a comparison study.

Mean Versus Median Ratio

Using the grouped statistics from func_counts.py, compute two ratios: the ratio of mean lines-per-function in Python to mean lines-per-function in JavaScript, and the ratio of median lines-per-function in Python to median lines-per-function in JavaScript. Report both ratios. Write three to four sentences explaining which ratio tells the more useful story for a researcher studying typical function size, and what the difference between the two ratios implies about the shape of the distribution.