Grouping, Aggregating, and Joining
Learning Goals
- Use
group_byandaggto compute per-group statistics - Join two dataframes on a key column and check for row count changes
- Reshape dataframes with
pivotandunpivot
Lesson
group_bysplits a dataframe into groups sharing the same value in one or more columns, thenaggcomputes a summary for each group- Aggregation functions include
mean,median,std,min,max,count, andn_unique - The result has one row per unique combination of the grouping columns
- Order of rows in the output is not guaranteed; use
sortif you need a specific order
- Aggregation functions include
- You can group by more than one column at a time
- Example:
df.group_by(["language", "year"]).agg(pl.col("lines").median())gives median line count per language per year - Each additional grouping column multiplies the number of output rows
- Example:
- Common aggregation mistake: forgetting that
meanis sensitive to outliers- A few very long functions can pull the mean far above the median
- When in doubt, report both; if they differ substantially, explain why
- A join combines two dataframes by matching rows on one or more key columns
- An inner join keeps only rows where the key appears in both dataframes
- A left join keeps all rows from the left dataframe and fills with null where the right has no match
- Outer joins keep all rows from both sides
- The danger of joining without checking: row counts can explode (many-to-many join) or silently shrink (inner join on a key that many rows do not share)
- Always check
shapebefore and after a join- If the output has more rows than either input, you have a many-to-many relationship that you probably did not intend
- If the output has far fewer rows than either input, many keys did not match, which is worth investigating before proceeding
- A pivot reshapes a dataframe from long format (one row per observation) to
wide format (one row per entity with multiple value columns)
unpivotgoes the other direction: from wide to long- Long format is usually better for analysis; wide format is often more readable for humans
- Polars uses
df.pivot(on="variable", index="id", values="value")anddf.unpivot(on=["col_a", "col_b"], index="id")
- El Haji et al. studied tests generated by GitHub Copilot and found that the tests are
syntactically correct but often lack meaningful assertions [ElHaji2024]
- This is partly a data problem: if test outcomes are stored as unstructured text or log files, you cannot group by assertion type or aggregate pass rates across test suites
- Structuring test result data as a proper dataframe with columns for test name, assertion type, pass/fail, and file name makes exactly this kind of group-and-aggregate analysis straightforward
- Well-structured data enables the analysis; poorly structured data makes it impossible
"""Compare Python and JavaScript function sizes using groupby."""
import polars as pl
py = pl.read_csv("data/py_func_counts.csv").with_columns(
pl.lit("Python").alias("language")
)
js = pl.read_csv("data/js_func_counts.csv").with_columns(
pl.lit("JavaScript").alias("language")
)
combined = pl.concat([py, js])
combined = combined.with_columns(
(pl.col("lines") / pl.col("functions")).alias("lines_per_func")
)
stats = combined.group_by("language").agg(
pl.col("lines_per_func").median().alias("median"),
pl.col("lines_per_func").mean().alias("mean"),
pl.col("lines_per_func").std().alias("std"),
)
print(stats)
Check Understanding
What is the difference between a left join and an inner join? Give an example where they produce different results.
An inner join keeps only rows where the key value appears in both dataframes. A left join keeps all rows from the left dataframe and fills in null for any columns from the right dataframe when no matching key exists. For example, if you have a dataframe of all Python files and a dataframe of only those files that have test coverage data, an inner join would drop Python files with no coverage data, while a left join would keep all Python files and show null coverage for the ones not in the coverage dataframe. Which you want depends on whether you care about the files with no coverage data.
The following code is supposed to extract the mean lines-per-function for Python from a grouped result, but it contains a bug. What is wrong and how do you fix it?
result = (df
.group_by("language")
.agg(pl.col("lines").mean()))
# then later
py_mean = result["language" == "Python"]["lines"]
result = (df
.group_by("language")
.agg(pl.col("lines").mean()))
# then later
py_mean = result["language" == "Python"]["lines"]
The expression result["language" == "Python"] is not a Polars filter: it evaluates
"language" == "Python" as a Python boolean comparison between two strings, which
is False, then tries to use False as an index into the dataframe. This will either
raise an error or return an empty result. The fix is to use filter properly:
py_mean = result.filter(pl.col("language") == "Python")["lines"][0]
Why should you always check shape before and after a join?
A join can silently change the number of rows in ways that are hard to detect later. If the key column has duplicate values in both dataframes (a many-to-many relationship), the output will have more rows than either input, because every matching pair of rows produces a row in the output. If many rows in one dataframe have keys that do not appear in the other, an inner join will drop them without warning. Either situation means your subsequent analysis is working on data with a different shape than you expected. Checking shape before and after catches these problems immediately, before they propagate into aggregations or visualizations.
What would unpivot do to a dataframe with columns file, python_lines, and js_lines, and when would you want that shape?
unpivot would transform the dataframe from wide format (one row per file, two value
columns) to long format (two rows per file, one column for the language name and one
column for the line count). The result would have columns file, variable (containing
the strings python_lines and js_lines), and value (containing the actual counts).
You would want this shape when you need to do a group-by on language, pass the data to a
plotting library that expects a single value column, or apply the same filter to both
languages at once without writing the condition twice.
Exercises
Histogram of Function Sizes
Using the combined Python and JavaScript function-count data from func_counts.py, create
a histogram of lines-per-function for each language using 20-line bins from 0 to 200
lines. Plot both histograms on the same chart using Altair, with language mapped to color.
Write two sentences comparing the shapes of the two distributions: which language has
longer functions on average, and which has a longer tail?
Files in Common
Load the Python and JavaScript function-count datasets and join them on the file name column using an outer join. After the join, count how many files appear in both datasets, how many appear only in the Python dataset, and how many appear only in the JavaScript dataset. Report these three counts and write a sentence explaining what a file appearing in both datasets would imply about the repository being analyzed.
Percentile Table by Language
Compute the 10th, 25th, 75th, and 90th percentiles of lines-per-function for each language as a single Polars dataframe with one row per language and four value columns. Sort by the 50th percentile (median). Identify which language has the largest difference between its 90th and 10th percentiles and write a sentence explaining what that difference tells you about the predictability of function size in that language.
Pivot to Wide Format
Start with the combined long-format dataframe from func_counts.py that has one row per
file per language. Use pivot to produce a wide-format dataframe with one row per file
and two columns: python_lines_per_func and js_lines_per_func. Check the shape of
the result before and after the pivot. Write a sentence explaining what happened to files
that exist in only one language's dataset, and whether that behavior is appropriate for
a comparison study.
Mean Versus Median Ratio
Using the grouped statistics from func_counts.py, compute two ratios: the ratio of
mean lines-per-function in Python to mean lines-per-function in JavaScript, and the ratio
of median lines-per-function in Python to median lines-per-function in JavaScript. Report
both ratios. Write three to four sentences explaining which ratio tells the more useful
story for a researcher studying typical function size, and what the difference between the
two ratios implies about the shape of the distribution.