Grouping, Aggregating, and Joining

Learning Goals

Lesson

i
"""Compare Python and JavaScript function sizes using groupby."""

import polars as pl

py = pl.read_csv("data/py_func_counts.csv").with_columns(
    pl.lit("Python").alias("language")
)
js = pl.read_csv("data/js_func_counts.csv").with_columns(
    pl.lit("JavaScript").alias("language")
)
combined = pl.concat([py, js])
combined = combined.with_columns(
    (pl.col("lines") / pl.col("functions")).alias("lines_per_func")
)
stats = combined.group_by("language").agg(
    pl.col("lines_per_func").median().alias("median"),
    pl.col("lines_per_func").mean().alias("mean"),
    pl.col("lines_per_func").std().alias("std"),
)
print(stats)

Check Understanding

What is the difference between a left join and an inner join? Give an example where they produce different results.

An inner join keeps only rows where the key value appears in both dataframes. A left join keeps all rows from the left dataframe and fills in null for any columns from the right dataframe when no matching key exists. For example, if you have a dataframe of all Python files and a dataframe of only those files that have test coverage data, an inner join would drop Python files with no coverage data, while a left join would keep all Python files and show null coverage for the ones not in the coverage dataframe. Which you want depends on whether you care about the files with no coverage data.

The following code is supposed to extract the mean lines-per-function for Python from a grouped result, but it contains a bug. What is wrong and how do you fix it?
result = (df
          .group_by("language")
          .agg(pl.col("lines").mean()))
# then later
py_mean = result["language" == "Python"]["lines"]

The expression result["language" == "Python"] is not a Polars filter: it evaluates "language" == "Python" as a Python boolean comparison between two strings, which is False, then tries to use False as an index into the dataframe. This will either raise an error or return an empty result. The fix is to use filter properly:

py_mean = result.filter(pl.col("language") == "Python")["lines"][0]
Why should you always check shape before and after a join?

A join can silently change the number of rows in ways that are hard to detect later. If the key column has duplicate values in both dataframes (a many-to-many relationship), the output will have more rows than either input, because every matching pair of rows produces a row in the output. If many rows in one dataframe have keys that do not appear in the other, an inner join will drop them without warning. Either situation means your subsequent analysis is working on data with a different shape than you expected. Checking shape before and after catches these problems immediately, before they propagate into aggregations or visualizations.

What would unpivot do to a dataframe with columns file, python_lines, and js_lines, and when would you want that shape?

unpivot would transform the dataframe from wide format (one row per file, two value columns) to long format (two rows per file, one column for the language name and one column for the line count). The result would have columns file, variable (containing the strings python_lines and js_lines), and value (containing the actual counts). You would want this shape when you need to do a group-by on language, pass the data to a plotting library that expects a single value column, or apply the same filter to both languages at once without writing the condition twice.

Exercises

Histogram of Function Sizes

Using the combined Python and JavaScript function-count data from func_counts.py, create a histogram of lines-per-function for each language using 20-line bins from 0 to 200 lines. Plot both histograms on the same chart using Altair, with language mapped to color. Write two sentences comparing the shapes of the two distributions: which language has longer functions on average, and which has a longer tail?

Files in Common

Load the Python and JavaScript function-count datasets and join them on the file name column using an outer join. After the join, count how many files appear in both datasets, how many appear only in the Python dataset, and how many appear only in the JavaScript dataset. Report these three counts and write a sentence explaining what a file appearing in both datasets would imply about the repository being analyzed.

Percentile Table by Language

Compute the 10th, 25th, 75th, and 90th percentiles of lines-per-function for each language as a single Polars dataframe with one row per language and four value columns. Sort by the 50th percentile (median). Identify which language has the largest difference between its 90th and 10th percentiles and write a sentence explaining what that difference tells you about the predictability of function size in that language.

Pivot to Wide Format

Start with the combined long-format dataframe from func_counts.py that has one row per file per language. Use pivot to produce a wide-format dataframe with one row per file and two columns: python_lines_per_func and js_lines_per_func. Check the shape of the result before and after the pivot. Write a sentence explaining what happened to files that exist in only one language's dataset, and whether that behavior is appropriate for a comparison study.

Mean Versus Median Ratio

Using the grouped statistics from func_counts.py, compute two ratios: the ratio of mean lines-per-function in Python to mean lines-per-function in JavaScript, and the ratio of median lines-per-function in Python to median lines-per-function in JavaScript. Report both ratios. Write three to four sentences explaining which ratio tells the more useful story for a researcher studying typical function size, and what the difference between the two ratios implies about the shape of the distribution.