Datasets

Known

Every lesson in this tutorial uses a dataset stored as a CSV file in a data/ subdirectory relative to the lesson directory. This appendix lists each file, describes its contents, identifies its original source, and shows which lessons use it.
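As a minimal sketch of that layout, the helper below reads one of these files with Python's standard csv module. The function name load_dataset and its lesson_dir argument are illustrative assumptions, not part of the tutorial's own code, and the sketch assumes data/ sits directly under the lesson directory.

```python
import csv
from pathlib import Path


def load_dataset(filename, lesson_dir="."):
    """Read data/<filename> under lesson_dir into a list of dicts (one per row)."""
    # Assumption: the data/ directory lives directly inside lesson_dir.
    path = Path(lesson_dir) / "data" / filename
    with path.open(newline="", encoding="utf-8") as handle:
        # DictReader uses the header row as keys; every value is a string.
        return list(csv.DictReader(handle))
```

Values arrive as strings, so numeric columns such as releases or commit_count need converting before any arithmetic.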

copilot_snippets.csv
- A curated set of short code snippets generated by GitHub Copilot, with an identifier and the raw code for each snippet.
- Columns: id, code.
- Source: collected for this tutorial from Copilot-assisted coding sessions; no public download URL.
- Lessons: aitools
educator_rankings.csv
- Predictions by computer science educators about which Java programming mistakes novice students make most often, alongside actual frequencies measured by the Blackbox system, which logged compilation attempts from millions of students worldwide.
- Columns: blackbox_rank, educator_1 through educator_n (one column per surveyed educator).
- Source: [Brown2017]; raw Blackbox data at blackbox.
- Lessons: nonpar
fucci_replication.csv
- Published summary statistics from the Fucci et al. TDD studies, used as a reference baseline when checking whether a local replication of their analysis produces the same numbers.
- Columns match the outcome variables in fucci_tdd.csv.
- Source: [Fucci2016], [Fucci2017]; no separate download URL.
- Lessons: design
fucci_tdd.csv
- Outcome measurements from a controlled experiment comparing test-driven development (TDD) with test-last development (TLD).
- Columns: approach (TDD or TLD), PROD (productivity), QLTY (external quality), TESTS (number of passing tests).
- Source: [Fucci2016], [Fucci2017]; no separate download URL.
- Lessons: effectsize, tddlab
github_repos.csv
- Metadata for a sample of GitHub repositories, used to apply the Kalliamvakou criteria for distinguishing repositories that represent real software projects from those that are personal experiments, mirrors, or course assignments.
- Columns: commits, contributors, last_commit, stars.
- Source: collected via GitHub API; see [Kalliamvakou2014] for criteria.
- Lessons: threats
jccpprtTR.csv
- Individual-level measurements from Prechelt's study in which programmers solved the same telephone-book lookup problem in one of seven languages. Each row is one programmer's submission.
- Columns include person, lang (programming language), whours (hours worked), and stmtL (non-comment source lines).
- Source: [Prechelt2000], [Prechelt2019]; data PDF at prechelt-data.
- Lessons: intro, visualize
js_func_counts.csv
- Per-file metrics from a sample of public JavaScript repositories, parallel in structure to py_func_counts.csv.
- Columns: lines (total lines in file), functions (number of functions defined).
- Source: collected from public GitHub repositories via the GitHub API; no public download URL.
- Lessons: grouping, correlate
js_line_lengths.csv
- Line-length measurements from a sample of public JavaScript source files, one row per line, parallel in structure to py_line_lengths.csv.
- Columns: file_id, line_length.
- Source: collected from public GitHub repositories via the GitHub API; no public download URL.
- Lessons: hypotest
line_lengths.csv
- Aggregated line-length counts across a large sample of Python source files, distinguishing installed library code from user-written scripts.
- Columns: filepath, line_length, count.
- Source: collected from Python standard library and third-party packages installed in a typical environment; no public download URL.
- Lessons: tidy, pep8lab
numpy_commits.csv
- Commit counts per contributor for the NumPy project, used to measure inequality in contribution using the Gini coefficient.
- Column: commit_count.
- Source: collected from the NumPy GitHub repository via GitPython; no public download URL.
- Lessons: mining
programmer_hours.csv
- Self-reported daily working hours for a sample of programmers, labeled by whether the day was a weekday or weekend.
- Columns: hours, day_type.
- Source: constructed for this tutorial; no public download URL.
- Lessons: compare
py_func_counts.csv
- Per-file metrics from a sample of public Python repositories.
- Columns: lines (total lines in file), functions (number of functions defined).
- Source: collected from public GitHub repositories via the GitHub API; no public download URL.
- Lessons: grouping, visualize, correlate
py_line_lengths.csv
- Line-length measurements from a sample of public Python source files, one row per line.
- Columns: file_id, line_length.
- Source: collected from public GitHub repositories via the GitHub API; no public download URL.
- Lessons: hypotest
pypi_releases.csv
- Release counts for packages listed on the Python Package Index (PyPI), one row per package.
- Column: releases.
- Source: queried from PyPI; no separate download URL.
- Lessons: describe
scikit-learn_commits.csv
- Commit counts per contributor for the scikit-learn project, parallel in structure to numpy_commits.csv.
- Column: commit_count.
- Source: collected from the scikit-learn GitHub repository via GitPython; no public download URL.
- Lessons: mining
shell-novice_commits.csv
- Commit counts per contributor for the Software Carpentry shell-novice lesson, parallel in structure to numpy_commits.csv.
- Column: commit_count.
- Source: collected from the shell-novice GitHub repository via GitPython; no public download URL.
- Lessons: mining
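
The three commit-count files above each have a single commit_count column, which is all a Gini coefficient needs. A minimal sketch of that calculation follows, assuming the counts have already been read into a list of non-negative numbers. The function name gini is illustrative, and the mining lesson may compute the statistic differently.

```python
def gini(counts):
    """Gini coefficient of non-negative values: 0 means perfectly equal shares."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

With four contributors, gini([5, 5, 5, 5]) is 0 and gini([0, 0, 0, 10]) is 0.75; for n values the largest possible result from this formula is (n - 1) / n.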

Needed

The entries below cover data that lessons reference but that has not yet been collected. The first two are specific files with known structures; the rest are either open-ended data that learners collect themselves or published papers that instructors must supply.

Specific missing files

Fucci sleep deprivation data
- Outcome measurements from the Fucci et al. study of how one night of sleep deprivation affects novice developers' programming performance. The exercise expects two groups of roughly 22 participants each (sleep-deprived and control) with at least one continuous outcome variable per participant, so that a t-test can be compared with a Mann-Whitney test.
- Source: [Fucci2018].
- Needed by: compare — exercise "Choosing a Test Without Peeking"
Line lengths for a third programming language
- A file with the same structure as py_line_lengths.csv and js_line_lengths.csv (columns file_id and line_length, one row per source line) for at least one additional language such as Java, Ruby, or TypeScript. The exercise asks learners to apply pairwise t-tests to "at least three language pairs," which requires data for a minimum of three languages.
- Source: collect from public GitHub repositories via the GitHub API, following the same procedure used for py_line_lengths.csv and js_line_lengths.csv.
- Needed by: hypotest — exercise "Pairwise test function"
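
To illustrate why a third language is needed: two languages give one pair, while three give three. The sketch below enumerates the pairs with itertools.combinations and computes Welch's t statistic and degrees of freedom for each. It assumes line lengths have already been grouped into a dict keyed by language. The names welch_t and pairwise_welch are hypothetical, Welch's unequal-variance form is one choice among several, and a p-value would still need a statistics library.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error contributed by each sample.
    se2_a, se2_b = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


def pairwise_welch(samples):
    """Map each unordered pair of group names to its (t, df)."""
    return {
        (first, second): welch_t(samples[first], samples[second])
        for first, second in combinations(sorted(samples), 2)
    }
```

Because every pair is tested separately, the number of comparisons grows quickly with the number of languages, which is worth keeping in mind when interpreting the results.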

Open-ended data collected by learners

DevEx self-assessment ratings
- Two programming tasks the learner completed in the past two weeks — one frustrating, one productive — rated 1–5 on three DevEx dimensions (feedback loops, flow state, cognitive load). This is personal introspective data with no fixed format; each learner generates their own.
- Needed by: intro — exercise "DevEx Self-Assessment"
Recent commit messages from a public repository
- Five commit messages fetched by the learner from any open-source GitHub repository of their choice. Used to practice reading developer intent from short texts and comparing interpretations with a partner.
- Needed by: qualdata — exercise "Reading commit messages for motivation"
Ten Stack Overflow comments
- Ten comments on any Python or JavaScript Stack Overflow question, chosen by the learner. Used first for open coding and theme development, then reused for the inter-rater reliability exercise in the following lesson.
- Needed by: themes — exercise "Coding Stack Overflow comments"; themes — exercise "Documenting an interpretive choice"; reliability — exercise "Computing kappa on your own codes"
Open-source documentation examples
- Three examples of a specific Aghajani et al. documentation issue category (such as "incorrect documentation," "incomplete examples," or "missing rationale") found by the learner in any open-source library they have used.
- Needed by: themes — exercise "Finding examples of a documentation theme"
Pull request review comments from a public repository
- Ten review comments from any public GitHub repository, chosen by the learner. Used to develop and apply a three-code codebook, then shared with a partner for independent coding so that Cohen's kappa can be computed.
- Needed by: reliability — exercise "Writing and applying a codebook"
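
Cohen's kappa compares the observed agreement between two coders with the agreement expected by chance from each coder's label frequencies. A minimal sketch, assuming each coder's labels are given as equal-length lists in the same comment order; the function name cohen_kappa is illustrative only.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: fraction of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over labels.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both coders use one identical label")
    return (observed - expected) / (1 - expected)
```

For instance, if one coder assigns the labels a, a, b, b and the other assigns a, a, b, a, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.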
Learner-constructed correlation dataset
- Two numerical variables constructed or found by the learner such that Pearson's r is between 0.9 and 1.0, but a scatter plot reveals an obvious non-linearity or a single dominant outlier driving the correlation. The exercise suggests using a quadratic relationship sampled at evenly spaced x values, or any dataset where removing one point drops r below 0.5.
- Needed by: visualize — exercise "Anscombe's Fifth Example"
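
The quadratic suggestion can be checked numerically. The sketch below samples y = x² at the integers 0 through 10 (a range chosen here for illustration, not specified by the exercise) and computes Pearson's r by hand.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = list(range(11))          # evenly spaced x values 0..10
ys = [x ** 2 for x in xs]     # clearly non-linear relationship
r = pearson_r(xs, ys)         # about 0.963
```

Here r falls inside the required 0.9 to 1.0 band even though a scatter plot of xs against ys is plainly a curve.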

Published papers that instructors must supply

Short SE empirical papers for paired critique
- One short SE empirical paper per pair of learners, covering a range of study types (controlled experiment, observational study, mining study). Learners identify research questions, variables, sample sizes, statistical methods, strengths, and unacknowledged validity threats.
- Needed by: design — exercise "Paper Critique (Pairs Exercise)"; reading — exercise "Paired Paper Review"
Zieris & Prechelt (2021) paper
- The full text or abstract of Zieris & Prechelt [Zieris2021], which uses grounded theory to study pair programming. Learners identify the core category and explain how it connects to other concepts in the model.
- Needed by: grounded — exercise "Core category in Zieris & Prechelt"
Any qualitative SE paper with a threats-to-validity section
- One published qualitative SE paper chosen by the learner or supplied by the instructor. Learners read its threats-to-validity section, list acknowledged threats, identify at least one unacknowledged threat, and evaluate the seriousness of the gap.
- Needed by: themes — exercise "Evaluating threats to validity in a qualitative paper"; reading — exercise "Threats Not Acknowledged"