Bibliography

Citations used in the main body of this material have short descriptions here. For descriptions of the other citations, see Further Reading; for the papers themselves, please see the pdf directory in the project repository.

A

Abad2018: Zahra Shakeri Hossein Abad, Oliver Karras, Kurt Schneider, Ken Barker, and Mike Bauer: "Task Interruption in Software Development Projects: What Makes some Interruptions More Disruptive than Others?" arXiv 1805.05508, 2018, 10.48550/arXiv.1805.05508.
AlencarDaCosta2017: Daniel Alencar da Costa, Shane McIntosh, Weiyi Shang, Uira Kulesza, Roberta Coelho, and Ahmed E. Hassan: "A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes." IEEE Trans. Software Engineering, 43(7), 641-657, 2017, 10.1109/tse.2016.2616306.
Anda2009: B.C.D. Anda, D.I.K. Sjøberg, and A. Mockus: "Variability and Reproducibility in Software Engineering: A Study of Four Companies that Developed the Same System." IEEE Trans. Software Engineering, 35(3), 407-429, 2009, 10.1109/tse.2008.89.
Aniche2021: Mauricio Aniche, Christoph Treude, and Andy Zaidman: "How Developers Engineer Test Cases: An Observational Study." IEEE Trans. Software Engineering, 2021, 10.1109/tse.2021.3129889.
Aranda2009: Jorge Aranda and Gina Venolia: "The secret life of bugs: Going past the errors and omissions in software repositories." Proc. ICSE'09, 2009, 10.1109/icse.2009.5070530.

B

Baltes2025: Sebastian Baltes, Florian Angermeir, Chetan Arora, Marvin Muñoz Barón, Chunyang Chen, Lukas Böhme, Fabio Calefato, Neil Ernst, Davide Falessi, Brian Fitzgerald, Davide Fucci, Marcos Kalinowski, Stefano Lambiase, Daniel Russo, Mircea Lungu, Lutz Prechelt, Paul Ralph, Christoph Treude, and Stefan Wagner: "Evaluation Guidelines for Empirical Studies in Software Engineering involving LLMs." arXiv 2508.15503, 2025, 10.48550/arXiv.2508.15503.
Bano2025: Muneera Bano, Hashini Gunatilake, and Rashina Hoda: "What Does a Software Engineer Look Like? Exploring Societal Stereotypes in LLMs." arXiv 2501.03569, 2025, 10.48550/arXiv.2501.03569.
Basili1987: V.R. Basili and R.W. Selby: "Comparing the Effectiveness of Software Testing Strategies." IEEE Trans. Software Engineering, SE-13(12), 1278-1296, 1987, 10.1109/tse.1987.232881.
Basili1994: Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach: "The Goal Question Metric Approach." In John Marciniak (ed.), Encyclopedia of Software Engineering, Wiley, 1994.
Introduces the Goal/Question/Metric (GQM) framework for systematically defining software measurements by linking metrics to explicit goals and intermediate questions.
Bauer2019: Jennifer Bauer, Janet Siegmund, Norman Peitek, Johannes C. Hofmeister, and Sven Apel: "Indentation: Simply a Matter of Style or Support for Program Comprehension?" Proc. ICPC'19, 2019, 10.1109/icpc.2019.00033.
Beck2023: Kent Beck: "Measuring Developer Productivity: Real-World Examples." Substack, 2023. https://tidyfirst.substack.com/p/measuring-developer-productivity
Begel2014: Andrew Begel and Nachiappan Nagappan: "Analyze This! 145 Questions for Data Scientists in Software Engineering." Proc. ICSE'14, 2014, 10.1145/2568225.2568233.
Two surveys producing 145 data science questions for SE research in 12 categories; engineers prioritize customer usage questions; oppose questions assessing or comparing individual employee performance.
Behroozi2020: Mahnaz Behroozi, Shivani Shirolkar, Titus Barik, and Chris Parnin: "Does Stress Impact Technical Interview Performance?" Proc. ESEC/FSE'20, 481-492, 2020, 10.1145/3368089.3409712.
Beller2015: Moritz Beller, Georgios Gousios, Annibale Panichella, and Andy Zaidman: "When, how, and why developers (do not) test in their IDEs." Proc. FSE'15, 2015, 10.1145/2786805.2786843.
Large-scale field study of 416 software engineers over 5 months (13+ years of IDE activity); majority do not test; TDD not widely practiced; developers spend 25% of time on tests but believe they spend 50%.
Beller2018: Moritz Beller, Niels Spruit, Diomidis Spinellis, and Andy Zaidman: "On the Dichotomy of Debugging Behavior Among Programmers." Proc. ICSE'18, 2018, 10.1145/3180155.3180175.
Beller2019: Moritz Beller, Georgios Gousios, Annibale Panichella, Sebastian Proksch, Sven Amann, and Andy Zaidman: "Developer Testing in the IDE: Patterns, Beliefs, and Behavior." IEEE Trans. Software Engineering, 45(3), 261-284, 2019, 10.1109/tse.2017.2776152.
Bettenburg2008: Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann: "What makes a good bug report?" Proc. SIGSOFT/FSE'08, 2008, 10.1145/1453101.1453146.
A survey of 466 Apache, Eclipse, and Mozilla developers identified which elements of bug reports practitioners find most useful. Stack traces, test cases, and steps to reproduce ranked highest, while information that reporters consider important was often missing from submitted reports.
Bird2011: Christian Bird, Nachiappan Nagappan, Brendan Murphy, Harald Gall, and Premkumar Devanbu: "Don't Touch My Code! Examining the Effects of Ownership on Software Quality." Proc. SIGSOFT/FSE'11, 2011, 10.1145/2025113.2025119.
Bogart2016: Christopher Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung: "How to break an API: cost negotiation and community values in three software ecosystems." Proc. FSE'16, 2016, 10.1145/2950290.2950325.
Bouzenia2025: Islem Bouzenia and Michael Pradel: "Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories." Proc. ASE'25, 2025, 10.1109/ASE63991.2025.002344.
Brandt2009: Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer: "Two studies of opportunistic programming: interleaving web foraging, learning, and writing code." Proc. CHI'09, 2009, 10.1145/1518701.1518944.
Two studies (lab + query log) of programmers using web resources; identifies three purposes: just-in-time learning, knowledge extension, and reminding; queries for different purposes differ in style and duration.
Braun2019: Virginia Braun and Victoria Clarke: "Reflecting on Reflexive Thematic Analysis." Qualitative Research in Sport, Exercise and Health, 11(4), 2019, 10.1080/2159676X.2019.1628806.
Clarifies reflexive thematic analysis as a distinct qualitative methodology, contrasting it with other forms of thematic analysis.
Brown2024: Eva Maxfield Brown, Cailean Osborne, Peter Cihon, Moritz Böhmecke-Schwafert, Kevin Xu, Mirko Boehm, and Knut Blind: "Measuring Software Innovation with Open Source Software Development Data." arXiv 2411.05087, 2024, 10.48550/arXiv.2411.05087.
Butler2023: Jenna Butler, Thomas Zimmermann, and Christian Bird: "Objectives and Key Results in Software Teams: Challenges, Opportunities and Impact on Development." arXiv 2311.00236, 2023, 10.48550/arXiv.2311.00236.

C

Campbell1963: Donald T. Campbell and Julian C. Stanley: Experimental and Quasi-Experimental Designs for Research. Houghton Mifflin, 1963.
Classic research design textbook establishing the concepts of internal and external validity and the taxonomy of experimental and quasi-experimental designs.
Cosentino2016: Valerio Cosentino, Javier Luis, and Jordi Cabot: "Findings from GitHub." Proc. MSR'16, 2016, 10.1145/2901739.2901776.

D

Davis2023: Matthew C. Davis, Emad Aghayi, Thomas D. LaToza, Xiaoyin Wang, Brad A. Myers, and Joshua Sunshine: "What's (Not) Working in Programmer User Studies?" ACM Trans. Software Engineering and Methodology, 32(5), 1-32, 2023, 10.1145/3587157.
Devanbu2016: Prem Devanbu, Thomas Zimmermann, and Christian Bird: "Belief & evidence in empirical software engineering." Proc. ICSE'16, 2016, 10.1145/2884781.2884812.
Case study of developer beliefs at Microsoft vs. empirical project data; beliefs are strong but formed from personal experience rather than research; do not reliably correspond to actual evidence; recommends better dissemination of empirical findings.
Diener2010: Ed Diener, Derrick Wirtz, William Tov, Chu Kim-Prieto, Dong-won Choi, Shigehiro Oishi, and Robert Biswas-Diener: "New well-being measures: Short scales to assess flourishing and positive and negative feelings." Social Indicators Research, 97, 2010, 10.1007/s11205-009-9493-y.
Presents validated short scales for measuring human flourishing and positive/negative affect as components of subjective well-being.

E

ElEmam2001: K. El Emam, S. Benlarbi, N. Goel, and S.N. Rai: "The confounding effect of class size on the validity of object-oriented metrics." IEEE Trans. Software Engineering, 27(7), 2001, 10.1109/32.935855.
ElHaji2024: Khalid El Haji, Carolin Brandt, and Andy Zaidman: "Using GitHub Copilot for Test Generation in Python: An Empirical Study." Proc. AST'24, 2024, 10.1145/3644032.3644443.
Erdogmus2005: Hakan Erdogmus, Maurizio Morisio, and Marco Torchiano: "On the Effectiveness of the Test-First Approach to Programming." IEEE Trans. Software Engineering, 31(3), 2005, 10.1109/tse.2005.37.
Controlled experiment finding that TDD does not inherently improve code quality, but that test quantity regardless of writing order was the key driver of programmer productivity.

F

FernandezPinto2023: Manuela Fernández Pinto and Daniel Fernández Pinto: "Epistemic diversity and industrial selection bias." Synthese, 201(5), 2023, 10.1007/s11229-023-04158-7.
Flournoy2025: John C. Flournoy, Carol S. Lee, Maggie Wu, and Catherine M. Hicks: "No Silver Bullets: Why Understanding Software Cycle Time is Messy, Not Magic." arXiv 2503.05040, 2025, 10.48550/arXiv.2503.05040.
Forsgren2018: Nicole Forsgren, Jez Humble, and Gene Kim: Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018.
Presents evidence that elite DevOps organizations achieve both high speed and stability, and identifies CI/CD, lean management, and learning culture as the key predictors of performance.
Forsgren2021: Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler: "The SPACE of Developer Productivity." ACM Queue, 19(1), 2021.
Brief commentary noting developer productivity is more complex than commonly assumed.
Fu2025: Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, Jiaxin Yu, and Jinfu Chen: "Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study." ACM Trans. Software Engineering and Methodology, 34(8), 2025, 10.1145/3716848.
Fucci2013: Davide Fucci, Burak Turhan, and Markku Oivo: "Impact of Process Conformance on the Effects of Test-Driven Development." Proc. ESEM'13, 2013, 10.1109/esem.2013.19.
Fucci2016: Davide Fucci, Giuseppe Scanniello, Simone Romano, Martin Shepperd, Boyce Sigweni, Fernando Uyaguari, Burak Turhan, Natalia Juristo, and Markku Oivo: "An External Replication on the Effects of Test-driven Development Using a Multi-site Blind Analysis Approach." Proc. ESEM'16, 2016, 10.1145/2961111.2962592.
Fucci2017: Davide Fucci, Hakan Erdogmus, Burak Turhan, Markku Oivo, and Natalia Juristo: "A Dissection of the Test-Driven Development Process: Does It Really Matter to Test-First or to Test-Last?" IEEE Trans. Software Engineering, 43(7), 2017, 10.1109/tse.2016.2616877.
Fucci2018: Davide Fucci, Giuseppe Scanniello, Simone Romano, and Natalia Juristo: "Need for Sleep: The Impact of a Night of Sleep Deprivation on Novice Developers' Performance." arXiv 1805.02544, 2018, 10.48550/arXiv.1805.02544.

G

Girardi2020: Daniela Girardi, Nicole Novielli, Davide Fucci, and Filippo Lanubile: "Recognizing Developers' Emotions While Programming." Proc. ICSE'20, 2020, 10.1145/3377811.3380374.
Gold2020: Nicolas E. Gold and Jens Krinke: "Ethical Mining." Proc. MSR'20, 2020, 10.1145/3379597.3387462.
Goodhart1984: Charles Goodhart: "Problems of Monetary Management: The U.K. Experience." In Anthony Courakis (ed.), Inflation, Depression, and Economic Policy in the West, Rowman and Littlefield, 1984.
Articulates what became Goodhart's Law: when a measure becomes a policy target it ceases to be a good measure, because agents optimize for the metric rather than the underlying goal it represents.
Gote2022: Christoph Gote, Pavlin Mavrodiev, Frank Schweitzer, and Ingo Scholtes: "Big Data = Big Insights? Operationalising Brooks' Law in a Massive GitHub Data Set." arXiv 2201.04588, 2022, 10.48550/arXiv.2201.04588.
Graziotin2018: Daniel Graziotin, Fabian Fagerholm, Xiaofeng Wang, and Pekka Abrahamsson: "What Happens When Software Developers Are (Un)Happy." Journal of Systems and Software, 140, 2018, 10.1016/j.jss.2018.02.041.
Mixed-methods study finding that unhappy developers are less productive, produce lower-quality work, and have higher intent to leave their jobs.

H

Hall2019: Erika Hall: Just Enough Research. A Book Apart, 2nd ed., 2019, 9781952616082.
Practical guide to user research for designers and product teams, arguing that the goal of research is to reduce uncertainty enough to act wisely, and that small, focused studies are almost always more useful than no research at all.
Harman2001: Mark Harman and Bryan F. Jones: "Search-Based Software Engineering." Information and Software Technology, 43(14), 2001.
Introduces search-based software engineering, proposing metaheuristic search techniques as a general framework for automating software engineering tasks as optimization problems.
Hindle2016: Abram Hindle, Earl T. Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu: "On the naturalness of software." Comm. ACM, 59(5), 2016, 10.1145/2902362.
N-gram models show code is more repetitive and predictable than natural language; validates naturalness hypothesis; demonstrates improved Java code completion in Eclipse using statistical language models.

I

Inozemtseva2014: Laura Inozemtseva and Reid Holmes: "Coverage is Not Strongly Correlated with Test Suite Effectiveness." Proc. ICSE'14, 2014, 10.1145/2568225.2568271.

J

Johnson2013: Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge: "Why don't software developers use static analysis tools to find bugs?" Proc. ICSE'13, 2013, 10.1109/icse.2013.6606613.
Interview study with 20 developers on why static analysis tools are underused; all felt use is beneficial but false positives and unhelpful warning presentation are the main barriers; recommends interactive defect-fixing mechanisms.
Junior2009: Gibeon Soares de Aquino Junior and Silvio Romero de Lemos Meira: "Towards Effective Productivity Measurement in Software Projects." Proc. SEA'09, 2009, 10.1109/icsea.2009.44.
Juristo2001: Natalia Juristo and Ana M. Moreno: Basics of Software Engineering Experimentation. Springer, 2001, 9780792379904.
Textbook introducing the principles and techniques of controlled experimentation in software engineering, covering design, analysis, and validity evaluation.

K

Kalliamvakou2014: Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian: "The Promises and Perils of Mining GitHub." Proc. MSR'14, 2014, 10.1145/2597073.2597074.
Empirical analysis of GitHub data revealing systematic biases including that most projects are personal and inactive, and that pull-request data routinely misrepresents actual collaboration.
Kamei2013: Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E. Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi: "A large-scale empirical study of just-in-time quality assurance." IEEE Trans. Software Engineering, 39(6), 2013, 10.1109/tse.2012.70.
Kampenes2007: Vigdis By Kampenes, Tore Dybå, Jo Erskine Hannay, and Dag I.K. Sjøberg: "A Systematic Review of Effect Size in Software Engineering Experiments." Information and Software Technology, 49(11-12), 2007.
Systematic review finding that effect sizes are rarely reported in SE experiments and when reported are mostly small, suggesting many statistically significant SE results may not be practically meaningful.
Ko2007: Amy J. Ko, Robert DeLine, and Gina Venolia: "Information Needs in Collocated Software Development Teams." Proc. ICSE'07, 2007, 10.1109/icse.2007.45.
Observation study of 17 developers at a large software company; identifies 21 information types sought during change tasks; most frequently deferred: design rationale and program behavior; unavailable coworkers most common blocker.

L

Liang2024: Jenny T. Liang, Chenyang Yang, and Brad A. Myers: "A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges." Proc. ICSE'24, 2024, 10.1145/3597503.3608128.
Survey of 410 developers finding that AI coding assistants are valued for reducing keystrokes but that trust, correctness verification, and context-awareness remain significant usability challenges.

M

Maalej2014: Walid Maalej, Rebecca Tiarks, Tobias Roehm, and Rainer Koschke: "On the Comprehension of Program Comprehension." ACM Trans. Software Engineering and Methodology, 23(4), 2014, 10.1145/2622669.
Mark2008: Gloria Mark, Daniela Gudith, and Ulrich Klocke: "The Cost of Interrupted Work: More Speed and Stress." Proc. CHI'08, 2008.
Controlled study finding that interrupted workers compensate by working faster to complete tasks in equivalent time, but do so at the cost of significantly higher stress and frustration.
McKinsey2023: Nora Elsayed, Tarek Elhounsri, and Sven Blumberg: "Yes, You Can Measure Software Developer Productivity." McKinsey & Company, 2023, https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/yes-you-can-measure-software-developer-productivity.
An embarrassingly bad collection of muddled claims about measuring developer productivity. Somebody probably got promoted for writing it.
Medlock2002: Michael C. Medlock, Dennis Wixon, Mark Terrano, Ramon Ruflair, and Darrell Vaughan: "Using the RITE Method to Improve Products: A Definition and a Case Study." Proc. Usability Professionals Association Conference, 2002.
Introduces the Rapid Iterative Testing and Evaluation (RITE) method, in which usability problems identified during a session are addressed before the next session, enabling rapid iteration with small participant pools.
Meyer2017: André N. Meyer, Laura E. Barton, Gail C. Murphy, Thomas Zimmermann, and Thomas Fritz: "The Work Life of Developers: Activities, Switches and Perceived Productivity." IEEE Trans. Software Engineering, 43(12), 2017, 10.1109/tse.2017.2656886.
Monitoring 20 developers over 11 work-days shows more user input correlates with higher perceived productivity; emails and planned meetings correlate negatively; productivity is highly personal and varies by time of day.
Meyer2021: André N. Meyer, Earl T. Barr, Christian Bird, and Thomas Zimmermann: "Today Was a Good Day: The Daily Life of Software Developers." IEEE Trans. Software Engineering, 47(5), 2021, 10.1109/tse.2019.2904957.
Miller2025: Courtney Miller, Rudrajit Choudhuri, Mara Ulloa, Sankeerti Haniyur, Robert DeLine, Margaret-Anne Storey, Emerson Murphy-Hill, Christian Bird, and Jenna L. Butler: "'Maybe We Need Some More Examples:' Individual and Team Drivers of Developer GenAI Tool Use." arXiv 2507.21280, 2025, 10.48550/arXiv.2507.21280.
Mockus2010: Audris Mockus: "Organizational Volatility and Its Effects on Software Defects." Proc. SIGSOFT/FSE'10, 2010, 10.1145/1882291.1882311.
Muller2015: Sebastian C. Müller and Thomas Fritz: "Stuck and Frustrated or in Flow and Happy: Sensing Developers' Emotions and Progress." Proc. ICSE'15, 2015, 10.1109/icse.2015.334.
Lab study (n=17) of developer emotions and biometric sensors during change tasks; emotions correlate with perceived progress; classifier achieves 71% accuracy for positive/negative emotion and 68% for low/high progress.
Munaiah2017: Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan: "Curating GitHub for engineered software projects." Empirical Software Engineering, 22(6), 2017, 10.1007/s10664-017-9512-6.
Framework classifying 1.8M+ GitHub repos as engineered software vs. noise; best classifier achieves 82% precision / 86% recall; outperforms stargazer-based approaches which have high precision but low recall.

N

Nagappan2008: Nachiappan Nagappan, Brendan Murphy, and Victor Basili: "The Influence of Organizational Structure on Software Quality: An Empirical Case Study." Proc. ICSE'08, 2008, 10.1145/1368088.1368160.
Newman2023: Kaia Newman, Madeline Endres, Brittany Johnson, and Westley Weimer: "From Organizations to Individuals: Psychoactive Substance Use By Professional Programmers." arXiv 2305.01056, 2023, 10.48550/arXiv.2305.01056.
Nielsen1993: Jakob Nielsen and Thomas K. Landauer: "A Mathematical Model of the Finding of Usability Problems." Proc. INTERACT'93 and CHI'93, 206-213, 1993, 10.1145/169059.169166.
Develops a mathematical model showing that approximately five participants are sufficient to identify most major usability problems in a focused task set, under the assumption of formative testing with a reasonably homogeneous user group.

O

Obi2024: Ike Obi, Jenna Butler, Sankeerti Haniyur, Brian Hassan, Margaret-Anne Storey, and Brendan Murphy: "Identifying Factors Contributing to Bad Days for Software Developers: A Mixed Methods Study." arXiv 2410.18379, 2024, 10.48550/arXiv.2410.18379.

P

Pearce2022: Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri: "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions." Proc. S&P'22, 2022, 10.1109/SP46214.2022.9833571.
Found that approximately 40% of code generated by GitHub Copilot across 89 security-relevant scenarios contained vulnerabilities drawn from the MITRE CWE Top 25 list.
Peng2023: Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer: "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." arXiv 2302.06590, 2023, 10.48550/arXiv.2302.06590.
Randomized controlled experiment claiming that GitHub Copilot users completed a JavaScript coding task 55.8% faster than the control group.
Prechelt2000: Lutz Prechelt: "An Empirical Comparison of Seven Programming Languages." IEEE Computer, 33(10), 2000, 10.1109/2.876288.
Compares 80 implementations of a phone-code program in C, C++, Java, Perl, Python, Rexx, Tcl; scripting languages require less code and effort but are slower; significant variation within each language.

Q

R

Ray2017: Baishakhi Ray, Daryl Posnett, Premkumar Devanbu, and Vladimir Filkov: "A large-scale study of programming languages and code quality in GitHub." Comm. ACM, 60(10), 2017, 10.1145/3126905.
Risse2025: Niklas Risse, Jing Liu, and Marcel Böhme: "Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection." Proc. ACM Software Engineering, 2025, 10.1145/3728887.
Survey of ML vulnerability detection literature; 90% frame it as function-level binary classification; context is almost always necessary for accurate judgement; high scores achievable via spurious correlations; calls the prevailing problem statement ill-defined.

S

Sackman1968: Harold Sackman, W.J. Erikson, and E.E. Grant: "Exploratory Experimental Studies Comparing Online and Offline Programming Performance." Comm. ACM, 11(1), 1968.
Early empirical study finding up to 28:1 variation in individual programmer performance, which dwarfed any treatment effect and is often cited as the origin of the "10x programmer" concept.
Sadowski2019: Caitlin Sadowski and Thomas Zimmermann (eds.): Rethinking Productivity in Software Engineering. Apress, 2019, 9781484242216.
Edited volume collecting research and practitioner perspectives on how to understand, define, and measure software developer productivity.
SanchezRuiz2023: José Manuel Sánchez Ruiz, Francisco José Domínguez Mayo, Xavier Oriol, José Francisco Crespo, David Benavides, and Ernest Teniente: "A Benchmarking Proposal for DevOps Practices on Open Source Software Projects." arXiv 2304.14790, 2023, 10.48550/arXiv.2304.14790.
Sedano2017: Todd Sedano, Paul Ralph, and Cécile Péraire: "Software Development Waste." Proc. ICSE'17, 2017, 10.1109/icse.2017.20.
Two-year participant-observation study at Pivotal identifies 9 types of software development waste: wrong features, backlog mismanagement, rework, unnecessary complexity, cognitive load, psychological distress, waiting, knowledge loss, and poor communication.
Sillito2008: J. Sillito, G.C. Murphy, and K. De Volder: "Asking and Answering Questions during a Programming Change Task." IEEE Trans. Software Engineering, 34(4), 2008, 10.1109/tse.2008.26.
Two qualitative studies of programmers during change tasks; produces catalog of 44 question types; describes information-seeking behavior and how well existing tools support answering these questions.
Silva2016: Danilo Silva, Nikolaos Tsantalis, and Marco Tulio Valente: "Why We Refactor? Confessions of GitHub Contributors." Proc. FSE'16, 2016, 10.1145/2950290.2950305.
Spinellis2024: Diomidis Spinellis, Panos Louridas, Maria Kechagia, and Tushar Sharma: "Broken Windows: Exploring the Applicability of a Controversial Theory on Code Quality." arXiv 2410.13480, 2024, 10.48550/arXiv.2410.13480.
Stapleton2020: Sean Stapleton, Yashmeet Gambhir, Alexander LeClair, Zachary Eberhart, Westley Weimer, Kevin Leach, and Yu Huang: "A Human Study of Comprehension and Code Summarization." Proc. ICPC'20, 2020, 10.1145/3387904.3389258.
Storey2022: Margaret-Anne Storey, Brian Houck, and Thomas Zimmermann: "How Developers and Managers Define and Trade Productivity for Quality." Proc. CHASE'22, 2022, 10.1145/3528579.3529177.
Storey2024: Margaret-Anne Storey, Rashina Hoda, Alessandra Maciel Paz Milani, and Maria Teresa Baldassarre: "Guidelines for Using Mixed Methods Research in Software Engineering." arXiv 2404.06011, 2024, 10.48550/arXiv.2404.06011.

T

Thongtanunam2016: Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida: "Revisiting Code Ownership and Its Relationship with Software Quality in the Scope of Modern Code Review." Proc. ICSE'16, 2016, 10.1145/2884781.2884852.
Thornberg2014: Robert Thornberg and Kathy Charmaz: "Grounded Theory and Theoretical Coding." In Uwe Flick (ed.), The SAGE Handbook of Qualitative Data Analysis, SAGE, 2014, 10.4135/9781446282243.
Explains grounded theory and theoretical coding as qualitative analysis tools, emphasizing systematic yet flexible concept development from data.
Tregubov2017: Alexey Tregubov, Barry Boehm, Natalia Rodchenko, and Jo Ann Lane: "Impact of task switching and work interruptions on software development processes." Proc. ICSSP'17, 2017, 10.1145/3084100.3084116.
Treude2024: Christoph Treude: "Qualitative Data Analysis in Software Engineering: Techniques and Teaching Insights." arXiv 2406.08228, 2024, 10.48550/arXiv.2406.08228.
Tufano2017: Michele Tufano, Fabio Palomba, Gabriele Bavota, Rocco Oliveto, Massimiliano Di Penta, Andrea De Lucia, and Denys Poshyvanyk: "When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away)." IEEE Trans. Software Engineering, 43(11), 2017, 10.1109/tse.2017.2653105.
Large empirical study of 200 OSS project histories; most code smells are introduced when artifacts are created, not during evolution; 80% survive; only 9% of removed smells are directly caused by refactoring operations.

U

Uyaguari2024: Fernando Uyaguari, Silvia T. Acuña, John W. Castro, Davide Fucci, Oscar Dieste, and Sira Vegas: "Relevant information in TDD experiment reporting." ACM Trans. Software Engineering and Methodology, 2024, 10.1145/3688837.

V

Vartziotis2025: Tina Vartziotis, Maximilian Schmidt, George Dasoulas, Ippolyti Dellatolas, Stefano Attademo, Viet Dung Le, Anke Wiechmann, Tim Hoffmann, Michael Keckeisen, and Sotirios Kotsopoulos: "Carbon Footprint Evaluation of Code Generation through LLM as a Service." arXiv 2504.01036, 2025, 10.48550/arXiv.2504.01036.

W

Wessel2021: Mairieli Wessel, Igor Wiese, Igor Steinmacher, and Marco Aurelio Gerosa: "Don't Disturb Me: Challenges of Interacting with Software Bots on Open Source Software Projects." Proc. ACM Human-Computer Interaction, 2021, 10.1145/3476042.
Interview study of 21 OSS practitioners on bots in pull requests; identifies noise (overwhelming and distracting bot output) as central problem; develops theory of annoying bot behavior as noise; recommendations for bot and platform designers.
Wicherts2011: J.M. Wicherts, M. Bakker, and D. Molenaar: "Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results." PLoS ONE, 6(11): e26828, 2011, 10.1371/journal.pone.0026828.
Found that the reluctance to share data was associated with weaker evidence and a higher prevalence of statistical errors. The unwillingness to share data was particularly clear when reporting errors had a bearing on statistical significance.
Wohlin2000: Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, 2000, 9783662693056.
Introduces principles and methods for conducting controlled experiments in software engineering, covering design, execution, analysis, and identification of validity threats.
Wyrich2023: Marvin Wyrich: "Source Code Comprehension: A Contemporary Definition and Conceptual Model for Empirical Investigation." arXiv 2310.11301, 2023, 10.48550/arXiv.2310.11301.

X

Y

Z

Zieris2014: Franz Zieris and Lutz Prechelt: "On knowledge transfer skill in pair programming." Proc. ESEM'14, 2014, 10.1145/2652524.2652529.
Qualitative analysis of industrial pair programming recordings; efficient pairs avoid explaining multiple things at once, maintain topic focus, and clarify in stages; identifies knowledge transfer as a distinct skill beyond programming ability.