Software Engineering: An Evidence-Based Approach

Are some programming languages easier for novices to pick up than others? Does test-driven development produce higher-quality code in less time? Is there any truth behind the “10X programmer” meme? And if a team has limited resources for testing, what should they focus on? Empirical research has answers to all of these questions and more, but most programmers don’t know that research exists. As a result, many continue to use inefficient or insecure practices when better alternatives exist.

I would like teach programmers (both students and practitioners) what we actually know about software engineering and why we believe it’s true. After a few false starts, I believe the best way to do this is to teach them some basic data science, then have them use their new knowledge and skills to understand analyses of what makes software engineers more effective. This show-don’t-tell approach would ensure that they know what we already know, that what they learn sticks, and that they have the tools they need to analyze and improve their own work.

More specifically, I would like to create lessons suitable for self-study, industry training, and classroom use that would:

  1. introduce key data science concepts so that learners can understand the methods and claims of empirical studies;

  2. teach them how to reproduce (scaled-down versions of) key results using Python and freely-available datasets; and

  3. present and discuss those key results and others.

All of the material would published with a Creative Commons license to enable extension and translation. I would work with instructors in both academia and industry to ensure that the material is relevant, approachable, and actually adopted. Once the material stabilized it would be published as a book.


How long will this take?
I believe a minimum viable set of lessons can be ready to test in six months, and the first release can be ready six months after that.
What do you need?
Funding for one person for 12 months to build, try out, and proselytize the lessons; funding for a graduate-level researcher for 6 months to study the use and impact of these lessons would help a lot too.
What are your qualifications for doing this?
Software Carpentry, a non-profit I founded in 2010 to teach programming skills to researchers, has delivered several thousand workshops worldwide to over 80,000 people; the books I organized and edited on software architecture and empirical software engineering research have been read by tens of thousands of people; I received the ACM SIGSOFT Influential Educator of the Year Award in 2020; and I have extensive contacts in both academia and industry.
Why can’t you do this on your own time?
I tried: it was going to be next after Teaching Tech Together, Software Tools in JavaScript, and Building Software Together. What I found while prototyping is that the material requires so much more unbroken concentration that I don’t think I will complete it as an evening-and-weekend project.
Why are you teaching data science rather than just summarizing published results?
The average computer science student does a lot less data analysis in college than the average biologist or economist, so most programmers don’t have the background needed to understand what’s actually being claimed or the evidence for it. If we’re going to reach the majority we need to give them a foundation; presenting this as “let’s teach you the Python you need to wrangle data” seems most likely to hook them.
Why are you teaching data science when there are so many good free courses already?
People learn best if the examples are drawn from their domain: people in finance want finance examples, people in pharma want pharma data, etc., so we should use software engineering data to introduce statistical thinking to programmers. Generic courses will reach the 5-15% of programmers who are just generally interested in learning new things, but I’d like to go wider than that.
What data science will these lessons cover?
Data cleaning, reproducible analysis, significance testing, linear regression, clustering, and classic visualization will be enough for people to follow and evaluate the overviews that follow. This material is necessary because many programmers have no statistical training and need to be introduced to the software tools used later.
What research findings will be covered?
The questions raised in the opening paragraph will definitely be addressed; other topics will be prioritized by polling an advisory group to find out what’s most likely to appeal to the target audience, and based on what datasets researchers are willing to share. A few favorites include showing that the distribution of error messages follow Zipf’s Law, Altadmri & Brown’s work on the mistakes novices make (and the fact that their teachers mis-predict common errors), Xu et al’s work on how infrequently used most configuration options are, Nakshatri et al’s demonstration that exceptions are mostly ignored, and Zeller et al’s “IROP” study.
What tools will be used for practical examples?
We will use the scientific Python stack, primarily because the average programmer is more likely to have encountered Python than R, Julia, or other data science languages.
Will these lessons cover qualitative methods as well as quantitative ones?
They should—learners should be aware of the unique insights that qualitative methods can produce—but it will depend on whether compelling exercises can be created.
Do you think people will actually read it?
Yes: something that bills itself as “data science for software engineers using software engineering examples” is going to strike a lot of chords, as will “here’s what we actually know about building better software faster”.
Who else has tried to do this?
Facts and Fallacies of Software Engineering, Rapid Development, Making Software, and Accelerate have presented empirical evidence about software development in one form or another (and have all sold pretty well). Empirical Software Engineering using R is more recent and more comprehensive, but is too dense for our target audience.