Software Engineering's
Greatest Hits

Greg Wilson

May 2017

The Seven Years War

Sea Battle
  • The British lost 1512 sailors to enemy action...
  • ...and 100,000 to scurvy

Oh, the Irony...

James Lind
  • James Lind (1716-94)
  • 1747: the first controlled experiment in medical history
  • Treatments compared: sea water, cider, sulfuric acid, vinegar, barley water, oranges

It Took a While

  • 1950: Hill & Doll publish a case-control study comparing smokers with non-smokers
  • Smoking causes lung cancer
  • Most people would rather fail than change

How Are We Doing?

Martin Fowler

"[Using domain-specific languages] leads to two primary benefits: improved programmer productivity [and] communication with domain experts."
— Martin Fowler, IEEE Software, 2009

How Are We Doing?

Martin Fowler
  • One of the smartest people in the industry...
  • ...made two claims in a peer-reviewed journal...
  • ...without a single citation...
  • ...because nobody expected one

We Can Do Better

  • Steady growth over 20 years of empirical studies
  • Fueled by availability of data
  • And by realization that practitioners find most "classical" software engineering research irrelevant
  • Many studies are small, and not all are well done, but the trend is clear
ICSE 2017

Are Some Languages Better Than Others?

Stefik et al 2013: An Empirical Investigation into Programming Language Syntax

  • First study compared the learnability of
    • Perl
    • Quorum (the language their team is building)
    • Randomo (a placebo whose syntax was "designed" by rolling D&D dice)
  • Conclusion: Perl is as hard for novices to learn as a language with a randomly-designed syntax

We first present two surveys conducted with students on the intuitiveness of syntax, which we used to garner formative clues on what words and symbols might be easy for novices to understand. We followed up with two studies on the accuracy rates of novices using a total of six programming languages: Ruby, Java, Perl, Python, Randomo, and Quorum. To our surprise, we found that languages using a more traditional C-style syntax (both Perl and Java) did not afford accuracy rates significantly higher than a language with randomly generated keywords, but that languages which deviate (Quorum, Python, and Ruby) did.

Are Some Languages Better Than Others?

  • Second study
    • More subjects and multiple assessment strategies
    • Languages in the C family are as hard to learn as a randomly-designed language
    • Ruby and Python are significantly easier
    • Quorum is easier still
  • Reaction has shown just how little most developers know or care about the scientific method
  • Discussed in this podcast

Is Strong Typing Better Than Dynamic Typing?

Hanenberg et al 2014: An empirical study on the impact of static typing on software maintainability

  • Is static typing useful?
  • Short answer: yes, it helps people understand undocumented code.
  • Interesting finding: people using dynamically-typed languages look at different files more frequently when programming

This paper describes an experiment that tests whether static type systems improve the maintainability of software systems, in terms of understanding undocumented code, fixing type errors, and fixing semantic errors. The results show rigorous empirical evidence that static types are indeed beneficial to these activities, except when fixing semantic errors. [Our] exploratory analysis [shows] that developers using a dynamic type system tend to look at different files more frequently when doing programming tasks—which is a potential reason for the observed differences in time.

Is Strong Typing Better Than Dynamic Typing?

Another study: 729 GitHub projects, 29,000 authors, 80 million lines of code in 17 languages.

...strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing.


...the modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size.

You Can't Just Ask Them

Rossbach et al 2010: Is Transactional Programming Actually Easier?

  • Software transactional memory treats shared memory operations like database transactions
  • How does it compare to locking?
  • Studied 147 undergrads learning concurrent programming using traditional mechanisms or STM
  • Students did better with STM...
  • ...but thought they had done worse

...we describe a user-study in which 147 undergraduate students in an operating systems course implemented the same programs using coarse and fine-grain locks, monitors, and transactions... subjective evaluation showed that students found transactions harder to use than coarse-grain locks, but slightly easier to use than fine-grained locks. Detailed examination of synchronization errors in the students' code tells a rather different story. Overwhelmingly, the number and types of programming errors the students made was much lower for transactions than for locks. On a similar programming problem, over 70% of students made errors with fine-grained locking, while less than 10% made errors with transactions.

You Really Can't Ask Them

Altadmri & Brown 2016: 37 Million Compilations: Investigating Novice Programming Mistakes in Large-Scale Student Data

  • Ask educators for learners' most common mistakes
  • Compare their answers to data from the BlueJ Blackbox project
  • Weak consensus among educators
  • Weak correlation with observations
  • Educator experience had only weak effect on results

We used the Blackbox data set to check whether the educators' opinions matched data from over 100,000 students and checked whether this agreement was mediated by educators' experience. We found that educators formed only a weak consensus about which mistakes are most frequent, [and] that their rankings bore only a moderate correspondence to the students' data.

You Really Can't Ask Them

  • Most common actual errors are:
    • #1: mismatched parentheses (not confusing = with ==)
    • #2: invoking methods with the wrong arguments
    • #3: control flow reaching the end of a non-void method without a return
  • The three that take the most time to fix are:
    1. Confusing short-circuit logical operators with their bitwise equivalents
    2. Using == instead of .equals to compare strings
    3. Ignoring the return value from a non-void method
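
The three slow-to-fix mistakes are easy to reproduce. The sketch below is illustrative only and does not come from the paper: the class `NoviceMistakes` and its method names are hypothetical, chosen to put a buggy and a fixed version of each mistake side by side.

```java
// Hypothetical illustration of the three slow-to-fix mistakes listed above.
// None of this code comes from the study itself.
final class NoviceMistakes {

    // Mistake 1: & always evaluates both operands; && skips the right-hand
    // side once the left-hand side is false.
    static boolean isEmptyStringBuggy(String s) {
        return s != null & s.isEmpty();   // throws NullPointerException when s is null
    }

    static boolean isEmptyStringFixed(String s) {
        return s != null && s.isEmpty();  // s.isEmpty() is never reached when s is null
    }

    // Mistake 2: == compares object identity; .equals compares contents.
    static boolean sameTextBuggy(String a, String b) {
        return a == b;
    }

    static boolean sameTextFixed(String a, String b) {
        return a.equals(b);
    }

    // Mistake 3: strings are immutable, so trim() returns a new value
    // that is lost unless the caller keeps it.
    static String cleanBuggy(String s) {
        s.trim();          // result is discarded
        return s;
    }

    static String cleanFixed(String s) {
        return s.trim();
    }

    public static void main(String[] args) {
        String typed = new String("yes");  // a distinct object with the same contents
        System.out.println(sameTextBuggy(typed, "yes"));
        System.out.println(sameTextFixed(typed, "yes"));
        System.out.println("[" + cleanBuggy("  yes  ") + "]");
        System.out.println("[" + cleanFixed("  yes  ") + "]");
    }
}
```

All three buggy versions compile, which is one plausible reason they take longer to fix than a mismatched parenthesis: the compiler gives no hint that anything is wrong.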

Let's Talk About Test-Driven Development...

Erdogmus et al: "How Effective is Test-Driven Development?" (in Making Software, 2010)

  • Meta-analysis of 22 quantitative studies covering 32 unique trials
  • Undergrads to professionals
  • A few person-hours to 21,600

Let's Talk About Test-Driven Development...

Erdogmus et al: "How Effective is Test-Driven Development?" (in Making Software, 2010)

[e]vidence from controlled experiments suggests an improvement in productivity when TDD is used. However...pilot studies provide mixed evidence, some in favor of and others against TDD. In the industrial studies...evidence suggests that TDD yields worse productivity. Even when considering only the more rigorous studies...the evidence is equally split for and against a positive effect.

Let's Talk About Test-Driven Development...

Fucci et al 2016: An External Replication on the Effects of Test-driven Development Using a Multi-site Blind Analysis Approach

  • 39 professionals working on real projects
  • Replication of study done by other researchers
  • No significant difference between test-first and test-last development

Method: We analyzed 82 data points collected from 39 professionals, each capturing the process used while performing a specific development task. We built regression models to assess the impact of process characteristics on quality and productivity. Quality was measured by functional correctness. Result: Quality and productivity improvements were primarily positively associated with the granularity and uniformity. Sequencing, the order in which test and production code are written, had no important influence. Refactoring effort was negatively associated with both outcomes. We explain the unexpected negative correlation with quality by possible prevalence of mixed refactoring. Conclusion: The claimed benefits of TDD may not be due to its distinctive test-first dynamic, but rather due to the fact that TDD-like processes encourage fine-grained, steady steps that improve focus and flow.

Let's Talk About Test-Driven Development...

Fucci et al 2016: A Dissection of Test-Driven Development: Does It Really Matter to Test-First or to Test-Last?

  • "The claimed benefits of TDD may not be due to its distinctive test-first dynamic, but rather due to the fact that [it] encourages fine-grained, steady steps that improve focus and flow."
  • Discussion has been heated
    • "I practice TDD...and it works great. We don't need to prove that it works anymore... [T]here are some great stories on [my] site."

What Else Can't We Measure?

A Surprising Result

Bird et al 2009: Does Distributed Development Affect Software Quality? An Empirical Case Study of Windows Vista

Geographic distribution has little effect on bug rates

Distribution of team members in the org chart is a much better predictor

Why Don't People Use UML?

Petre 2014: UML in Practice

  • Interviewed 50 experienced developers about why they do or don't use UML:
    • Lack of context: UML deals with architecture, rather than with the whole system
    • The overheads of understanding the notation
    • Synchronization and consistency
  • Shows how rigorous qualitative studies can give insights quantitative studies cannot

Responses concerning UML use tend to be polarized, between design use and implementation use... Despite the notional accommodation of the whole process, informants tend to use UML either in early design, or in implementation, rarely both (even when informants' roles include the whole process).

More About Diagrams

Cherubini & Venolia 2007: Let's Go to the Whiteboard

  • Look at what developers draw when they're talking to each other...
  • ...and how well they can understand their own drawings hours or days later
  • Diagrams are a cache for short-term memory, not archival...
  • ...which may explain why UML hasn't caught on

Most of the diagrams had a transient nature because of the high cost of changing whiteboard sketches to electronic renderings. Diagrams that documented design decisions were often externalized in these temporary drawings and then subsequently lost. Current visualization tools and the software development practices that we observed do not solve these issues...

What Happens When Teams Go Agile?

Khomh et al 2012: Do Faster Releases Improve Software Quality?

  • Looked at Firefox before and after the transition to rapid release and found:
    1. Users do not experience more post-release bugs
    2. Bugs are fixed faster
    3. When crashes do happen, they happen sooner after startup
  • Still don't have an explanation for that last one...
    • ...which is how science progresses

We found that (1) with shorter release cycles, users do not experience significantly more post-release bugs and (2) bugs are fixed faster, yet (3) users experience these bugs earlier during software execution (the program crashes earlier).

Terry Pratchett on Science

Actionable Findings

Nakshatri et al 2016: Analysis of Exception Handling Patterns in Java Projects: An Empirical Study

  • Most common catch block logs the error rather than trying to recover from it
  • Next most common are to do nothing (20% of cases) or to convert the checked exception into an unchecked exception so that it can be ignored
  • Most programmers ignore the exception hierarchy and simply catch Exception (78%) or Throwable (84%)
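
For readers who have not met these patterns by name, here is a minimal sketch of what each looks like. It is not taken from the paper: `CatchPatterns`, `load`, and the fallback value `"cached"` are all hypothetical, and the final method is included only as a contrast to show what "trying to recover" might mean.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of the catch-block patterns described above.
final class CatchPatterns {
    private static final Logger LOG = Logger.getLogger(CatchPatterns.class.getName());

    // Stand-in for any operation that can throw a checked exception.
    static String load(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("disk unavailable");
        }
        return "data";
    }

    // Pattern: log the error and carry on without recovering.
    static String logOnly(boolean fail) {
        try {
            return load(fail);
        } catch (IOException e) {
            LOG.log(Level.WARNING, "load failed", e);
            return null;
        }
    }

    // Pattern: do nothing at all.
    static String swallow(boolean fail) {
        try {
            return load(fail);
        } catch (IOException e) {
            // intentionally empty: the failure disappears
        }
        return null;
    }

    // Pattern: convert the checked exception into an unchecked one.
    static String rethrowUnchecked(boolean fail) {
        try {
            return load(fail);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pattern: ignore the exception hierarchy and catch Exception.
    static String catchEverything(boolean fail) {
        try {
            return load(fail);
        } catch (Exception e) {
            return null;
        }
    }

    // Contrast: catch the specific type and fall back to a known-good value.
    static String recoverWithFallback(boolean fail) {
        try {
            return load(fail);
        } catch (IOException e) {
            LOG.log(Level.WARNING, "load failed, using cached copy", e);
            return "cached";
        }
    }
}
```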

Actionable Findings

Yuan et al 2014: Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

  • 198 randomly selected, user-reported failures on Cassandra, Hadoop MapReduce, etc.
  • Almost all failures require <=3 nodes to reproduce
  • Error logs typically contain sufficient data to reproduce
  • Majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code
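
"Simple testing on error handling code" can be as small as injecting a fault and checking that the handler does what it claims. The sketch below assumes a made-up two-node setup; `FailoverWriter`, `Node`, and `store` are hypothetical names and are not drawn from any of the systems in the study.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: a writer that falls back to a backup node
// when the primary node fails.
final class FailoverWriter {

    // Minimal stand-in for a storage node.
    interface Node {
        void write(String record) throws IOException;
    }

    private final Node primary;
    private final Node backup;

    FailoverWriter(Node primary, Node backup) {
        this.primary = primary;
        this.backup = backup;
    }

    // Returns true if the record was stored on either node,
    // false if both nodes failed.
    boolean store(String record) {
        try {
            primary.write(record);
            return true;
        } catch (IOException primaryFailure) {
            try {
                backup.write(record);
                return true;
            } catch (IOException backupFailure) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        List<String> saved = new ArrayList<>();
        Node broken = record -> { throw new IOException("injected fault"); };
        Node healthy = record -> saved.add(record);

        // Exercise the error-handling path on purpose rather than
        // waiting for production to do it.
        System.out.println(new FailoverWriter(broken, healthy).store("r1"));
        System.out.println(saved);
        System.out.println(new FailoverWriter(broken, broken).store("r2"));
    }
}
```

Two nodes and one injected fault are enough here, which is consistent with the finding that almost all failures need three or fewer nodes to reproduce.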

Paradise Unplugged

Ford et al 2016: Paradise Unplugged: Identifying Barriers for Female Participation on Stack Overflow

  • Only 5-6% of Stack Overflow contributors are women
  • What do they find significantly more problematic than men?
    1. Lack of awareness of site features
    2. Feeling unqualified to answer questions
    3. Intimidating community size
    4. Discomfort interacting with or relying on strangers
    5. Perception that they shouldn't be "slacking"

...online communities, such as Stack Overflow...only 5.8% of contributors are female.... Through 22 semi-structured interviews with a spectrum of female users ranging from non-contributors to a top 100 ranked user of all time, we identified 14 barriers preventing them from contributing to Stack Overflow. We then conducted a survey with 1470 female and male developers to confirm which barriers are gender related or general problems for everyone.

Open Source in General

Steinmacher et al: Social Barriers Faced by Newcomers Placing Their First Contribution in Open Source Software Projects

  • Identify 58 potential barriers (including 13 social barriers)
  • What matters most?
    1. How easy is it to get set up to make a contribution?
    2. How easy is it to find a task to start with?
  • Other work has also identified "how warmly was my first contribution received?"
  • How do your code and community measure up?

...our study qualitatively analyzed social barriers that hindered newcomers' first contributions. We defined a conceptual model composed of 58 barriers including 13 social barriers. The barriers were identified from a qualitative data analysis considering different sources: a systematic literature review; open question responses gathered from OSS projects' contributors; students contributing to OSS projects; and semi-structured interviews with 36 developers from 14 different projects.

There Is No "Geek Gene"

Patitsas et al 2016: Computer Science Grades Are Not Bimodal

  • The "geek gene" is computing's most enduring and damaging myth
  • But only 5.8% of course grade distributions at a large university were actually multi-modal
  • And CS faculty are more likely to see distributions as bimodal if they think they're from a CS class
    • Even more likely if they believe some students are innately predisposed to do well in CS
  • Beliefs shape actions whose results reinforce beliefs

We statistically analyzed 778 distributions of final course grades from a large... university, and found only 5.8%...passed tests of multimodality. We then... showed 53 CS professors a series of histograms displaying ambiguous distributions and asked them to categorize the distributions. A random half of participants were primed to think about the fact that CS grades are commonly thought to be bimodal; these participants were more likely to label ambiguous distributions as "bimodal". Participants were also more likely to label distributions as bimodal if they believed that some students are innately predisposed to do better at CS.

When I Rule the World

  • Software engineering courses will include assignments like this:
    Given version control repositories for six software projects, determine whether long functions and methods are more likely to be buggy than short ones.
  • Requires tool use, model building, and statistics
  • Encourages students to do science, so they understand it, so they value it
  • Fits into existing curriculum
  • Culturally defensible
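
As a rough idea of where a student might start, here is a sketch of the final analysis step only. Everything in it is an assumption: the `Observation` class, the 30-line threshold, and the data in `main` are made up, and the hard part of the assignment (mining the repositories to decide which functions were touched by bug fixes) is not shown.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the analysis step. The mining step that would
// produce these observations from version control history is not shown.
final class LengthVsBugs {

    // One function or method: its length in lines and whether a bug-fix
    // commit ever touched it.
    static final class Observation {
        final int lines;
        final boolean buggy;

        Observation(int lines, boolean buggy) {
            this.lines = lines;
            this.buggy = buggy;
        }
    }

    // Fraction of functions flagged buggy among the long ones
    // (lines > threshold) or the short ones (lines <= threshold).
    // Returns NaN if the chosen group is empty.
    static double bugRate(List<Observation> data, int threshold, boolean longOnes) {
        int total = 0;
        int buggy = 0;
        for (Observation o : data) {
            if ((o.lines > threshold) == longOnes) {
                total++;
                if (o.buggy) {
                    buggy++;
                }
            }
        }
        return total == 0 ? Double.NaN : (double) buggy / total;
    }

    public static void main(String[] args) {
        // Made-up data, for illustration only.
        List<Observation> data = Arrays.asList(
            new Observation(5, false), new Observation(12, false),
            new Observation(8, true), new Observation(60, true),
            new Observation(75, false), new Observation(120, true));
        System.out.printf("short: %.2f, long: %.2f%n",
            bugRate(data, 30, false), bugRate(data, 30, true));
    }
}
```

Comparing two rates is only the first step. A defensible answer would also need a significance test and some control for confounders, such as long functions being touched by more commits of every kind simply because they are long.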

When I Rule the World

And this:

People of East Asian or South Asian ancestry make up 8% of the general population, but 50-60% of undergraduates in Computer Science at major universities. Write two 1000-word position papers to argue pro and con the proposition that this proves people of European descent are naturally less capable of logical thinking than their Asian counterparts.

We may not be able to teach empathy, but we can teach skepticism.


This is the world we need
right now

So let's get started