Why We Don't Teach Testing (Even Though We'd Like To)

Posted 2014-10-30

If you haven't been following Lorena Barba's course on numerical methods in Python, you should. It's a great example of how to use emerging tools to teach more effectively, and if we ever run Software Carpentry online again, we'll do it her way. Yesterday, though, when she posted this notebook, I tweeted, "Beautiful… but where are the unit tests?" In the wake of the discussion that followed, I'd like to explain why we no longer require people to teach testing as part of the Software Carpentry core, and then ask you all a favor.

To begin with, though, I should make three things clear. First, I believe very strongly that testing is a key software development practice–so much so that I'm very reluctant to use any library that doesn't come with a suite of tests. Second, I believe that scientific software is just as testable as any other kind of software, and that a lot of scientists test their software well. Third, I think it's great that several of our instructors do still teach testing, and I'd like to see it back in the core some day.

So why was testing taken off the list of topics that must be taught in order for a workshop to be called "Software Carpentry"? The answer is that our lessons weren't effective: while most learners adopted shell scripting, started writing functions, and put their work under version control after a workshop, very few started writing unit tests.

The problem isn't the concept of unit testing: we can explain that to novices in just a couple of minutes. The problem isn't a lack of accessible unit testing frameworks, either: we can teach people Nose just as soon as they've learned functions. The problem is what comes next. What specific tests do we actually teach them to write? Every answer we have (a) depends on social conventions that don't yet exist, and (b) isn't broadly appealing.

For example, suppose we wanted to test the the Python 3 entry in the n-body benchmark game. The key function, advance, moves the system forward by a single time step. It would be pretty easy to construct a two-body system with a unit mass at the origin and another mass one AU away, figure out how far each should move in a single day, and check that the function got the right answer, but anything more complicated than that runs into numerical precision issues. At some point, we have to decide whether the actual answer is close enough to the expected answer to count as a pass. The question learners ask (quite reasonably) is, "How close is close enough?"

My answer was, "I don't know–you're the scientist." Their response was, "Well, I don't know either–you're the computer scientist." Books like these aren't much help. Their advice boils down to, "Think carefully about your numerical methods," but that's like telling a graphic designer to think carefully about the user: a fair response is, "Thanks–now can you please tell me what to think?"

What I've realized from talking with people like Diane Kelly and Marian Petre is that scientific computing doesn't (yet) have the cultural norms for error bars that experimental sciences have. When I rolled balls down an inclined plane to measure the strength of the earth's gravity back in high school, my teacher thought I did (suspiciously) well to have a plus or minus of only 10%. A few years later, using more sophisticated gear and methods in a university engineering class, I wasn't done until my answers were within 1% of each other. The difference between the two was purely a matter of social expectations, and that's true across all science. (As the joke goes, particle physicists worry about significant digits in the mantissa, while astronomers worry about significant digits in the exponent, and economists are happy if they can get the sign right…)

The second problem is the breathtaking diversity of scientific code. Scientific research is highly specialized, which means that the tests scientists write are much less transferable or reusable than those found in banking, web development, and the like. The kinds of tests we would write for a clustering algorithm will be very different from those we'd write for a fluid dynamics simulation, which would in turn be different from those we would write for a program that flagged cancerous cells in microscope images or one that cleaned up economic data from the 1950s.

For example, Lorena Barba commented on an earlier version of this post by saying:

You reference in your post our lesson on the full (nonlinear) phugoid model…

If you notice there, and also in the earlier lesson on the simpler linear model, we introduce grid-convergence analysis–a methodical way of executing code verification in numerical computing. This is not common: hardly any beginner course in numerical methods will cover observed order of convergence in this way. I believe this is the right approach: we are emphasizing a technique that should be used in practice to show evidence that the code is computing a numerical solution that converges as expected with grid refinement.

That's another example of what makes Lorena's course great, but (a) the testing method isn't something that a microbiologist or economist would ever use, and (b) that notebook also includes this:

The order of convergence is p = 1.014
See how the observed order of convergence is close to 1? This means that the rate at which the grid differences decrease match the mesh-refinement ratio. We say that Euler's method is of first order, and this result is a consequence of that.

How far away from 1.0 would the order of convergence have to be in order for someone to suspect a bug in the code? 1.1? 1.5? 2.0? Or should 1.014 itself be regarded with suspicion? Any test, automated or otherwise, must answer that question, but those answers are going to vary from domain to domain as well.

In theory, we can solve this by writing different lessons for different communities. In practice, that requires more resources than we have, and we'd still have to decide what to do in a room containing economists, microbiologists, and cosmologists.

I believe we can teach software testing to scientists, but I also believe that we have some work to do before we can do it effectively enough for most of our learners to put it back in Software Carpentry's core. What we can do to bring that day closer is start amassing examples of tests from different domains that include explanations of why: why these tests, and why these tolerances? You can see my attempt at something like this here, but that example deliberately doesn't use floating point so that the question of error bars didn't arise.

So here's my challenge. I'd like you to write some unit tests for the advance function in the n-body benchmark game and then share those tests with us. I don't care what language you use (source is available in several), or which unit testing framework you pick. What I want to know is:

Why did you choose the tests you chose, i.e., what kinds of errors are those tests probing for?
How did you pick your margin of error?

You can send us your tests any way you want, and I will happily send Software Carpentry t-shirts to the first half-dozen people to do so.

My thanks to Lorena Barba, Matt Davis, Justin Kitzes, Ariel Rokem, Fernando Pérez for feedback on an earlier draft of this post.

Categories: software-carpentry