I've been thinking some more about what the foundation and core of Software Carpentry actually are (and not just because Jon Pipitone keeps pestering me to do so). My last attempt had a foundation of seven principles and a dozen topics in the core. I think I can slim that down even further; in fact, I think three big principles form the foundation of computational thinking:
- It's all just data, whose meaning depends on interpretation. This subsumes the notions that programs are a kind of data (which is the basis of things as diverse as functional programming and version control), and that we should separate models from views (because the most efficient ways for people and computers to interpret data are different). It doesn't really include the distinction of copy vs. reference, but I'm going to lump it in here because that idea doesn't seem big enough to deserve a heading of its own.
- Programming is a human activity. The only way to build large programs (or even small ones) is to create, compose, and re-use abstractions, because our brains can only understand a few things at a time. Similarly, good technique (specifically version control, testing, task automation, and some rules for collaborating, be they agile or sturdy) is necessary because everyone is tired, bored, or out of their depth at least once in a while.
- Better algorithms are better than better hardware. Computational complexity determines what's doable and what isn't, and no aspect of program performance makes sense without some understanding of it.
I also think we can reduce the core topics to just nine, though I can already hear protests from the back of the room about some of the omissions. I got this list by asking, "What's the minimum I think a graduate student needs to know to contribute to the computational work in a typical lab?" My answer is:
- The Unix shell
- Includes: basic commands (from grep to find); files and directories; the pipe-and-filter model; shell scripts.
- Because: it's still the workhorse of scientific computing (and is experiencing a resurgence as cloud computing becomes more popular).
- Illustrates: "lots of little pieces loosely joined" is a good way to introduce modularity and tool-based computing; it lets us talk about the human time vs. machine time tradeoff.
- Version control
- Includes: update/edit/commit; merge (with rollback as a special case).
- Because: it's a cornerstone technique for collaboration and for keeping track of what was done when.
- Illustrates: the idea of metadata; programming as a human activity (the hour-long red-green-refactor-commit cycle).
- Omissions: branching; distributed version control.
- The common core of programming
- Includes: variables; loops; conditionals; lists; functions; libraries; memory model (aliasing).
- Because: we can't teach validation, associative data structures, or program design without this common core.
- Illustrates: programming as a human activity (programs must be readable, testable, etc.).
- Omissions: object-oriented programming; matrix programming.
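The memory-model point in this core (aliasing) is worth a concrete illustration; here is a minimal Python sketch of the kind of example I have in mind:

```python
# Two names for the same list: changes made through one name
# are visible through the other, because no copy was made.
first = [1, 2, 3]
second = first          # aliasing: both names point at one list
second.append(4)
print(first)            # [1, 2, 3, 4]

# A slice creates a genuine copy, so the original is unaffected.
third = first[:]
third.append(5)
print(first)            # still [1, 2, 3, 4]
print(third)            # [1, 2, 3, 4, 5]
```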
- Validation
- Includes: structured unit tests; test-driven development; defensive programming; error handling; data validation.
- Because: defense in depth is key to building large programs, and trustworthy programs of any scale.
- Illustrates: trustworthy programs come from good technique.
- Omissions: testing floating-point code (since we don't really know how to).
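The unit-testing and defensive-programming material above can be sketched in a few lines of Python; this is an illustrative example of the style, not a lesson excerpt:

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    # Defensive programming: fail early, with a clear message.
    assert len(values) > 0, "average() requires at least one value"
    return sum(values) / len(values)

# Structured unit tests: each check states its expected answer up front.
def test_average():
    assert average([1.0, 2.0, 3.0]) == 2.0
    assert average([5.0]) == 5.0
    try:
        average([])                      # must trip the defensive check
    except AssertionError:
        pass
    else:
        raise RuntimeError("expected an AssertionError for empty input")

test_average()
print("all tests passed")
```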
- Program construction
- Includes: piecewise refinement; refactoring; design for test; first-class functions; using a debugger.
- Because: knowing the syntax of a programming language doesn't tell you how to create a program.
- Illustrates: creating and composing abstractions; interface vs. implementation.
- Omissions: structured documentation.
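First-class functions and design for test go hand in hand; a small sketch (the function name and data are mine, for illustration only):

```python
def apply_to_lines(lines, transform):
    """Apply a transform function to each line of text.
    Taking the function as a parameter (a first-class function)
    means the logic can be tested with in-memory data, without
    touching the filesystem."""
    return [transform(line) for line in lines]

# Design for test: exercise the logic with a plain list...
cleaned = apply_to_lines(["  alpha \n", "beta\n"], str.strip)
print(cleaned)   # ['alpha', 'beta']

# ...and swap in a different transform without changing the function.
shouted = apply_to_lines(["alpha\n"], lambda line: line.strip().upper())
print(shouted)   # ['ALPHA']
```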
- Associative data structures
- Includes: sets (as a prelude); dictionaries; why keys must be immutable.
- Because: useful in so many places.
- Illustrates: how the right algorithms and data structures make programs more efficient.
- Omissions: implementation details.
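Both halves of this topic fit in a handful of lines of Python; the atom-counting data is an invented example:

```python
# Dictionaries: count how often each atom appears in a list.
atoms = ["Na", "Cl", "Na", "H", "Na"]
counts = {}
for atom in atoms:
    counts[atom] = counts.get(atom, 0) + 1
print(counts)   # {'Na': 3, 'Cl': 1, 'H': 1}

# Keys must be immutable: a tuple can be a key, a list cannot.
positions = {(0, 0): "origin"}
try:
    positions[[1, 2]] = "oops"
except TypeError as err:
    print("lists cannot be keys:", err)
```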
- Databases
- Includes: select; sort; filter; aggregate; null; join; accessing a database from a program.
- Because: useful in many contexts.
- Illustrates: separation of models and views; a different model of computation.
- Omissions: sub-queries; object-relational mapping; database design.
- Note: we could illustrate many of the same ideas with spreadsheets, but they're not as easy to connect to programs.
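"Accessing a database from a program" is easy to make concrete with Python's standard-library sqlite3 module; the table and data here are invented for illustration:

```python
import sqlite3

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE readings (site TEXT, value REAL)")
cursor.executemany("INSERT INTO readings VALUES (?, ?)",
                   [("A", 1.5), ("B", 2.5), ("A", 3.5)])

# Select, filter, and aggregate in a single query.
cursor.execute("SELECT site, SUM(value) FROM readings "
               "WHERE value > 1.0 GROUP BY site ORDER BY site")
rows = cursor.fetchall()
for site, total in rows:
    print(site, total)   # A 5.0, then B 2.5
connection.close()
```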
- Development methodologies
- Includes: agile practices (the usual Scrum+XP mix); sturdy (plan-driven) lifecycles.
- Because: ties many other lessons together.
- Illustrates: good technique makes good programs.
- Omissions: code review.
If we use a two-day workshop to start, and follow up over six weeks with one lesson per week, I think we can cover:
| # | Topic | Two-day workshop | Six-week follow-up |
|---|-------|------------------|--------------------|
| 1. | The Unix shell | files and directories | … |
| 2. | Version control | update/edit/commit; merge | rollback |
| 3. | Core programming | all of it (but see below) | not needed (but see below) |
| 4. | Validation | unit tests; TDD | defensive programming; error handling; data validation |
| 5. | Program construction | one extended example; one demo of a debugger | more examples; design for test; … |
| 6. | Associative data structures | none | everything |
| 8. | Development methodologies | overview of agile | sturdy (plan-driven) lifecycle; evidence-based software engineering |
Topic #3, core programming, is the hardest to manage. If people have programmed in Python before, it can be a quick review (or omitted altogether). If they've programmed in some other interactive language, it can also be covered pretty quickly, but if they've never programmed before, or took one freshman course ten years ago, there's no way to teach them enough to make a difference in half a day. Even if there were, the other learners would undoubtedly be bored. The only solutions I can see are to restrict participation to people who can already do a simple exercise in some language, or to run one day of pre-bootcamp training for non- or weak programmers. Neither option excites me...
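For concreteness, the kind of screening exercise I have in mind is something anyone who has programmed before should be able to write, in some language, in a few minutes (this particular task is my invention, not a settled policy):

```python
# Candidate screening exercise: sum the values greater than a threshold.
def total_above(values, threshold):
    """Return the sum of the values strictly greater than threshold."""
    total = 0
    for v in values:
        if v > threshold:
            total += v
    return total

print(total_above([1, 5, 2, 8], 3))   # 13
```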
Coming back to content, this plan means that we'll leave out a lot of useful things:
- Spreadsheets: lots of scientists use spreadsheets badly, and while we'd like to show them how to use spreadsheets well, the only one they actually use, Excel, isn't open source or cross-platform, and it's much harder to build programs around spreadsheets than around databases.
- Make: is very hard to motivate unless people are working with compiled languages—we've tried showing people how to build data pipelines using Make, but it's too clumsy to be compelling. Plus, Make's syntax makes a hard problem worse...
- Systems programming: knowing how to walk directory trees and/or run sub-processes is useful, but we think people can pick these up on their own once they've mastered the core.
- Matrix programming: really important to some people, irrelevant to others, and the people it's important to will probably have seen the ideas in something like MATLAB before we get them.
- Multimedia programming (images, audio, and video): people can learn basic image manipulation on their own; audio and video are harder, mostly due to a lack of documentation, but they aren't important enough to enough people to belong in our core.
- Regular expressions: are a great way to illustrate the idea that programs are data, and are very useful, but everything in the core seems more important, and it'll be hard enough to get through all that in the time we have. This is probably the one I most regret taking out...
- HTML/XML: there are lots of excellent tutorials on writing HTML, and while XML processing is a good way to introduce recursion (and, if XPath is included, to talk about programs as data once again), I believe once again that it's not important enough to displace any of the material in the core.
- Object-oriented programming: is probably the omission that raises the most eyebrows. We can introduce it fairly naturally when talking about design for test (more specifically, about interface vs. implementation), but in practice, most people get along fine using lists, dictionaries, and the classes that come with the standard library without creating new classes of their own. Plus, showing people how to do OOP properly takes a lot more time than just showing them how to declare a class and give it methods.
- Desktop GUIs: an excellent way to introduce reactive (event-driven) programming and program frameworks, but is less important than it was ten years ago (most people would rather have a web interface these days).
- Web programming: the only thing we can teach people in the time we have is how to create security vulnerabilities.
- Security: the principles are easy to teach, but translating them into practice requires more knowledge (especially of things like web programming) than we can assume our learners have.
- Visualization: everybody wants it, but nobody can agree what it means. Should we show people how to use a specific library to create 3D vector flows? Or the principles of visual design so that they can make nicer 2D charts? And no matter what we teach, will they actually learn enough to make a difference?
- Performance and parallelism: the most important lesson, which is in the core, is that the right data structures and algorithms can speed programs up more than any amount of code tuning. Everything after that is either inextricably tied to the specifics of a particular language implementation (performance tuning), or offers no low-hanging fruit (parallelism). The one exception is job-level parallelism, which could be included in the material on the Unix shell if an appropriate cross-platform tool could be found.
- C/C++, Fortran, C#, or Java: these would mostly serve to introduce fixed typing and compilation, but those are relatively low-priority topics.
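The point above about data structures beating code tuning can be shown in a few lines: replacing list membership tests with a set turns an O(n) scan into a near-constant-time hash lookup. A rough sketch in Python (the sizes are arbitrary):

```python
import time

# Membership in a list scans every element; a set hashes the probe once.
items = list(range(5000))
as_set = set(items)
probes = range(0, 10000, 2)        # half the probes hit, half miss

start = time.perf_counter()
hits_list = sum(1 for p in probes if p in items)   # O(n) per probe
list_time = time.perf_counter() - start

start = time.perf_counter()
hits_set = sum(1 for p in probes if p in as_set)   # ~O(1) per probe
set_time = time.perf_counter() - start

print(hits_list == hits_set)       # the answers agree...
print(set_time < list_time)        # ...but the set is far faster
```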
We're going to start implementing this plan (or some derivative of it) at the beginning of February, to be ready for workshops starting at the end of that month. We'd welcome feedback; in particular, have we taken something out of the core that you think is more important than something that's in, and that could be taught in the time that's actually available? If you have thoughts, please let us know.
This post originally appeared in the Software Carpentry blog.