Building Filters

Posted 2008-02-15

I decided earlier this week that the time had come to convert the Software Carpentry notes to a wiki to make it easier for other people to contribute. My decision was motivated partly by thinking about converting DrProject to use Markdown syntax for its wiki, and partly by the realization that I’m not going to have time in the next ten months to fix all the typos people keep pointing out, add new content, bring the examples up to date with Python 3000, and so on.

The first step was to pick a wiki syntax. That was easy: there are Markdown processors for Perl, PHP, and Python, several wikis support them, and my hands are going to be learning those typing rules anyway. The second step was to convert the existing notes, which are marked up in a homegrown XML format. This seemed like a good candidate for a classic Unix read-process-print-repeat filter, and sure enough, a few hours later, I have something working. I took notes as I did it; I’m posting them here as a record of how a moderately experienced developer tackles a routine problem.

Copy fifteen lines of code from one of the filters I use to turn the .swc XML files into HTML; this gives me something that parses XML to create an xml.dom.minidom tree in Python.
Write a recursive function that takes an output stream and a DOM node as inputs, and writes a representation of the latter to the former. If the node is a TEXT node, print its content to the stream; if it's an ELEMENT, switch on the tag, then recurse on its children. If it's anything else, print a warning to standard error and halt.
Fill in that switch (which in Python is a chain of if/elif/elif/… statements). Initially, each branch's body is just 'pass'; the 'else' clause prints the tag's name with stars around it. After running the 34 .swc files through this a couple of times, I have branches to fill in for all the tags I'm using. (No, there isn't an up-to-date DTD…)
After typing the shell commands to loop over all the .swc files a couple of times, I double back and put them into a Makefile. I use a pattern rule to say that %.txt depends on ../lec/%.swc; I'm not embarrassed about hard-coding paths, because this tool is only going to be used in this context. I also define a 'clean' target that gets rid of all the generated .txt files and other shrapnel.
Start filling in the branches. Some are easy: the '<t>' (text) tag has no analog in Markdown, while '<b1>', '<b2>', and '<b3>' (bullets at different levels) are just appropriate levels of indentation plus a star. Then I hit '<em>' (emphasis), which requires a closing tag after the children. No problem: I define a list variable called 'stack', append the text of the closing tag(s), then print those items in reverse order after iterating over the children.
Next is cross-references. The .swc file has '<scref id="intro"/>', which in HTML is converted to '<a href="intro.html">Introduction</a>': the word "Introduction" is taken from a lookup table that's built by a preprocessor that scans all of the .swc files and archives things like page titles, bibliography citations, glossary terms, and so on. I could either modify my existing script to read all the .swc files at once, extract this information, then process them, or write a separate preprocessor. Since I already have a preprocessor that does 200% of what I need (i.e., everything I'll need for this conversion, plus more), I copy that and chop out the bits that I don't need. Note that I don't have to think about the format for this extra information: the preprocessor builds it as a dictionary of dictionaries, then prints that object to a file. The SWC-to-DOM program then uses 'eval' to load that data (which is a legal Python expression).
OK, now I have cross-references, glossary items, and bibliography citations; that just leaves inclusions. The .swc files use '<inc path="…">' to include code fragments, and '<tbl path="…">' to include tables. (I chose to do the former so that all my code examples would still be runnable from the command line; I can't remember why I chose to do the latter, but it was overkill.) Code files are just lines of text; that's easy. Table files are marked up with '<tbl>', '<row>', '<col>', and so on; after putting the XML reading code into a function (which I should have done off the bat), and adding a few more branches to the big switch statement, they're taken care of too.
I'm now generating text files that look fine to me. What will Markdown think of them? I add five lines to my Makefile to convert .txt files to .html using markdown.py, and… Oh. OK, the whitespace in the .txt files I'm generating is confusing Markdown. And my code fragments need to be indented. And I'd forgotten that Markdown doesn't directly support tables (they're an add-on). Mutter mutter fix fix fix… There. Half a dozen fixes to the SWC-to-Markdown script, and a little postprocessing to strip off extraneous newlines (it turns out to be easier to do this at the end than to keep track during translation of whether it needs to be done), and voila: the HTML is almost right. The few places where it isn't are things I'll take care of by hand, like double escaping of accented characters in people's names.
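
The core of the steps above (the recursive walk, the if/elif switch on tags, the 'stack' of closing text, and the cross-reference lookup) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the function names, the layout of the lookup table, the four-space indent per bullet level, and the Markdown link format are all made up for the example. It also swaps in ast.literal_eval for the plain 'eval' mentioned above, since it reads the same kind of printed dictionary without executing arbitrary code.

```python
import ast
import io
import sys
from xml.dom import minidom

# Hypothetical lookup table of the kind a preprocessor might print to a file.
# The real table's layout is not shown in the post; this shape is an assumption.
LOOKUP = {"intro": {"title": "Introduction"}}

# Assumed mapping from bullet tags to indentation depth.
BULLETS = {"b1": 0, "b2": 1, "b3": 2}


def load_lookup(text):
    """Rebuild the dictionary of dictionaries from its printed form."""
    return ast.literal_eval(text)


def convert(out, node, lookup):
    """Write a Markdown-ish rendering of a DOM node to the stream `out`."""
    if node.nodeType == node.TEXT_NODE:
        out.write(node.data)
        return
    if node.nodeType != node.ELEMENT_NODE:
        print("unexpected node type:", node.nodeType, file=sys.stderr)
        sys.exit(1)

    stack = []  # closing text, written in reverse order after the children
    tag = node.tagName
    if tag == "t":
        pass  # plain text wrapper: nothing to emit in Markdown
    elif tag in BULLETS:
        out.write("    " * BULLETS[tag] + "* ")
        stack.append("\n")
    elif tag == "em":
        out.write("*")
        stack.append("*")
    elif tag == "scref":
        target = node.getAttribute("id")
        out.write("[%s](%s.html)" % (lookup[target]["title"], target))
    else:
        out.write("**%s**" % tag)  # unhandled tag: flag it with stars

    for child in node.childNodes:
        convert(out, child, lookup)
    for closer in reversed(stack):
        out.write(closer)


def swc_to_markdown(xml_text, lookup):
    """Parse one XML document and convert everything under its root element."""
    doc = minidom.parseString(xml_text)
    out = io.StringIO()
    for child in doc.documentElement.childNodes:
        convert(out, child, lookup)
    return out.getvalue()
```

Calling swc_to_markdown on a small fragment containing a '<b1>' with an '<em>' and a '<scref id="intro"/>' inside it yields a starred bullet line with the emphasis wrapped in stars and the cross-reference turned into a link titled "Introduction". Any tag without a branch shows up in the output with stars around its name, which is how the missing branches get discovered.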

So what are the takeaways?

Real programming involves a lot of opportunistic bricolage (a fancy way of saying "re-using bits and pieces that are lying around, or can be torn out of wherever they are and re-purposed"). You can only do this effectively if you keep track of what you have, know your way around the standard libraries, and so on, but hey, if 15-year-old DJs can keep thousands of tracks at their fingertips for sampling, you ought to be able to as well.
I have no idea whether a read-process-print-repeat filter was the "best" way to solve this problem or not, and I don't care. I could immediately see how to fit my problem into that model, and I have enough practice writing such filters that I was confident I'd be able to deal with anything unexpected that came up. I could have done some up-front design, realized that I was going to have to deal with cross-references, and put together the tool that parses all of the files to extract link endpoints before doing anything else, but in this case, doing things in the "wrong" order probably didn't cost me any time. The more experienced you are, the more often you can work this way; remember, though, that experience comes from making mistakes…
My tool only solves the first 99% of the SWC-to-Markdown conversion problem. If I were going to release it to the world, I'd do the last 1%, and the X% after that (docs, an Egg for distribution, a page at the Cheese Shop, etc.). However, the Software Carpentry notes are the only .swc files in the world, so this is definitely the point of diminishing returns; the little bits that are left will be easy enough to fix up by hand.

Categories: programming, software-carpentry