Archive for February, 2006

Second Lecture on Object-Oriented Programming

February 21st, 2006

The second lecture on object-oriented programming is now on the web. This describes operator overloading and static methods, and includes the design patterns material that was in the old design lecture (which has been removed—the general consensus was that it didn’t work). As always, comments are welcome.

Software Carpentry

DemoCamp: Googling for People

February 21st, 2006

People talk a lot about on-line communities, but there are things that only the in-person kind can accomplish. Take last night’s DemoCamp 3, for example. Ninety-plus people crowded into TUCOWS’ offices in Toronto to watch half a dozen demos projected against a white-painted brick wall with a nail in it. The most important part for me, though, came afterward. Sean Dawson and I led off with DrProject, and were followed by:

The demos ranged from earnest to credible; there were plenty of questions after each, and the whole show ran just under two hours.

We adjourned to the pub afterwards, and that’s where the most important part of the evening (for me) got under way. Six of my former students were there, and with the exception of Michelle Levesque, I think it was the first time any of them had seen a bunch of people working a room:

“Hi, I’m with BubbleShare, which is like Flickr, but easy to use. How about you?”

“I work for Idee; we’re a visual search company—kind of like Google for images. We’re looking for someone to do QA.”

“Really? I think that guy over there does QA. Just met him; want me to introduce you?”

It’s a little intimidating the first time you—oh, who am I kidding? It’s a lot intimidating the first time you try to paddle around in that particular pool, so I wasn’t surprised that my former students huddled together and talked about World of Warcraft. But all around them, they could see people googling for other people. All around them, they could see people practicing the one talent that anyone who wants to change the world needs.

What are you selling? What do you want to buy? What do you want to do? What kind of person are you? Could we work together? Would we inspire each other, or push each other in new and profitable directions? From bazaars in ancient Sumeria to sleazy mob hangouts in St Louis with peanut shells on the floor and .38 caliber holes in the walls, the dance has stayed the same. I’m damned if I know how to teach it, but I’m grateful to the folks at TUCOWS for giving my younglings a chance to learn.

Uncategorized

DrProject: Switching to Kid

February 20th, 2006

Chris Lenz, Jason Montojo, and I began work on refactoring DrProject in early January. One of the first decisions we made was to replace the Clearsilver templating framework with Kid, an XML-based alternative. Now that the work is done, we’ve learned a few things about Kid that others might find useful.

Why did we abandon Clearsilver? First, its templates are not valid XML documents, which makes maintenance very difficult: if you have ever had to modify someone else’s Clearsilver template, you already know how hard it is. Second, Clearsilver is not Pythonic: before passing data into the template, you have to preprocess it into a pseudo-dictionary of strings, so your data is processed twice, once in the preparation phase and again when the template is rendered. Finally, since you cannot access Python functions and objects from within the template, you have to execute many UI-related functions in the controlling layer rather than in the template, which blurs the separation between controller and view.

After looking at a few alternatives, we settled on Kid as a replacement. At first glance, it seemed like a perfect solution: Kid templates are guaranteed to be well-formed XML, and you can pass Python data structures and objects to the template for use in the rendering stage.

Once we eventually finished porting the view layer to Kid (a non-trivial process which I will describe in an upcoming post), the end result was cleaner controlling code and cleaner templates, which will be significantly easier to maintain.

But Kid isn’t perfect (what is?). There are many problems and “gotchas”, which I have been documenting. Most of these issues are minor, and only ever catch a developer once. Rendering speed, however, is turning out to be a very significant problem. Simply put, Kid is slow. In my tests, the rendering phase of a single web request takes approximately 2-3 times longer than the processing phase, which includes many database seeks. You can see the difference by running this simple test:

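#!/usr/bin/python
# Compare the time spent preparing data with the time Kid spends rendering it.

import timing
import kid

# Processing phase: build 100 small XML fragments.
timing.start()
data = ['Number <em>%s</em>' % x for x in range(100)]
timing.finish()
process_time = timing.milli()

# Template that embeds each fragment via XML(...), i.e. as markup rather than as escaped text.
source = """
<html xmlns:py="http://purl.org/kid/ns#">
<head>
</head>
<body>
<table>
<tr py:for="x in data">
<td>${XML(x)}</td>
</tr>
</table>
</body>
</html>
"""

# Rendering phase: build the template from its source and serialize it to a string.
timing.start()
template = kid.Template(source=source, data=data)
content = template.serialize()
timing.finish()

# Both times are in milliseconds.
print "Processing time: %d, Rendering time: %d" % (process_time, timing.milli())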

which results in: Processing time: 0, Rendering time: 1759

This performance is almost shockingly poor. The problem appears to be a side-effect of guaranteeing the template is well-formed XML: when you remove the XML(...) fragment from the template, and just display x, the rendering time drops to 129 milliseconds.
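The `timing` module used in the test is an old Python 2 module that newer interpreters no longer ship. As a rough sketch of the same two-phase measurement using only the standard `time` module, something like the following should work. Note that `elapsed_ms`, `build_data`, and `render_stub` are names I made up for illustration, and `render_stub` is only a stand-in for the real Kid rendering call, so the numbers it produces say nothing about Kid itself.

```python
import time


def elapsed_ms(func, *args, **kwargs):
    """Call func once and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


def build_data():
    # Same "processing" step as the original test: 100 small XML fragments.
    return ['Number <em>%s</em>' % x for x in range(100)]


def render_stub(data):
    # Hypothetical stand-in for the rendering step; replace the body with
    # the real template call to measure an actual templating engine.
    return "\n".join("<tr><td>%s</td></tr>" % x for x in data)


data, process_ms = elapsed_ms(build_data)
content, render_ms = elapsed_ms(render_stub, data)
print("Processing time: %.1f ms, Rendering time: %.1f ms" % (process_ms, render_ms))
```

Timing each phase through one helper keeps the two measurements comparable, since both pay the same function-call overhead.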

There has recently been some talk on the Kid mailing list calling for an option to disable the “well-formed XML” check when embedding XML into a template. Hopefully for DrProject, this change gets pushed into Kid very soon. In the meantime, if you have experienced similar performance issues with Kid and have found a workaround, please email me.

DrProject

AAAS Annual Meeting 2006

February 20th, 2006

Wednesday, 11:10 p.m.: phone call from Air Can’tada saying that my Thursday morning flight to St Louis has been cancelled because of bad weather. Next available is 4:00 p.m. Friday afternoon—two and a half hours after my workshop is due to end. No, they can’t help me find an alternative carrier. Expedia can, though, and by midnight, I have a ticket on Delta, via Cincinnati.

Thursday, oh dark hundred: the cab’s tires crunch through eight centimeters of fresh snow on the way to the airport. We’re late getting off the ground, and even later leaving Cincinnati, but at least we’re airborne. Tornado warnings over St Louis, though, so after circling over a spinning mass of clouds with a lightning-filled depression in the middle for about an hour, we head for Evansville, Indiana. I finally get to my hotel at 7:30 p.m., fifteen hours after starting my day.

Friday: the Annual Meeting of the AAAS isn’t really a scientific conference—it’s a place for science advocates to gather and plot, stirred together with an extended series of press cuddles dolled up as seminars. (This is not a criticism: if the cosmetics industry, fast food vendors, and the military-industrial complex are smart enough to plot and cuddle, scientists should be too.) Some of the talks (particularly the medical ones) are Mojave-dry, but others are pretty cool:

  • “The Demography of Black Holes” (with pictures!)
  • “In Search of Genes that Influence Language” (without, but still interesting)
  • “New Approaches to Paleontological Investigation” (use a CT scan of a fossil to drive a 3D lithography machine, and you can photocopy dinosaur bones at sub-millimeter resolution—oh, and check out www.digimorph.org)

Friday noon: Andy Lumsdaine and Peter Gottschling arrive from Indiana University for our workshop on Essential Software Development Skills for Research Scientists. We covered the usual topics:

  • Computational scientists don’t pay as much attention to quality and reproducibility as experimental scientists (in fact, many of them don’t pay any attention to these issues).
  • Most scientific programmers are woefully inefficient compared to their industrial counterparts, largely because no one has ever taught them basic software engineering skills.
  • A handful of tools and techniques can reliably improve scientific programmers’ productivity by 20-25%: version control, test-driven development, continuous integration, issue tracking, use of a debugger, enforcing style, traceability, and behind them all, automation.
  • There are many personal and institutional obstacles (ranging from “I have a degree in physics, so programming must be easy” to “journals and tenure committees don’t care, so I can’t afford to”).
  • We either fix this ourselves, proactively, or someone else will legislate bad rules in the wake of a very public disaster.

Randy Heiland’s picture shows the three of us on stage; there weren’t as many lab managers or funding directors as I’d hoped for, but lots of good questions and discussion.

Friday evening: a recap of the 2005 Ig Nobel Prize awards for science that cannot, or should not, be repeated, including:

  • Physics: John Mainstone and the late Thomas Parnell, for patiently conducting an experiment that began in the year 1927, in which a glob of congealed black tar has been slowly, slowly dripping through a funnel, at a rate of approximately one drop every nine years.
  • Medicine: Gregg A. Miller, for inventing Neuticles—artificial replacement testicles for dogs, which are available in three sizes, and three degrees of firmness.
  • Literature: the Internet entrepreneurs of Nigeria, for creating and then using e-mail to distribute a bold series of short stories, thus introducing millions of readers to a cast of rich characters, including General Sani Abacha, Mrs. Mariam Sanni Abacha, Barrister Jon A Mbeki Esq., and others.
  • Peace: Claire Rind and Peter Simmons, for electrically monitoring the activity of a brain cell in a locust while that locust was watching selected highlights from the movie Star Wars.
  • Economics: Gauri Nanda, for inventing an alarm clock that runs away and hides, repeatedly, thus ensuring that people DO get out of bed, and thus theoretically adding many productive hours to the workday.
  • Biology: Benjamin Smith and others, for painstakingly smelling and cataloging the peculiar odors produced by 131 different species of frogs when the frogs were feeling stressed.
  • Fluid Dynamics: Victor Benno Meyer-Rochow and Jozsef Gal, for using basic principles of physics to calculate the pressure that builds up inside a penguin, as detailed in their report “Pressures Produced When Penguins Pooh—Calculations on Avian Defaecation.”

Saturday: I smorgasborded the seminars. The best was Latanya Sweeney’s talk about information privacy; she was kind enough to chat with me for 45 minutes afterward about undergraduate curriculum reform and the obstacles to it (did you know there isn’t an undergrad course on software engineering at CMU?). The worst was an unrelated seminar on “Information Security in Public Databases”. Aaron Emigh, of Radix Labs, did a great job of explaining the issues. Kevin Fu, of UMass, was also engaging, but Mike Szydlo (RSA) gave us a technical sales talk that I’m sure went over the heads of most of the audience.

And then there was Markus Jakobsson, of Indiana University. He’s the guy who conducted phishing attacks on IU students last year, without their prior consent (informed or otherwise), in order to get material for a paper. I think this was irresponsible: one of the obstacles to better security is that the public doesn’t trust us (the professionals) to look out for their interests. Some of that is Hollywood’s fault (how many positive portrayals of computer geeks have you seen recently? and how many portrayals of what hackers can and can’t do are half as accurate as the average episode of a medical soap opera?), but conducting experiments on people who don’t know they’re being experimented on sure doesn’t help.

One telling moment came after the presentations, when Jakobsson asked the audience which of two “solutions” they thought would work better: educating the public, or better technology. I pointed out that what he was really offering users was a choice between paying more (hours) or paying more (dollars, to technology vendors). I then asked why he hadn’t mentioned the third option, which is to shift the financial pain to the vendors (which is what brought the problem of credit card fraud under control). He dodged, but Aaron Emigh didn’t, so I’m going to see if I can get Aaron’s slide set, and post it here.

Saturday afternoon: discover that there are no bookstores in downtown St Louis. I don’t mean, “there are no good ones”. I mean, “there are no bookstores in the downtown core of St Louis, at all”. The nearest (according to both hotel staff and conference organizers) is a 15-minute drive away—in another county.

Sunday: up at quarter to five to get to the airport for a 7:15 flight that didn’t take off until 8:45, which meant that I missed my 10:59 in Cincinnati, and had to get the 1:10 instead, so I didn’t get home until 3:20. Very happy to walk through the door; very happy to have someone else happy that I was walking through the door.

Software Carpentry

Reminder: DemoCamp3 in Toronto

February 16th, 2006

From Joey deVilla, a reminder about DemoCamp3 in Toronto on Monday. (And from me, fatigue: Air Canada called at quarter after eleven last night to tell me that my flight to St Louis this morning was canceled. I found space with Delta, but thanks to a few tornadoes, it took them three tries to get me here. The Software Carpentry workshop is tomorrow — wish me luck.)

DemoCamp

Two Links via the Accordion Guy

February 15th, 2006

Entry-Level Requirements Engineering Revisited

February 15th, 2006

Try googling for “open source” “requirements engineering” or “open source” “requirements management”. Lots of links, but nothing that leads to a mature (or even adolescent) open source requirements engineering tool that would help me keep track of:

  1. what I’m supposed to be building;
  2. where that requirement came from (i.e., who I have to talk to if I want to change it, or to get more information);
  3. whether the requirement has actually been implemented; and
  4. whether that implementation has actually been tested.
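To make the four items concrete, here is a minimal sketch of the record such a tool might keep for each requirement. It is only an illustration: the `Requirement` class, its field names, and the sample file names are all hypothetical, not taken from any existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Requirement:
    summary: str                                              # 1. what is supposed to be built
    source: str                                               # 2. who to talk to about it
    implemented_by: List[str] = field(default_factory=list)   # 3. code that implements it
    tested_by: List[str] = field(default_factory=list)        # 4. tests that exercise it

    def status(self):
        """Summarize how far this requirement has progressed."""
        if not self.implemented_by:
            return "not implemented"
        if not self.tested_by:
            return "implemented, untested"
        return "implemented and tested"


# Hypothetical sample data.
reqs = [
    Requirement("Students can submit assignments", source="prof"),
    Requirement("TAs can record marks", source="prof",
                implemented_by=["marks.py"], tested_by=["test_marks.py"]),
]
for r in reqs:
    print("%-35s %s" % (r.summary, r.status()))
```

Even a record this small answers all four questions; the hard part, as discussed below, is getting anyone to keep it up to date.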

There are lots of commercial tools in this space, including:

but most of them are just glorified to-do list managers. Some integrate with Microsoft Office tools, so that users can edit requirements documents in Word or Excel, while others integrate with CASE tools, but for the most part, they all assume that someone, somewhere, is going to type in a whole bunch of itemized, organized, point-form requirements, and then update them regularly as the project progresses.

Which, in my experience, just doesn’t happen—at least, not in the domains I’m most interested in:

            School     Open Source
Customer    prof       nerd
Developer   student    nerd

XP’s response is to abandon long-range requirements management in favor of short-range user stories. If your customer’s marketing department doesn’t need to know what they’re going to have to sell next year, and if they’re willing to pay for a lot of duplicated effort (sorry, refactoring), that’s great, but I’ve never worked under those conditions.

On the other hand, I have seen projects hum along for several years, hitting deadline after deadline, once sensible requirements management practices were put in place. However, that only happened after the group in question had been through a couple of death marches. This leaves me with a conundrum: how to convince people (particularly students) that RE will pay off, without making them jog a few laps of Hell?

The option I’m most interested in is to lower the entry cost of RE. All of the tools described above (with the exception of GatherSpace) have a hefty cover charge: you have to invest a lot in them before you get anything out. Students working on three-week (or even term-long) assignments won’t ever reach those tools’ payoff points; if we say, “You have to use this to get a grade,” they’ll probably come away thinking even less highly of RE.

So, what does an entry-level RE tool look like? Something you can learn in a one-hour tutorial (or less), that will make your life easier the second time you use it? Last summer, Bin Liang explored the possibility of adding a two-pane display to DrProject that would let students connect requirements (created by the prof, in the usual point-form way) to JUnit unit tests, so that they could see which parts of the assignment they’d actually completed.

It was a nice idea, but in the end, we decided it wouldn’t be compelling: profs would have to put a lot more time into creating their assignments, while students wouldn’t see much benefit (anyone can keep track of half a dozen requirements in their head).

Bin then went on to explore something more promising, something that wouldn’t require any extra work from either profs or students. Take an assignment, and throw away the stop words (like “the” and “or”) to create a set of keywords with locations. Now take a bunch of JUnit tests, and break the names of the classes, methods, and variables they contain into words (by splitting on underscores, or on CamelCaseBoundaries) to get another set of located keywords. What happens when you correlate the two?
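The two extraction steps can be sketched in a few lines of Python. This is my own illustration, not Bin’s code: the function names and the tiny `STOP_WORDS` set are assumptions, and a real tool would want a much longer stop-word list and a proper Java parser to pull identifiers out of the tests.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "or", "of", "to", "in", "is", "that"}


def words_with_locations(text):
    """Return (keyword, line_number) pairs for prose, skipping stop words."""
    located = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() not in STOP_WORDS:
                located.append((word.lower(), line_number))
    return located


def split_identifier(name):
    """Break an identifier into lowercase words on underscores and CamelCase boundaries."""
    words = []
    for chunk in name.split("_"):
        # Match, in order: a run of capitals (an acronym), a capitalized or
        # lowercase word, or a run of digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [w.lower() for w in words]
```

For example, `split_identifier("testEmptyStackPop")` yields `["test", "empty", "stack", "pop"]`, which can then be compared against the keywords pulled from the assignment text.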

As Jane Huffman Hayes and her colleagues found out in a slightly different context, the results are actually pretty good: standard information retrieval algorithms will match code to requirements and vice versa. Not as well as a human being would, maybe, but certainly well enough to be worth exploring further.
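One standard information retrieval technique is TF-IDF weighting with cosine similarity. I’m not claiming it is the exact algorithm used in either piece of work mentioned here; the sketch below just shows how little code it takes to rank tests against requirements once both have been reduced to keyword lists. The function names and sample data are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document key to a {word: weight} dict (term count times smoothed idf)."""
    n = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))
    vectors = {}
    for key, words in documents.items():
        counts = Counter(words)
        vectors[key] = {w: c * (math.log(n / doc_freq[w]) + 1.0)
                        for w, c in counts.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[w] for w, weight in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(requirements, tests):
    """For each requirement, list the tests that share keywords, most similar first."""
    corpus = {("req", k): v for k, v in requirements.items()}
    corpus.update({("test", k): v for k, v in tests.items()})
    vectors = tfidf_vectors(corpus)
    ranking = {}
    for req in requirements:
        scored = [(cosine(vectors[("req", req)], vectors[("test", t)]), t)
                  for t in tests]
        scored.sort(key=lambda pair: -pair[0])
        ranking[req] = [t for score, t in scored if score > 0.0]
    return ranking


# Hypothetical keyword lists for two requirements and two JUnit tests.
requirements = {"A": ["push", "item", "stack"],
                "B": ["pop", "empty", "stack", "error"]}
tests = {"testPushItem": ["test", "push", "item"],
         "testPopEmptyStack": ["test", "pop", "empty", "stack"]}
ranking = rank_matches(requirements, tests)
```

The idf factor is what keeps a word like “stack”, which appears almost everywhere, from dominating the match.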

There are lots of directions we could go with this, and other people are already blazing the trail. Requirements Assistant, for example, looks for ambiguous or contradictory phrases in a requirements document; I could see combining the two approaches to try to find places where the code doesn’t implement the requirements. But in the near term, here’s a proposal:

On the left, we have the assignment handed out by the prof (format to be determined).

On the right, we have the student’s code.

In the middle, we have a tool which matches assignment (requirements) to code, and which highlights bits of code that no part of the assignment maps to. TAs use this when marking: they click on Part A, B, or C of the assignment spec, and it shows them what (if anything) they should be marking. They *only* mark the things that the tool finds: if they click on Part C, and the tool says, “I can’t find corresponding code,” they give the student 0 for that part of the assignment.
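As a very rough sketch of that middle piece, assume each part of the assignment and each unit of code has already been reduced to a list of keywords. The matching rule here (at least `min_shared` keywords in common) is a placeholder I picked for simplicity, and all of the function names and sample data are hypothetical.

```python
def match_parts_to_code(parts, code_units, min_shared=2):
    """Link each assignment part to the code units sharing enough keywords with it."""
    links = {}
    for part, part_words in parts.items():
        links[part] = sorted(
            unit for unit, unit_words in code_units.items()
            if len(set(part_words) & set(unit_words)) >= min_shared
        )
    return links


def unmapped_code(links, code_units):
    """Code units that no part of the assignment maps to (candidates for highlighting)."""
    mapped = {unit for units in links.values() for unit in units}
    return sorted(set(code_units) - mapped)


def ta_view(links, part):
    """What a TA sees after clicking on one part of the assignment spec."""
    units = links.get(part, [])
    if not units:
        return "I can't find corresponding code"
    return "Mark: " + ", ".join(units)


# Hypothetical assignment parts and student code, as keyword lists.
parts = {"Part A": ["push", "item", "stack"],
         "Part B": ["pop", "empty", "stack"],
         "Part C": ["print", "stack", "contents"]}
code_units = {"Stack.push": ["stack", "push", "item"],
              "Stack.pop": ["stack", "pop", "empty", "check"],
              "Stack.debugDump": ["stack", "debug", "dump"]}

links = match_parts_to_code(parts, code_units)
for part in sorted(parts):
    print(part, "->", ta_view(links, part))
print("Unmapped code:", unmapped_code(links, code_units))
```

Students could run exactly the same functions before submitting, which is the whole point: they see what the TA will see.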

When they start the assignment (or course), students are shown the tool, and told, “TAs will use this when grading. You can run it too, before you submit your work, in order to see what the TA will see. Please note that if the TA clicks on a bit of the assignment spec, and the tool can’t find any matching code, they’ll give you zero for that part of the assignment, since clearly it’s unfair to ask them to waste time hunting around in your code for something that you could have made plain and clear.”

Students now have an incentive (a) to learn how to drive the tool, and (b) to add whatever extra information the tool requires (and/or to structure their code the way the tool wants). The carrot is that they have more insight into the grading process: as with submitting code to run against the prof’s test suite, and getting a preliminary score back, they can *see* which bits of their code the TAs are going to be marking to satisfy each bit of the assignment spec.

So, what do you think? Would you have felt put upon as a student if someone deployed this in a class? Do you think you would have learned something from it? Are there other entry-level requirements engineering tools that might have fit your needs better? I’d be very interested in hearing your ideas.

Research

Data Lineage

February 14th, 2006

The January 2005 issue of ACM Computing Surveys (vol. 37, no. 1, if you prefer) has a good review by Rajendra Bose and James Frew titled “Lineage Retrieval for Scientific Data Processing: A Survey”. In it, they look at what scientists do to keep track of what data they have, where it came from, and what has been done to it. Some of my students last term were worrying about the same issues in the context of HL7 medical data. It seems like an ideal place for software engineers to apply their skills: I’d be interested in hearing from people who have home-grown or small-scale systems I could use as a starting point for a lecture in Software Carpentry.

Research, Software Carpentry

Lecture on Binary Data

February 14th, 2006

The Software Carpentry lecture on binary data is now up on the web. The content of this one has been fairly stable for a while, but that just means that all the bugs will be in the details—comments and corrections are greatly appreciated.

Software Carpentry

Reminder: Toronto DemoCamp 3 is Next Monday

February 13th, 2006

Toronto’s DemoCamp 3 is next Monday (Feb 20). We’ll be showing off DrProject; hope to see you all there.

DemoCamp