Archive

Archive for October, 2006

Computational Result Retracted

October 31st, 2006
Comments Off

From the latest Nature (Vol 443, No 7114, p 1013):

When a new, independent code is used for the calculations on which the conclusions of this Letter were based, the results reported for the evolution of obliquity cannot be reproduced. This code was written in the inertial frame and is more reliable than the one used in the Letter. In most runs, the obliquities can change by only a few degrees and attain large values in only a very few cases. In addition, the obliquity variation shown in the Supplementary Information, although correct, originates from changes to orbital inclination of the planet, and close encounters are not effective in causing large obliquities.

In other words, the code was flaky, so the results we published were wrong. Kudos to the author for trying to verify his result with another program—I’m sure a lot of computational scientists would be very embarrassed if they had to do the same. But I wonder: why does he believe the “new code” is more reliable? Surely not just because it uses a more sophisticated method, or gives a less surprising answer…

(Thanks to Andrew Straw for the pointer.)

Software Carpentry

Jim Waldo: “On System Design”

October 30th, 2006
Comments Off

There is an idealized view of academic research in which that research takes greater risks than industry, plans for the longer term, and is less concerned with the commercial success of a research effort than in the intellectual content of the research.  On this view, academic research can take a longer view than industrial research and development, and can take on higher-risk questions since even negative results can add to the base of knowledge that is the goal of academia.  When a research program does pan out, the results can be transferred to industry for further development, and the academic researcher can turn to the next big question.  Along the way, graduate students are trained in methods of research and techniques of system design, and when they are done they can either join the industrial world or return to academia to continue long-term research and the training of the next generation of graduate students.

Those believe this will also clap for Tinkerbell.

— Jim Waldo, “On System Design”, OOPSLA’06

Research, Teaching

Build a Better Voting Machine

October 30th, 2006
Comments Off

I don’t bother with Wired much any more, but this piece titled “Build a Better Voting Machine” is great.  Berkeley’s David Wagner and Princeton’s Ed Felten (who used to write computer chess programs, many years ago) have looked at what’s wrong with today’s machines, and described a better one.  I’m going to recommend it to my Software Architecture class…

Uncategorized

DrProject Internals: Parting Notes on the Wiki

October 30th, 2006

I promised in the last article to move on to DrProject‘s ticketing system, but there are still a couple of issues around its wiki that need further description. The first is how wiki text is transformed into HTML; the second is why this is harder to do in batch mode than you’d think.

Ward Cunningham created the first wiki in 1994-95 so that people could easily edit hypertext over the web. The only input widget he trust browsers to support was the text input box. Since editing HTML tags by hand is tedious and error-prone, he adopted and modified the notational conventions that people were using for email and other plain-text documents:

== Level 2 ==

Level 2

''emphasis'' emphasis
ThirdBit ThirdBit
[http://www.third-bit.com home] home

If wikis had been invented later, these text notations might not have been adopted, since it’s now straightforward to nest a WYSIWYG HTML editor written in Javascript in a web page. On the other hand, there are a lot of out-of-browser cases where systems like DrProject want users to be able to create hyperlinked text, such as comments on check-ins to version control and the bodies of email messages, so perhaps ASCII styling would have arisen after all.

Wiki markup simplifies input, but creates a new problem: somehow, pages have to be parsed, then transformed into HTML for display. The parser that DrProject inherited from Trac used a pile of regular expressions to match and extract bits of text. This worked well enough, but was difficult to extend—every time we tried to add a new feature, we found our new regexp conflicted with some of the existing ones, or made some piece of formatting ambiguous.

A graduate student named Liam Stewart (now with Idée) therefore wrote an entirely new parser using Greg Ewing‘s Plex toolkit. Like yacc and (many) other parser generators, Plex takes a grammar specification as input, and compiles it into executable code that will parse the language that grammar specifies. Developers can embed instructions in the grammar to create a parse tree as a side effect of analyzing input. Here’s a bit of our grammar (Bol means “beginning of line”, Eol means “end of line”, and—well, the details aren’t important):

newline = Str("\n", "\r\n")
h1 = Alt(Bol + Rep(Any(" \t")) + Str("="),
Str("=") + Rep(Any(" \t")) + Alt(Eol,newline))
h2 = Alt(Bol + Rep(Any(" \t")) + Str("=="),
Str("==") + Rep(Any(" \t")) + Alt(Eol,newline))
hr = Bol + Str("----") + Rep(Str("-"))
br = re('\[\[BR\]\]')
ital = re("''")
bold = re("'''")
boldital = re("'''''")
under = re("__")
...
rules = [
(h1, (HEADING, H1)),
(h2, (HEADING, H2)),
(hr, (HR, None)),
(ital, (STYLE, ITALIC)),
(bold, (STYLE, BOLD)),
(boldital, (STYLE, BOLDITALIC)),
(under, (STYLE, UNDERLINE)),
...
]

The entire parser came in at 459 lines of Python, including blank lines and comments. We were pretty pleased—until we noticed that DrProject had slowed down a lot after we introduced it. An undergraduate student named Billy Chun looked into the problem over the summer, and discovered that Plex was re-generating the parser every time it was called. More specifically, Plex was spending 1.9 seconds translating a nondeterministic finite state automaton (NFA) into its deterministic equivalent (a DFA) each time DrProject rendered a wiki page.

This isn’t a bug in Plex, but rather a misconception on our part. Yacc and similar tools produce C code that can be compiled into an application; we therefore assumed that Plex was doing the same, i.e., that the parser was being produced once and stored somewhere for re-use. When that turned out not to be the case, we were stymied: we went as far as trying to serialize the byte codes that Plex generated before deciding that we were creating more complexity than we had eliminated by switching to a parser generator in the first place.

Our eventual “solution” to this problem was to let another change to the system—the switch from pure CGI to SCGI—take care of it for us. While pure CGI forks a fresh Python interpreter to handle each HTTP request, SCGI relies on a single interpreter running in parallel with the web server. That long-lived interpreter only needs to generate the parser once; the second and subsequent times it has to render a wiki page, it can re-use the parser it generated the first time.

As Billy was starting his investigation of DrProject‘s performance problems, two other students named Apple Viriyakattiyaporn and David Scannell were ramping up as well. We’d hired them to spend the summer doing development on the system; as a warm-up exercise, I asked them to write a standalone wiki formatter. The idea was to have a command-line tool that would transform the text of a wiki page into HTML, so that people could check their formatting when editing offline, batch process pages for printing, and so on.

It turned out to be a harder job than I’d expected. Some of the difficulty was simply tangled code: for example, the wiki processor expected an HTTP request object as input, but we obviously wouldn’t have one when running from the command line in batch mode. No problem—Python’s duck typing makes it easy to create objects that implement just enough of an interface to get a particular job done.

The real problem—what Fred Brooks Jr. would call the “intrinsic complexity”, as opposed to the “accidental complexity”—was how to format links. In order to decide whether to format a particular CamelCaseWord as a link or not, the wiki processor needed to know what other pages the system contained. That information was stored in DrProject‘s database. Should the standalone wiki processor depend on access to a database? That would be inconvenient; it would also require us to give users read privileges for the database, which would be a big security risk.

What about decoupling the wiki processor from the database? If the wiki processor only ever called a function get_page_names, we could import a database-specific version in DrProject, and something else in the standalone wiki processor. That “something else” could read from a file, take a bunch of page names as command-line arguments, or anything else.

That sounded sensible—until we started thinking about wiki macros. A macro is a piece of Python code that generates HTML. If copyright.py is saved in the right directory, for example, and contains a function called copyright() that returns the string "Copyright © 2006 University of Toronto" then putting [[copyright()]] in a wiki page will insert the copyright notice. Trac‘s Macro Bazaar lists dozens of macros for navigation, inclusion, charts, and much else.

How is the standalone wiki processor supposed to handle macros? Loading the Python code that implements a particular macro isn’t (much of) a problem; the problem is that there’s no way of knowing what data a particular macro is going to want. Something that displays the last time a wiki page was edited, for example, is going to need access to an actualy DrProject database. (Either that, or we create a mock object that has all the features of such a database, which seems a little silly…)

In the end, we punted: after working on the standalone wiki processor for more than a week, Apple and David had learned enough about DrProject‘s internals to start fixing bugs and implementing new features, so we put set the standalone wiki processor aside. I’d like to return to it some day, though; being able to batch process wiki pages and tickets (particularly for printing) is still an attractive idea.

DrProject, Teaching

DrProject Internals: Security Part 2

October 28th, 2006

The previous article in this series introduced a simple security model based on authentication, authorization, and access control, then described how DrProject implements the first of these. That still leaves two important pieces, though: how to represent who’s allowed to do what, and how to enforce those rules.

The key concept here is that of a permission, which combines a user profile (e.g., an account) with the capability to perform some action. A simple authorization system associates every operation with a capability, such as EDIT_WIKI_PAGE or DELETE_USER_ACCOUNT. When a user U attempts to perform an operation that requires a capability C, the authorization system looks for the pair (U,C) in its database (Figure 1). If that pair exists, the operation is allowed; if it doesn’t, the operation is denied.

But what about the thing the user is operating on? A filesystem can’t simply say that Alan is allowed to view files; it has to make that decision on a file-by-file basis. Should permissions therefore be subject-verb-object triples?

We thought about doing this in DrProject, but decided against it. There might actually be situations in which we’d want to control which wiki pages a particular user could edit, but we couldn’t think of any. Plus, the fact that we’re saving old versions of pages means that if anyone ever does something we don’t like, we can undo it [1].

DrProject‘s authorization database does actually contain triples, though, and the reason is that each installation may support multiple projects. I’ll discuss this in more detail in a future posting, but essentially, each project is a logically separate entity, with its own independent collection of wiki pages. DrProject therefore has to record (U,P,C) triples, where U is the user, P is the project, and C is the capability. The table that stores wiki pages also has an extra column to show which project a page belongs to (Figure 2).

We could stop here—DrProject‘s predecessor Trac did [2]. However, if an installation has 20 projects and 60 users, and that six capabilities are required to define what can be done to the wiki, the system administrator potentially has to manage 7200 distinct permissions. Figuring out who can do what (or worse, who was able to do what, when) quickly becomes an administrative nightmare.

In practice, permissions are usually granted in a few stereotypical ways. Someone who can create pages can almost always edit and delete them; someone who can’t create user accounts usually can’t change their settings, either. DrProject makes this concrete by using a level of indirection in its authorization system. A role is a uniquely named (and possibly empty) set of atomic capabilities. Each user has a role with respect to each project. The number of distinct roles is surprisingly small: the DrProject we’re using to manage undergraduate projects this term only has three, and I’d be surprised if any system ever needed more than a dozen.

Implementing roles is a little bit trickier than it first seems. To see why, think about how the system defines what someone who hasn’t yet logged in is able to do. We represent this by creating a user profile for the anonymous user, and then defining the anonymous user’s role with respect to each project.

That part is simple. What isn’t simple (or at least, it didn’t appear simple to me) is that a user’s actual permissions must be the union of the permissions defined by her role (if any) with respect to a project, and the permissions that the anonymous user has for that project. If this weren’t true, then it would be possible that a user could accomplish more by logging out than by being logged in.

The last thing we have to do with roles is decide which of them count as “being a member” of a project. People use this phrase all the time, but flattening a complex collection of capabilities into a Boolean decision is actually quite hard. To jump ahead a bit, it seems sensible to say that only the members of a project are automatically on that project’s mailing list, but does that mean someone whose role allows them to file tickets, but not to update or work on those tickets, gets mail or not? What about someone who’s allowed to view and comment on the wiki, but not allowed to create new pages? We played around with various rules, and finally decided that the simplest thing was to throw out the idea of “membership” per se. If a role has any capability in the set {MAIL_POST, MAIL_VIEW, MAIL_DELETE}, messages sent to the project mailing list will be forwarded to users with that role.

After all this, access control is fairly straightforward to implement. Every time it receives an HTTP request, DrProject constructs a request object that stores references to user profile and a project descriptor (Figure 3). The user and project IDs are used to look up a role; that role’s capabilities are merged with those that the anonymous user has with respect to that project to create a description of what the user is allowed to do. Every method that requires a capability then checks that its capability is in that set, and throws a permission exception if it’s not.

Next time: tickets.


[1] Well, almost. DrProject‘s wiki does allow users to completely erase all record of a page from the database. If a malicious or clumsy user does this, the only way to recover the page is from the last system backup. We may get around this in future by associating an “existence” flag with each page to signal whether it should be treated as being there or not, but this is a pretty minor concern.[2] Trac was actually a little smarter than this, but only a little. Trac managed authorization by storing raw (user, capability) pairs, such as (ghopper, WIKI_EDIT). Both fields were strings; to determine a user’s actual capabilities, Trac recursively expanded the “raw” capabilities by looking for them in the USER column. For example, if ghopper had the capability ADMIN, and the “user” ADMIN had the capabilities WIKI_EDIT and WIKI_DELETE, then ghopper was given those capabilities as well. DrProject‘s strict two-level separation of users from roles, and roles from capabilities, has been a lot easier to administer, test, and debug.

DrProject, Teaching

Adam Goucher’s “QA 101″

October 28th, 2006
Comments Off

Adam Goucher is the best tester I’ve ever worked with.  He is now teaching an introductory course in Toronto called “QA 101“, which includes discussion of how to automate routine setup and testing with Python.

Announcements

DrProject Internals: Security Part 1

October 27th, 2006

Last time around, I described the architecture of a very simple wiki system that stored pages, along with their histories and meta-data, in a database, and let users view and edit those pages over the web. In an ideal world, the next step would be to add either a work ticketing system, or an interface to version control.

But our world is far from ideal: there are pranksters out there, and spammers, and outright villains, too. The next step in our rational reconstruction of DrProject is therefore to worry about security. Bitter experience shows that it’s hard—some would say impossible—to make systems secure after the fact. Security must be designed in, and tested, right from day one.

The simplest useful model of security breaks the problem down into authentication, authorization, and access control. Authentication is the process of binding a session to a stored identity. This is not the same as establishing who the user is: people can easily share user IDs and passwords, and even biometric systems can be spoofed. Instead, authentication takes something the user is, has, or knows, like a fingerprint, password, or smart card, and figures out which of the user profiles stored in the system it corresponds to.

Authorization is the determination of who can do what. Can user X read file Y? Can she append data to it? Delete it? Give someone else permission to do any of these operations? Lastly, access control is the enforcement of these rules; it’s what prevents X from reading things she’s not allowed to, no matter how cleverly she asks.

The first thing this simple security model does is help us think about possible attacks. To break in, an attacker must:

  • convince the system she’s someone she’s not;
  • get permissions she isn’t supposed to have; or
  • bypass the controls that are supposed to prevent her from doing something.

Secondly, this model can help us build a domain model—an abstract picture of what a security system needs to contain. Here are some of the concepts we have so far:

  • User profile: a unique electronic identity, such as a login account. A single human being may have many profiles in the system; many human beings may have access to a single profile.
  • Credentials: what an actual user is, has, or knows that binds her to a user profile.
  • Authentication mechanism: something that finds the user profile corresponding to a set of credentials. Every authentication mechanism needs a way to say, “I don’t recognize these credentials.” This is usually done either by signalling an error, or by returning a specially-marked user profile.
  • Capability: something that can be done to something in the system, such as reading the contents of a particular file, changing the “last modified” time of a particular directory, deleting a particular user profile, and so on. The word “particular” is important here, because the system needs to distinguish the capability of deleting file X from the capability of deleting file Y (or all files).
  • Permission: a pairing of a user profile and a capability, i.e., a representation of user X’s right to perform operation Z.

Let’s start with user profiles. Most web-based systems start off by managing these themselves: users create uniquely-named accounts and choose passwords, which are then stored in the system’s database. These schemes run into all kinds of trouble as the system grows:

  1. People have a hard time remembering dozens of different account names and passwords, so they either forget them (which adds to the user support burden), or re-use them (which means that when one system is compromised, others can be broken as well).
  2. The more places user information is stored, the harder it is to keep everything up to date. In our department, for example, we have to keep track of over a thousand students as they add and drop courses, change degree programs, and so on. Keeping track of all that is hard; keeping track of it twice would be a nightmare.
  3. Managing passwords and other credentials requires a lot of tricky code. On one of the systems I used to administer, user passwords had to be at least eight characters long, with at least two non-alphabetic characters. They couldn’t contain dictionary words or use simple spe11ing trix, had to be changed every three months, and couldn’t be recycled within a year. This is not code you want to have to write twice…

For all these reasons, DrProject doesn’t manage accounts itself. Instead, it passes the credentials users give it (such as IDs and passwords) to an external program called validate. Here in Toronto, that program checks those credentials against the host Linux system’s password file (Figure 1). At Queen’s, on the other hand, those credentials are checked against the university-wide Kerberos system.

Time for a quick FAQ:

Why use an external program? Why not just have the CGI program check the credentials?
The DrProject CGI program runs under the same user ID as the web server. Typically, this is a dummy account called www-data or apache which has very few privileges (to limit the damage an attacker can do by compromising the web server). We didn’t want to give the web server account access to the password file, so we created a separate program that uses Unix’s setuid mechanism to run under a different identity.
How does DrProject communicate with validate?
DrProject writes the user ID and password to validate‘s standard input, and validate then returns either 0 or 1. It would have been simpler to pass the user ID and password as command-line arguments, i.e., to run validate as a sub-process with validate myName myPassword. However, this would create a security hole: if an attacker with an account on the host ran the ps command at the right moment, with the right flags, she could see validate‘s arguments, and harvest the user ID and password. Also, more complex credentials such as digital certificates can’t be represented as short strings.
Does everybody with a Unix account automatically have access to DrProject?
No. The administrator still has to tell DrProject which of the underlying Unix accounts to recognize. However, that’s all the administrator has to do: when the user changes her Unix password, DrProject automatically “sees” the change.

Now that our user has authenticated, we can move on to—oops, wait a second. HTTP is a stateless protocol: each request is completely separate from each other. We don’t want users to have to re-send their credentials every time they click on a link, so we have to find some way to keep track of them after they have logged in. (In fact, having the system keep track of them after they provide their credentials is what we mean by “logging in”.)

The standard way to do this is with cookies. A cookie is a short piece of text that can be passed back and forth in the headers of HTTP requests and responses. If a CGI program puts a cookie in the header of an HTTP response, then the client can send it back with the next request. The technical term for this is a nonce; like half of a torn playing card, someone can use it to re-establish their identity at some future point.

Back in the bad old days, programmers sometimes used real data as cookies: for example, I remember a system (I actually wrote it *cough*) that used the user ID as the cookie value. Of course, this mean that anyone who knew how to create an HTTP request could impersonate anyone else. Using a sequence of numbers, such as 1000, 1001, 1002, … doesn’t help: an attacker can still:

  • log in;
  • look at her cookie value;
  • add or subtract one; and
  • have good odds of hijacking someone else’s session.

Most modern systems therefore generate a random number (or random string), and use that as a cookie. Internally, such systems store a dictionary that maps cookie values to sessions; each session has a reference to a user profile, and any other data the system needs to remember. One piece of data is when the cookie was generated: if the system is presented with a cookie that’s a week old, it may decide that someone is trying a replay attack, and refuse to accept it. (This is why so many web systems throw you out if you’re idle for too long.)

By themselves, randomly-generated session IDs aren’t enough to make the system secure, because attackers can use packet sniffers and other tools to eavesdrop on network traffic. We actually offer a course (CSC309: Web Programming) that teaches students the principles behind this, so we had to ensure that DrProject wasn’t vulnerable to such attacks. The solution we adopted was simple: every connection to DrProject uses HTTPS, the encrypted version of HTTP. Encrypting and decrypting messages does slow the system down, but (a) the slowdown isn’t very large (a few percent), and (b) slowing down is better than plowing into a lamppost.

One last note on sessions: most people never bother to log out of systems that are less important to them than their bank or dating service. “Stale” sessions can therefore accumulate in the database over time. The space these take up isn’t much of a concern (unless you’re Hotmail, Yahoo, or one of the other giants), but each stale session is another point of attack. DrProject and systems like it therefore have to implement some garbage collection mechanism that sweeps through the sessions periodically, closing and deleting any that have expired.

Implementing this would be easy if DrProject was a long-running process: we’d just set a timer, and take a couple of seconds every ten minutes or so to check the timestamp on every open session. But DrProject is a CGI: it only runs when a request comes in. Our choices were therefore:

  1. have it do a little garbage collection each time it ran; or
  2. use cron to run a separate program every ten minutes or so.

We chose the first option, since it was one less thing for the sys admin would have to install, configure, and remember to restart. The downside is that every once in a while, a user’s attempt to view a page is delayed by a second or two. However, this happens anyway because of network traffic, so people don’t notice.

Next time: how we implemented authorization and access control.

DrProject, Teaching

MonkeyBean

October 26th, 2006
Comments Off

Victor Ng’s company, MonkeyBean, builds web tools for adventure travel companies.  They have a video up on YouTube that’s kind of fun — especially if you like retread 70s TV theme music ;-)

Uncategorized

Award Winners Redux

October 26th, 2006
Comments Off

Pat Smith (ex-49X) was the official photographer at last night’s undergraduate awards ceremony. Maria Khomenko and Jonathan Lung were among the recipients — very pleased to see them recognized.

Teaching

German Version of “Bottleneck”

October 26th, 2006
Comments Off

A German version of my article “Where’s the Real Bottleneck in Scientific Computing?” has just appeared in Spektrum magazine. Pay-per-view, unfortunately, but the Software Carpentry site has had a flurry of hits from .de domains.


Dec 1: I just received my copy in the mail—I sound so much…sterner…in German ;-)

Gregory V. Wilson ist Professor für Computerwissenschaft an der University of Toronto. Sein Kurs ist erhältlich unter www.swc.scipy.org/.

Software Carpentry