DrProject Internals: Subversion

It’s finally time to look at how DrProject integrates with Subversion. “Integrates” is the key word here: whereas we (and Trac’s designers before us) had a free hand with the ticketing system and wiki, Subversion and other version control systems are complex enough that we have to base our design on what they can do, rather than what we might want.

Lucky for us, Subversion’s designers had lots of experience with previous version control systems, and so were careful to provide tools that would make integration easy. The best way to appreciate these tools is to compare the Bad Old Days (CVS in the early 1990s) with our modern utopia. The first time I had to mess around with it, the source code for CVS was a tangled mess—so tangled that the best (possibly only) way to fetch a list of recent commit messages was to run the command in a sub-shell and parse its output. Think for a moment about what that involves:

  1. My application formats a string containing a CVS command.
  2. It passes that string to a shell running in a sub-process.
  3. That shell starts another process to execute the cvs program (unless the PATH variable has been mangled too badly by all this forking and exec'ing).
  4. The cvs program calls a bunch of C functions (some of which might actually start sub-shells of their own, but that's another story) to extract information from the versioning files and metadata in the repository.
  5. My application reads that command's output as a list of strings and runs it through a handwritten parser that (hopefully) extracts dates, user IDs, and commit messages.
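
The five steps above can be sketched in Python roughly as follows. This is only an illustration of the pattern, not code from DrProject: the function names are hypothetical, and the sample text merely approximates the layout of `cvs log` output.

```python
import re
import shlex
import subprocess

# Pattern for the "date: ...;  author: ...;" line. The layout is an
# approximation of `cvs log` output, assumed here for illustration.
HEADER = re.compile(r"date: (?P<date>[^;]+);\s+author: (?P<author>[^;]+);")
SEPARATOR = "-" * 28  # assumed divider between log entries


def run_cvs_log(filename):
    """Steps 1-4: format a command string, hand it to a shell, capture output."""
    command = "cvs log " + shlex.quote(filename)
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_cvs_log(text):
    """Step 5: a handwritten parser that extracts date, author, and message."""
    entries = []
    for chunk in text.split(SEPARATOR):
        lines = chunk.strip().splitlines()
        for i, line in enumerate(lines):
            match = HEADER.search(line)
            if match:
                entries.append({
                    "date": match.group("date").strip(),
                    "author": match.group("author").strip(),
                    "message": "\n".join(lines[i + 1:]).strip(),
                })
                break
    return entries


# Made-up sample standing in for real command output.
SAMPLE = "\n".join([
    "revision 1.2",
    "date: 2006/03/01 10:15:00;  author: olga;  state: Exp;  lines: +2 -1",
    "Fix off-by-one error in parser",
    SEPARATOR,
    "revision 1.1",
    "date: 2006/02/27 09:00:00;  author: maxim;  state: Exp;",
    "Initial import",
])

entries = parse_cvs_log(SAMPLE)
```

Note how fragile this is: a commit message that happens to contain a row of hyphens would split one entry into two, and any change in the tool's output format silently breaks the parser.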

Subversion’s design makes the first three steps unnecessary. It has a well-defined C API [1], which provides functions for doing (almost) all user-level operations. Command-line programs like svn and svnadmin call these functions, but Subversion also provides adapter libraries to make them available to Python, Java, and other languages. As a result, programmers don’t have to fork sub-processes or parse strings; they can instead call a function and get a data structure as a result.

All right: what information do project members actually want about their project’s repository and its contents? “What’s there?” (i.e., a listing of available files) is pretty obvious, along with what’s in particular files, what used to be in them, and a list of change sets. If we’re showing what used to be in files, we ought to show the differences too; and if we’re showing change sets, we ought to provide a multi-file view of the overall differences.

Hm… What about access control? How are we going to ensure that only people who are members of a project [2] can view the contents of the project’s repository? And what exactly do we mean by “the project’s repository”, anyway? Is there going to be one repository for each of the projects DrProject is managing, or would it be simpler and/or safer to partition one big repository into project-sized chunks?

Subversion supports the latter: you can create an access file that gives particular users read and/or write permission for sub-directories within a repository. However, this is what Joel Spolsky famously called a leaky abstraction. To see why, consider a situation in which Olga can read and write both the red and green directories in a repository, but Maxim can only read and write the green directory. If Olga commits changes to red/reddish.java and green/greenish.java in a single operation, what should we show Maxim when he asks to view the change set? We can hide the contents of, and changes to, the file he’s not allowed to view, but he’ll still be able to read Olga’s commit message, which may (if Olga is conscientious) tell him a lot about what’s going on in parts of the world he’s not supposed to know anything about.
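
The leak is easy to see in a toy model. The sketch below is not how DrProject or Subversion represents permissions; the `READABLE` table, the `view_for` function, and the commit message are all invented for illustration.

```python
# Hypothetical path-based read permissions for one shared repository.
READABLE = {
    "olga": ["red/", "green/"],
    "maxim": ["green/"],
}

# A single commit by Olga that touches both directories.
changeset = {
    "author": "olga",
    "message": "Rework pricing rules in red/reddish.java; adjust greenish to match",
    "paths": ["red/reddish.java", "green/greenish.java"],
}


def view_for(user, changeset):
    """Return a copy of the change set without paths the user may not read."""
    allowed = tuple(READABLE.get(user, []))
    visible = [p for p in changeset["paths"] if p.startswith(allowed)]
    return {**changeset, "paths": visible}


maxim_view = view_for("maxim", changeset)
# The forbidden path is gone from the file list, but the commit message
# still describes what happened in the red directory.
```

Filtering the path list is straightforward; filtering free-form text written by a human is not, which is why the commit message is the part that leaks.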

We therefore decided to use one repository per project. Each of these repositories has its own access file; when users are added to or removed from the project, DrProject modifies the access file appropriately [3]. This means that even if people bypass DrProject and try to access repositories using Eclipse, command-line programs, and other clients, their access rights will always be what we want them to be.
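
As a rough sketch of what "modifies the access file" could look like, the code below regenerates a Subversion-style authz file from a project's membership list. The function names and the shape of the `members` mapping are assumptions made for this example, not DrProject's actual code; the `[/]` section and the `user = rw` lines follow Subversion's path-based authorization file syntax.

```python
import os
import tempfile


def render_access_file(members):
    """Render an access file for a single-project repository.

    `members` maps a user name to "r" or "rw" (a hypothetical structure).
    """
    lines = ["[/]"]  # rules apply to the root of this project's repository
    for user in sorted(members):
        lines.append(f"{user} = {members[user]}")
    lines.append("* =")  # everyone not listed gets no access
    return "\n".join(lines) + "\n"


def write_access_file(path, members):
    """Write the file via a temporary file and rename, so that clients
    never read a half-written access file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(render_access_file(members))
    os.replace(tmp_path, path)


rendered = render_access_file({"olga": "rw", "maxim": "r"})
```

Regenerating the whole file from the membership list, rather than patching it line by line, keeps the file and the list from drifting apart.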

One thing DrProject does not provide is a way for users to modify the repository over the web. In particular, users cannot edit or commit files through their browser. We left this out for several reasons:

  1. It opens up a channel for attack: if the DrProject CGI is able to modify the repository, then anyone who subverts the CGI can do a lot of damage to the project's core resource.
  2. We didn't believe anyone would ever actually do any significant code editing through an HTML text editing box. (This may change in future, as rich editing controls become common; even today, it'd be nice to be able to add comments to commits after the fact.)
  3. Implementing it---in particular, implementing conflict resolution---would have at least tripled the complexity of this part of DrProject.
  4. Nobody else's system has it, so we figured there couldn't be a crying need ;-)

It turns out that accessing a repository via the Subversion API is a lot slower than querying a PostgreSQL database. To keep things zippy, DrProject caches the information it gets about the repository in the database, so that future lookups will be faster. This information is WORM (write-once, read-many): once it’s in the database, it stays there forever, and is never changed (except in those very rare cases when someone actually does edit a commit log message after the fact outside DrProject, in which case the database information is resynchronized).
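
The caching idea can be sketched as follows. This is a minimal illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `ChangesetCache` class, the table layout, and the `slow_fetch` function are all hypothetical rather than taken from DrProject.

```python
import sqlite3


class ChangesetCache:
    """Write-once cache of change set metadata in front of a slow repository."""

    def __init__(self, fetch_from_repository):
        self.fetch = fetch_from_repository  # slow call into the repository
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE changeset "
            "(rev INTEGER PRIMARY KEY, author TEXT, message TEXT)"
        )

    def get(self, rev):
        """Return (author, message), hitting the repository only on a miss."""
        row = self.db.execute(
            "SELECT author, message FROM changeset WHERE rev = ?", (rev,)
        ).fetchone()
        if row is None:
            row = tuple(self.fetch(rev))
            self.db.execute(
                "INSERT INTO changeset VALUES (?, ?, ?)", (rev, *row)
            )
            self.db.commit()
        return row

    def resync(self, rev):
        """Refresh one entry, e.g. after a log message was edited elsewhere."""
        author, message = self.fetch(rev)
        self.db.execute(
            "INSERT OR REPLACE INTO changeset VALUES (?, ?, ?)",
            (rev, author, message),
        )
        self.db.commit()


calls = []  # records every time the "repository" is actually consulted


def slow_fetch(rev):
    calls.append(rev)
    return ("olga", f"commit message for r{rev}")


cache = ChangesetCache(slow_fetch)
first = cache.get(7)   # miss: goes to the repository
second = cache.get(7)  # hit: served from the database
```

Because committed revisions do not normally change, the cache needs no expiry policy; the only invalidation path is the explicit resynchronization for edited log messages.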

This isn’t as big a disk hog as you might think, since most of what’s in a repository is never viewed over the web. However, we are a little worried about what might happen if we provide a web services API, so that people can write scripts to pull data out of DrProject. While a human being might not click on dozens of links to pull up all the files, revisions, change sets, and differences in a project, a script very easily could. We’ll see…

So how well is it working? Pretty well, although students don’t seem to use DrProject’s Subversion browser very much. One reason might be that they don’t need it—in small projects, done by small teams, on short timescales, the history of a project isn’t that important. Another reason may be that desktop tools (command-line programs and Eclipse plugins) give them a richer experience. Still, it does seem to be running smoothly, and the wiki formatting of commit messages, which automatically creates links to tickets, is something I personally rely on a fair bit.
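
To give a flavor of the ticket linking, here is a minimal sketch that assumes the Trac-style `#123` shorthand for tickets. The `link_tickets` function and the `/ticket/` URL prefix are invented for this example; the real wiki formatter handles far more syntax than this.

```python
import html
import re

# Assumed shorthand: a "#" followed by digits refers to a ticket.
TICKET_REF = re.compile(r"#(\d+)")


def link_tickets(message, base_url="/ticket/"):
    """Escape a commit message for HTML and turn #N into a ticket link."""
    parts = []
    last = 0
    for match in TICKET_REF.finditer(message):
        # Escape the plain text between ticket references.
        parts.append(html.escape(message[last:match.start()]))
        number = match.group(1)
        parts.append(f'<a href="{base_url}{number}">#{number}</a>')
        last = match.end()
    parts.append(html.escape(message[last:]))
    return "".join(parts)


rendered = link_tickets("Fixes #42 & closes #7")
```

Escaping the text between matches separately, rather than escaping the whole message first, avoids accidentally turning the `#` inside an HTML entity into a ticket link.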

[1] Yes, Subversion is written in C, for both execution speed and portability. The latter may seem like an oxymoron, given the bajillions of #ifdefs programmers have to use to actually make their C portable, but these have the advantage of being well understood. Getting C++ or Java to work on multiple platforms is actually no easier.

[2] More specifically, have a role with respect to that project that includes the capability to view the contents of the repository.

[3] This is only true if administrators use the functions in DrProject to edit user permissions. If someone edits the underlying database directly via SQL, it obviously won’t have the desired side effect of updating the Subversion access file.