Fifty... Million... Calls

“House” and “CSI” (the original) are two of my favorite TV shows. Partly, it’s the characters—Hugh Laurie and Marg Helgenberger are both brilliant—but I also enjoy watching smart people debug. Sure, they’re dealing with diseases and crime scenes, but deep down it’s all the same: something’s wrong, and you have to figure out what.

The Argon team spent a lot of the last five days doing exactly that. For those of you who haven’t been following along, Argon is a fork of Trac, a lightweight web-based project management framework that’s implemented in Python. We’ve been working for several months to modify it so that we can use it to run undergraduate programming projects; the interface has been redesigned, it now handles multiple projects, we’re using PostgreSQL instead of SQLite (partly for performance, and partly because SQLite contains a synchronization bug), and so on.

A couple of weeks ago, we finally installed it on CDF, the Computing Disciplines Facility servers that host undergraduate work during term. It took longer than anticipated to find a set of file ownerships and permissions that would allow students access to their projects, but only their projects, while giving TAs and lecturers access to everything.

We got it sorted out eventually, but then we ran into another problem: our application was really, really slow—it was taking 10 seconds or more to service each CGI request. There was no way we could put it into production as it was.

While we were poking around at our end, Chris Lenz had a look. The first thing he told us was that we should map the htdocs directory (the one containing all the static resources) to some URL that wouldn’t trigger an invocation of the CGI, so that Apache wouldn’t have to invoke Python for each image or icon.
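In Apache terms, that fix looks something like the fragment below; the URL prefix and filesystem paths here are placeholders, not our actual setup:

# Hypothetical httpd.conf fragment: let Apache serve static resources
# directly instead of routing every image and icon through the CGI.
Alias /argon/htdocs "/usr/local/argon/share/htdocs"
<Directory "/usr/local/argon/share/htdocs">
    Order allow,deny
    Allow from all
</Directory>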

The second thing was that our code ran in under half a second on his unplugged PowerBook. He and Keir Mierle (one of the students working here in Toronto) profiled the code, and found that almost all the difference lay in the time it was taking the CGI to import the libraries it needed. On Chris’s machine, this was a few hundred milliseconds; on ours, it was several seconds.
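If you want to see this effect for yourself, a crude way to measure it is to time the imports directly; the modules named below are stand-ins for whatever your script actually loads:

# Crude measurement of import cost (module names are placeholders for
# the libraries the CGI actually pulls in on every request).
import time

start = time.time()
import cgi
import urllib
import xml.dom.minidom
print("imports took %.3f seconds" % (time.time() - start))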

All right, down a layer we go. Keir profiled Python itself using gprof, and found this:

  %   cumulative   self               self    total
 time   seconds   seconds      calls  s/call  s/call  name
38.46      0.70      0.70   51442563    0.00    0.00  _PyUnicodeUCS2_IsLinebreak
15.93      0.99      0.29      28141    0.00    0.00  PyUnicodeUCS2_Splitlines
 3.85      1.06      0.07      83334    0.00    0.00  classify
 3.30      1.12      0.06      55512    0.00    0.00  PyFrame_New
 2.20      1.16      0.04    1233298    0.00    0.00  _PyUnicode_New
 2.20      1.20      0.04      50779    0.00    0.00  list_dealloc
 1.65      1.23      0.03    1416785    0.00    0.00  app1
 1.65      1.26      0.03     660223    0.00    0.00  dict_subscript
 1.65      1.29      0.03     515599    0.00    0.00  PyDict_GetItem
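(If you want to reproduce this kind of profile, you need an interpreter compiled with profiling support; the session below is a sketch, and the configure flag and paths may differ on your platform:)

# Hypothetical build-and-profile session; details vary by platform.
./configure --enable-profiling   # build CPython with -pg for gprof
make
./python /path/to/script.py      # running the workload writes gmon.out
gprof ./python gmon.out          # prints a flat profile like the one above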

That’s right: for some reason, each time the CGI ran, it called the function that sees whether a character is a Unicode end-of-line marker 50 million times.

Hm… If we were parsing XML or some other data files, I could see why we’d call _PyUnicodeUCS2_IsLinebreak, but this was happening as we were importing libraries. Ah, but: each of the source files we inherited from Trac began with the line:

# -*- coding: iso8859-1 -*-

This is an Emacsism that recent versions of Python have adopted to tell the interpreter how the characters in the source file are encoded. (If you don’t know what that means, please check out Joel Spolsky’s excellent article.) When we took those pragmas out of our files, the number of calls to _PyUnicodeUCS2_IsLinebreak dropped to 29 million. That’s still a lot, but it was a sign we were on the right track.

We then got tangled up for a while with .pyc files. When Python imports a .py module, it compiles it to bytecode, and then saves the bytecode to disk. The next time it’s asked to import the module, it checks the timestamp on the corresponding .pyc; if the .pyc is fresher, Python slurps that into memory instead of recompiling the source. Of course, this only works if the Python interpreter has permission to write to the directory that the .py file is stored in. In our case, it couldn’t, because the interpreter was running with the user ID of the web server, and giving it write access to the source directory would open up a fairly large security hole.

Our solution was to compile all the .py files to .pyc’s manually every time we updated the source code (just as Python does to its standard library when you install it on a new machine). For a while, though, we were running our tests with some .pyc’s deleted, but others still present, which is why we were still getting those 29 million calls to _PyUnicodeUCS2_IsLinebreak. Once we got that cleaned up, the number of calls dropped to zero, and the run time went from over 10 seconds per hit to 2 or less—well within our comfort zone.
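The standard library’s compileall module does the heavy lifting for this; a minimal sketch, with a placeholder path, is:

# Precompile every .py under the install tree so that the interpreter,
# running as the web server's user, never needs write access to it.
import compileall
compileall.compile_dir("/usr/local/argon/lib", force=True)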

OK, we’d fixed our problem, but we still didn’t really understand it. Karen Reid (my co-supervisor) and I have learned the hard way that just making bugs go away usually isn’t enough: if you don’t understand where they came from, they’ll probably come back when you least expect them.

We had Keir post a note to the Python developers’ list. It went out at 16:10 on Tuesday, August 23. At 05:45 the next morning, Walter Dorwald replied to say:

This is caused by the changes to the codecs in 2.4. Basically the codecs no longer rely on C's readline() to do line splitting (which can't work for UTF-16), but do it themselves (via unicode.splitlines()).
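Walter’s point about readline() is easy to demonstrate: Unicode has line-break characters that a byte-oriented readline() never sees. In Python 2 syntax (this post is about Python 2.4, after all):

# unicode.splitlines recognizes Unicode line boundaries such as
# U+2028 LINE SEPARATOR, which C's newline-based readline() misses.
text = u"one\ntwo\u2028three"
print(text.splitlines())   # [u'one', u'two', u'three']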

Half an hour later, Martin v. Löwis followed up with:

That explains why you get any calls to IsLineBreak; it doesn't explain why you get so many of them. I investigated this a bit, and one issue seems to be that StreamReader.readline performs splitline on the entire input, only to fetch the first line. It then joins the rest for later processing. In addition, it also performs splitlines on a single line, just to strip any trailing line breaks. The net effect is that, for a file with N lines, IsLineBreak is invoked up to N*N/2 times per character (at least for the last character).

Yup—the reader was invoking _PyUnicodeUCS2_IsLinebreak N² times. Square root of fifty million is… carry the three… about seven thousand, which happens to be about the number of lines of code we were loading.
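To make the arithmetic concrete, here is a toy model of the readline strategy Martin describes (not the actual codecs code, just its shape):

# Toy model of the quadratic readline: split the entire remaining buffer,
# keep the first line, and re-join the rest for the next call.
def naive_readline(buf):
    lines = buf.splitlines(True)          # scans every character left in buf
    return lines[0], u"".join(lines[1:])  # one line out, the rest rebuilt

buf = u"a line of code\n" * 7000   # roughly what our imports were reading
while buf:
    line, buf = naive_readline(buf)   # ~7000 calls, each scanning O(N) chars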

There have been over a dozen messages on the Python developers’ list since then about how to fix this. The most interesting part for me hasn’t been the proposals themselves, but rather the fact that the first thing Walter and Martin did was write little test programs and profile them to find out how fast things would actually run. To put it another way, good programmers like Walter and Martin automatically apply the scientific method to problems—they design and run experiments, collect data, and then formulate a hypothesis (a.k.a. a bug fix). It’s cool to be able to watch this in action, and even cooler to have code that runs fast enough to be called interactive.
