GSoC Stats: The Last 10% Is The Hardest

As I said in a post on April 4, I’m trying to figure out how many Google Summer of Code students have come from different schools since the program started in 2005. The question was prompted by the release on April 2 of a spreadsheet of statistics about the program (blog post here, spreadsheet here). According to the sixth tab, the University of Toronto stands second overall for number of accepted students (just behind Sri Lanka’s University of Moratuwa—go UM!).

The problem is, I don’t believe the figures for 2005 and 2006. I could well believe that a couple of U of T students slipped by me in those years, but according to the spreadsheet, I somehow failed to notice 14. I know I’m getting sloppy in my old age, but: 14? Really? And if the numbers for U of T are off, why should I have any confidence in any of the spreadsheet’s other data?

The simplest way to check the numbers would be to get a complete organization/project/mentor/student/school list from Google, but they’re not allowed to hand that out for legal reasons (which I support). I could try trawling the web sites of the organizations that took part in those years to see if they’ve posted a list, but (a) that would take more time than I’m willing to invest, and (b) the chances of success are low. I had a look at the KML files used to produce the mentor/student pairing maps from 2005, 2006, 2007, and 2008, but as I pointed out in my earlier post, they’re not consistently formatted, and don’t contain school information anyway.

Long story short, most of the information I want is freely available, by which I mean that getting it won’t cost me either money or (significant) time. However, the last 10% that I need to answer my question is too expensive for me to collect: I know it must be out there, but it would take days (at least) to track it down. This has happened to me so often recently that I’m starting to wonder if I’m tripping over some kind of long-tail/power-law distribution for the availability of data (which intuitively makes sense if information behaves like a sparsely connected graph). The alternative (which also makes intuitive sense) is that Skynet is toying with me; I’m not sure which I’d rather believe…