
Summer of Code on One Page

June 2nd, 2006

This year’s Summer of Code recipients were announced last week. I wanted to browse the list offline, but doing that from the SoC site would have meant clicking through 102 separate pages (one per sponsoring organization). No problem: Python’s urllib lets me download pages as easily as I’d read files, and with minidom I can parse them and pull out the information I want.
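
A minimal sketch of that download-and-parse step, written for current Python 3 (where `urlopen` lives in `urllib.request`) rather than the Python of 2006. The function names, the sample markup, and the choice to pull out `<a>` elements are assumptions for illustration; the actual layout of the SoC pages isn't shown here.

```python
from urllib.request import urlopen
from xml.dom import minidom


def fetch(url):
    """Download a page and return its raw bytes (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read()


def link_texts(markup):
    """Return (href, text) for every <a> element in well-formed markup."""
    doc = minidom.parseString(markup)
    results = []
    for anchor in doc.getElementsByTagName("a"):
        # Collect only the direct text children of the anchor.
        text = "".join(
            node.data
            for node in anchor.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        results.append((anchor.getAttribute("href"), text.strip()))
    return results


# Hypothetical stand-in for one downloaded page.
sample = (
    '<html><body>'
    '<a href="/org/1">Project A</a>'
    '<a href="/org/2">Project B</a>'
    '</body></html>'
)
print(link_texts(sample))
```

Note that minidom is an XML parser, so it only accepts markup that is well-formed.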

…except that the HTML on Google’s site doesn’t quote attribute values properly: there are many uses of class=foo instead of class="foo", and similar potholes. OK, my 10-line script turns into 20 lines to transform these so that minidom is happy…
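
One way such a cleanup could look, as a rough sketch rather than the script actually used: a regular expression that walks each tag, leaves already-quoted values alone, and wraps bare values in double quotes. The function name and sample markup are made up for the example.

```python
import re
from xml.dom import minidom

# Inside a tag, match either an already-quoted string (kept as-is)
# or "=" followed by a bare, unquoted value (which gets quoted).
ATTR = re.compile(r'''("[^"]*"|'[^']*')|=([^\s"'<>=]+)''')
TAG = re.compile(r'<[^<>]+>')


def quote_attributes(markup):
    """Wrap unquoted attribute values in double quotes."""

    def fix_attr(match):
        if match.group(1) is not None:
            return match.group(1)  # quoted value: leave untouched
        return '="' + match.group(2) + '"'

    def fix_tag(match):
        return ATTR.sub(fix_attr, match.group(0))

    # Only rewrite text inside tags, never the page's visible text.
    return TAG.sub(fix_tag, markup)


messy = '<td class=foo id="x y=z">a=b</td>'
clean = quote_attributes(messy)
print(clean)
minidom.parseString(clean)  # parses now; the original would raise an error
```

This is deliberately naive: it assumes no `>` appears inside a quoted value, it would swallow the slash in something like `<br class=foo/>`, and it does nothing about unclosed tags, which minidom also rejects.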

…and then I run into the problem of character encodings and HTML entities. The polite, professional thing would be to spend 10 minutes remembering how to get the Polish-L-with-a-slash-through-it to display properly in Firefox, and print correctly. Instead, I add another 10 lines to my script to translate the non-ASCII characters as I go, and bing, there’s the page I wanted.
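
A sketch of what that kind of translation might look like in current Python 3, assuming the goal is plain ASCII output. The `EXTRA` table and the `to_ascii` name are hypothetical. The explicit table is needed because letters such as ł have no Unicode decomposition, so normalization alone leaves them untouched.

```python
import html
import unicodedata

# Letters with no Unicode decomposition need an explicit replacement.
EXTRA = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "ß": "ss"}


def to_ascii(text):
    """Decode HTML entities, then reduce the text to plain ASCII."""
    text = html.unescape(text)
    out = []
    for ch in text:
        if ch in EXTRA:
            out.append(EXTRA[ch])
            continue
        # Split accented letters into base letter + combining marks,
        # then keep only the ASCII part.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


print(to_ascii("Micha&#322; Ko&#347;ciuszko"))
```

Two caveats: characters that are neither in the table nor decomposable are silently dropped, and `html.unescape` also turns `&lt;` and `&amp;` back into `<` and `&`, so this should be applied to extracted text, not to markup that still has to be parsed.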

So yes, it probably would have been quicker to click-print-back-down 102 times, but I’ve saved some trees this way, and can share my results with you.


  1. June 2nd, 2006 at 09:22 | #1

    Wouldn’t it have been easier to use a more forgiving parser like BeautifulSoup? Then all those quoting issues wouldn’t have been a problem, and I saw that the most recent BeautifulSoup has code which at least tries to guess the encoding for you.

  2. June 2nd, 2006 at 09:23 | #2

    Oh, and thanks – that’s a useful page. I clicked through on the interesting projects that I knew on the google site, but this is much easier to read.
