Real World Data Causes Perl
April 3rd, 2008
The title was Blake Winton‘s reaction to my re-telling of Margaret Menzin‘s story from this year’s “It Seemed Like a Good Idea at the Time” session at SIGCSE. She asked her students to write a program that would sort names in library order (i.e., “Gregory V. Wilson” would be sorted under “W”). Turns out the rules are, well, real-world-ish:
- Leonardo da Vinci (“da Vinci” just means “from Vinci”)
- Catherine de Medici (family name)
- Juan Ponce de Leon (full family name is “Ponce de Leon”)
- Jean de La Fontaine (family name is “La Fontaine”)
- Gabriel Garcia Marquez (double-barrelled Spanish surnames)
- Wernher von Braun (depending on whether he was in Germany or the US)
- Elizabeth Alexandra May Windsor (monarchs sort by the name under which they took the throne)
- Thomas a Beckett (special rules for saints, too)
- Mao Tse-tung (Chinese family names come first)
Oh, and if you have names written in multiple languages? You sort the languages first, according to their names in English, then sort within each language using its own rules. And we haven’t even started on:
- Hon. Father Robert F. Drinan, S.J., L.L.D.
- Rev. Dr. Martin Luther King, Jr.
- Augusta Ada Byron King, Lady Lovelace
- Major General Stanley
Later: the indefatigable Adam Goucher pointed me at NACO, who decide such things. Real world data is guilty of a lot more than Perl…
I have written site for book readers. Thanks god I have decided not to sort them correctly
I think the example is rather contrived. This “library order” consists of arbitrary rules that are only necessary if you need a well-defined sorting order.
For a real-world application (an index or any other sorted structure intended to be searched by humans) you would probably duplicate the names:
- da Vinci, Leonardo: cf. Leonardo da Vinci
…
- Leonardo da Vinci -> …
…
- Vinci, Leonardo da: cf. Leonardo da Vinci
I believe this is generally done in good books to reduce the number of “cache misses”. Humans don’t probably know the rules, either, so providing only one item is inconvenient.
It has the advantage of being easier to implement, as well