Home > Uncategorized > Real World Data Causes Perl

Real World Data Causes Perl

April 3rd, 2008

The title was Blake Winton‘s reaction to my re-telling of Margaret Menzin‘s story from this year’s “It Seemed Like a Good Idea at the Time” session at SIGCSE. She asked her students to write a program that would sort names in library order (i.e., “Gregory V. Wilson” would be sorted under “W”). Turns out the rules are, well, real-world-ish:

  • Leonardo da Vinci (“da Vinci” just means “from Vinci”)
  • Catherine de Medici (family name)
  • Juan Ponce de Leon (full family name is “Ponce de Leon”)
  • Jean de La Fontaine (family name is “La Fontaine”)
  • Gabriel Garcia Marquez (double-barrelled Spanish surnames)
  • Wernher von Braun (depending on whether he was in Germany or the US)
  • Elizabeth Alexandra May Windsor (monarchs sort by the name under which they took the throne)
  • Thomas a Beckett (special rules for saints, too)
  • Mao Tse-tung (Chinese family names come first)

Oh, and if you have names written in multiple languages? You sort the languages first, according to their names in English, then sort within each language using its own rules. And we haven’t even started on:

  • Hon. Father Robert F. Drinan, S.J., L.L.D.
  • Rev. Dr. Martin Luther King, Jr.
  • Augusta Ada Byron King, Lady Lovelace
  • Major General Stanley

Later: the indefatigable Adam Goucher pointed me at NACO, who decide such things. Real world data is guilty of a lot more than Perl…

Uncategorized

  1. April 3rd, 2008 at 23:27 | #1

    I have written site for book readers. Thanks god I have decided not to sort them correctly :-)

  2. Leo
    April 4th, 2008 at 10:48 | #2

    I think the example is rather contrived. This “library order” consists of arbitrary rules that are only necessary if you need a well-defined sorting order.

    For a real-world application (an index or any other sorted structure intended to be searched by humans) you would probably duplicate the names:
    - da Vinci, Leonardo: cf. Leonardo da Vinci

    - Leonardo da Vinci -> …

    - Vinci, Leonardo da: cf. Leonardo da Vinci

    I believe this is generally done in good books to reduce the number of “cache misses”. Humans don’t probably know the rules, either, so providing only one item is inconvenient.

    It has the advantage of being easier to implement, as well ;-)

Comments are closed.