Stymied

A few weeks ago, I asked for pointers to something that would translate 400-odd partial paper citations (usually just author names and paper titles) into full bibliography entries.  So far, none of the proposed solutions have worked out:

  1. Bibsonomy’s search only found 2 of the 10 papers I tried it on manually.
  2. JabRef did no better.
  3. CrossRef found none.
  4. The “possibly working Python script” for querying Google Scholar, and my home-grown attempts to do likewise, produced nothing but “your agent isn’t allowed access” messages. (Yes, I’ve included all the same headers that Firefox sends when I do searches manually—doesn’t make a difference. Google, they smart.)
  5. I didn’t get cb2bib to run on Mac OS X. Admittedly, I only tried for ten minutes, but this isn’t a critical path project for me.

*sigh*

6 thoughts on “Stymied”

  1. Eric O. LEBIGOT (EOL)

    Sorry to hear that your attempts have not produced much so far… That’s an interesting project!

    With regards to cb2bib, I’d like to point out that it is available via Fink (at least in the unstable branch).

  2. Aldo Chan

    OK, about fooling Google with the User-Agent setting: if you’ve tested your script against a local web server, you’ll notice that the User-Agent header is still set to urllib-blah-blah. You’ll need to subclass urllib.URLopener and set the class attribute version to your desired header,
    like this: http://paste.pocoo.org/show/156877/
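
The check described in this comment can be sketched in a few lines. This is not the code from the paste above (its contents are not reproduced here), and it uses Python 3's urllib.request.Request with an explicit headers dict instead of subclassing URLopener, which is an assumption about what a present-day equivalent would look like. The names EchoUserAgent and user_agent_seen_by_server are made up for illustration. It starts a throwaway local server that echoes back whatever User-Agent it receives, so you can see what your script really sends:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgent(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet: suppress per-request logging.
        pass


def user_agent_seen_by_server(user_agent=None):
    """Return the User-Agent string a local server observes for one request."""
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        headers = {"User-Agent": user_agent} if user_agent else {}
        request = urllib.request.Request(url, headers=headers)
        # An empty ProxyHandler bypasses any proxy set in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(user_agent_seen_by_server())                # urllib's default
    print(user_agent_seen_by_server("MyAgent/1.0"))   # overridden value
```

Without the override the server sees urllib's own default agent string; with it, the server sees whatever you passed. This only confirms which header goes out. It says nothing about whether a given site will accept the request.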

  3. Tony Wiliams

    Have you considered reverse engineering one of the Mycroft project Firefox search plugins? They include many searches of Google Scholar that use various proxies.

    Surely you could build a Python script with the hints from one of these.

    // Tony

  4. Nathan

    Have you looked at Mendeley? It’s a pretty nice solution for managing a library of academic papers. One of its features is integration with Google Scholar. I’ve found it works pretty well, but it does require some manual intervention.

    http://www.mendeley.com/

  5. Jim Graham

    The problem with any script that attempts to scrape Google Scholar is that Scholar detects spikes in activity from your IP (for some value of ‘spikes’) and will send you a Forbidden until you revalidate via CAPTCHA as a human. Scraping and scripts are against the Google ToS.

    AFAIK, this is because Google gives Scholar much less bandwidth than the main search engine, and they defend it much more aggressively than the regular search engine.
