I started working on a short capstone example last month to show learners how to get a badly-formatted reference list out of an Excel spreadsheet and into a relational database so that it would be easy to answer questions like, "Who has co-authored papers with whom?" I'd like to work up another capstone as well, but there's a problem: I can't actually do it myself for reasons that are both technical and political.
The problem is easy to state: given a bibliography (like this one), find all papers published in the last N years that aren't in the bibliography but which reference items that are. The use case is someone wanting to find out what's been done in some area after they've been away from it for a while, and from a computer science point of view, the solution is easy:
- Get the DOIs of all the papers in the bibliography.
- Look up all papers whose references include one or more of those DOIs.
- Discard everything published before 2010 (or whatever the "since" year is), along with anything that is already in the bibliography.
One complication is that a third of the bibliography entries don't have DOIs, so the set of known DOIs built in the first step will be incomplete. That's not fatal in this case, since anything that references one entry in the bibliography will probably also reference others.
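To make the three steps concrete, here is a minimal Python sketch. It assumes the citation data is already in hand as an in-memory list, which is exactly the part I can't get from a public service. The `Paper` record, the function names, and the DOIs in the example are all made up for illustration, and lowercasing DOIs is a simplification that relies on DOIs being case-insensitive.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record for one paper in a citation index."""
    doi: str
    year: int
    references: set = field(default_factory=set)  # DOIs this paper cites


def known_dois(bibliography):
    """Step 1: collect the DOIs we have, skipping entries without one."""
    return {entry["doi"].lower() for entry in bibliography if entry.get("doi")}


def find_followups(bibliography, index, since):
    """Return DOIs of papers from `since` onward that cite the bibliography."""
    known = known_dois(bibliography)

    # Step 2: keep papers whose references include at least one known DOI.
    citing = [
        paper for paper in index
        if {ref.lower() for ref in paper.references} & known
    ]

    # Step 3: drop papers older than `since` and papers we already have.
    return sorted(
        paper.doi for paper in citing
        if paper.year >= since and paper.doi.lower() not in known
    )


# Made-up data: one bibliography entry has no DOI, as in the real list.
bibliography = [
    {"title": "A", "doi": "10.1000/a"},
    {"title": "B", "doi": None},
    {"title": "C", "doi": "10.1000/C"},
]
index = [
    Paper("10.2000/x", 2012, {"10.1000/a"}),      # recent, cites A
    Paper("10.2000/y", 2008, {"10.1000/c"}),      # cites C, but too old
    Paper("10.2000/z", 2015, {"10.3000/other"}),  # cites nothing we know
    Paper("10.1000/c", 2011, {"10.1000/a"}),      # already in the bibliography
]

followups = find_followups(bibliography, index, since=2010)
print(followups)
```

With this data, only `10.2000/x` survives all three steps. Everything here is easy except producing `index`: that list stands in for the reverse-citation lookup.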
The real problem is step number two—the one where the problem stops being technical and starts being political. I haven't been able to find any publicly available service that lets me search for papers whose references include one or more DOIs. "Publicly available" is important here: I live outside the Great Paywall of Academia; many of the tools that most scientists can use are blocked both for me and for many of our learners (particularly those in developing countries). I could solve that problem by wrangling a courtesy appointment at some university or other, but that would feel like cheating, because it would only solve the problem for me.
All of which brings us to George Orwell's essay on Charles Dickens. In it, Orwell said that while Dickens saw the sins of Victorian society very clearly, he never allowed himself to see that those sins were part and parcel of how that society was organized. The protagonists in Dickens' novels might be saved individually—an unexpected inheritance, a fortunate encounter with a pious benefactor, and the like—but such solutions leave the equally deserving people standing to their left and right still mired in muck. Equally, having friends who are still in academia mail me PDFs of papers doesn't help my neighbors, much less allow me to build a tool that would let me find researchers I should be talking to.
In 2013, I said that we would know we had won when scientists stopped submitting, publishing, and downloading papers, and started forking and merging projects instead. I now realize that goal was incomplete; what I should have said was:
Scientists won't submit, publish, and download papers.
They will fork and merge projects
that everyone has equal access to.
Friends sending me PDFs of papers won't bring that world closer, any more than giving one child a hot meal and a pair of shoes would have ended poverty in Dickens' London. Teaching scientists the skills they need to build the program I outlined will, but even that won't be enough. What we really need is for scientists to use what we're teaching to change the practice of science, so that anyone who wants to can write and run such tools. As Emma Lazarus said:
Until we are all free, we are none of us free.
Some people I respect said on Twitter that I shouldn't have used the Emma Lazarus quote in this post, since it has particular significance in the American civil rights movement. Saying "I didn't know that - I'm Canadian" is an explanation but not an excuse: it's ridiculous and offensive to equate PDFs with people, and I apologize for giving the impression that I was doing so.
This post originally appeared in the Software Carpentry blog.