The W3C's Provenance Working Group recently published a new draft of its proposed standard for tracking provenance on the web. It's pretty dense stuff: even the primer, which uses the word "intuitive" five times to describe itself, is hard to follow if you've never been immersed in Dublin Core, Turtle, and the like. That isn't a criticism (this stuff is intrinsically hard), but I think most scientists won't be able to see the forest for the trees.
Which raises a question: if this is the C, where's the Python? I don't mean, "Where are the libraries?", but rather, if this is the low-level detailed language for describing provenance, where's the 80/20 version that'll do what most people need with much less palaver? Loren Shure and I talked about this briefly at the recent ICERM meeting, and if we can put something together that:
- is a strict subset of the W3C proposal,
- works with a variety of file formats (e.g., CSV, JSON, MAT, HDF5, PDF, and PNG), and
- requires people to add no more than a couple of function calls to their code
then I think we could actually get people to adopt it.
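To make "a couple of function calls" concrete, here's a rough sketch of what that might feel like for the easiest case, a JSON output file. Everything in it is made up for illustration: the function names (`save_json_with_provenance`, `load_provenance`), the record layout, and the choice of a SHA-1 hash to identify inputs are my assumptions, not anything from the W3C draft. The record borrows the PROV words "activity" and "used", but I haven't checked it against the spec, so don't read it as a valid subset.

```python
import hashlib
import json
import os
from datetime import datetime, timezone


def _sha1(path):
    """Return the SHA-1 hex digest of a file's contents."""
    with open(path, "rb") as reader:
        return hashlib.sha1(reader.read()).hexdigest()


def provenance_record(script, inputs):
    """Build a small provenance record (hypothetical layout).

    The keys borrow the PROV terms 'activity' and 'used', but the
    structure is an illustration, not a validated PROV serialization.
    """
    return {
        "activity": {
            "script": script,
            "endedAt": datetime.now(timezone.utc).isoformat(),
        },
        "used": [
            {"entity": os.path.basename(path), "sha1": _sha1(path)}
            for path in inputs
        ],
    }


def save_json_with_provenance(path, data, script, inputs):
    """Write data as JSON with its provenance record in the same file."""
    payload = {"data": data, "provenance": provenance_record(script, inputs)}
    with open(path, "w") as writer:
        json.dump(payload, writer, indent=2)


def load_provenance(path):
    """Read back only the provenance record from a file written above."""
    with open(path) as reader:
        return json.load(reader)["provenance"]
```

In an analysis script, the only visible change would be swapping the usual `json.dump` for something like `save_json_with_provenance("result.json", results, "analyze.py", ["raw.csv"])`. Formats like HDF5 or PNG would need their own way of carrying the record, which this sketch doesn't attempt.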
Later: In reply to some early feedback, I think provenance needs to be stored in the files themselves, rather than beside the data, so that it's easier to move from place to place. I've been wrong before, though... :-)
Originally posted at Software Carpentry.