One of the cornerstones of modern learning theory is legitimate peripheral participation: the idea that newcomers become members of a community of practice by participating in low-risk/low-effort tasks that are genuinely useful to the community while giving the newcomers a chance to learn vocabulary, meet people, and internalize the community’s unwritten rules. Driving people to polling booths on election day is an example: it’s important, but it doesn’t require a lot of training or a long-term commitment to the party.

During a conversation with Dan Sholler yesterday, we realized that there’s an important difference between open source software development and open data. A lot of open source projects have internalized the idea of legitimate peripheral participation, and work hard to ensure that newcomers can get set up easily and find entry-level tasks to work on. (The OpenHatch project did a wonderful job of matching newcomers to projects; it’s a real shame it has wound down.)

Our realization was that there isn’t anything like this for open data, at least not for graduate students and other “serious” scientists. As a layperson, I can contribute to any number of citizen science projects by counting birds, sampling my local water supply, or identifying galaxies on my cellphone. But what’s the next step? If I’m a graduate student, I can upload the data I’ve collected to one of many different sites for other people to use, but that’s like saying that as a programmer, I can write a brand new package for Python. Most open source contributors hone their skills by contributing patches to existing projects first; what’s the equivalent of small, incremental contributions to a community dataset for people who are training to be full-time scientists, but are still on the periphery of their community?