Miscellaneous Projects

This post is a bit of a link fest, but after talking about how to contribute at yesterday's instructor retreat, I thought it might be useful to post a few additions to our projects page.

  • In a "Parsons Problem", you are given the pieces of the solution and asked to arrange them in the correct order. For example, given the words "cat", "eating", "quietly", "is", and "the", you could be asked to construct a grammatically correct sentence that explains what the cat is doing. js-parsons is a JavaScript library for building Parsons Problems for programming. It would be cool to incorporate it into our lessons (and to see whether it's easier to auto-grade solutions to such exercises than solutions written from scratch).

  • A growing number of programmers are coding live online so that people can see what they do and hear how they think. (My favorite is Mike Conley hacking on Firefox.) I would really like to see some Carpentry instructors livecast some data analysis and coding sessions—volunteers?

  • daff is an R package that can find differences in values between data.frames, store those differences, render them, and apply them as a patch to a data.frame. It can also merge two versions of a data.frame that have a common parent. I've been saying for years that better support for diffing and merging things would make it a lot easier for scientists (and everyone else) to adopt version control; is daff a step toward this? If so, should we teach it in our workshops?

  • I really like our lesson on SQL, but as non-relational databases become more popular, I wonder if it's time for us to cover them as well. Setup wouldn't be as big an obstacle as it once was, thanks to libraries like tinydb. And hey, we can even run SQL queries directly on CSV files.

  • It would also be useful to integrate this comparison of SQL and Pandas into our lessons. (Of course, we'd have to integrate Pandas first...) There's also this NumPy for R guide, this matrix cheatsheet, and I'm sure we can find others.

  • And speaking of lessons, it would be great to have more people contribute to Fiona Tweedie's lesson on natural language processing (possibly incorporating material from Allison Parrish). I don't know if this should be combined with a user-friendly replacement for our old lesson on regular expressions, but it should definitely include this example of the problems with manually-entered data.

  • I think our learners would also like a couple of hours on image processing (the Version 4 lesson on multimedia is woefully out of date, and was never particularly well organized)...

  • ...but most of all I want a lesson on how to write and publish in the early 21st Century. Word or LibreOffice? Version control can't handle them. Markdown or LaTeX? A lot of scientists are simply never going to adopt them, even with help like this. Are Authorea or Google Docs the happy medium? We might not be able to offer a single best solution, but we should at least be able to explain the tradeoffs. And that's just authoring: if we really want to help, we should also explain Depsy, Hypothes.is, Zenodo, ORCID, new publishing models, and all the other cool things that the one percenters of open science now take for granted but their colleagues have never used.

  • OK, I wasn't done with lesson ideas. I think Abela's chart chooser (adapted from Zelazny's Saying It With Charts) would be a brilliant way to organize a lesson on data visualization: it's a useful medium between Tufte's high-level pontificating (I never know what to do after reading his books) and teaching people how to center labels under the ticks on the X axis, and scientific publishers would be very grateful if more authors followed these rules. Bonus marks for integrating Heer's tour of the visualization zoo into the lesson as well.

  • The errors and exceptions topic in our standard Python lesson is really useful: programmers at all levels spend much (or most) of their time trying to diagnose and correct errors, but most textbooks and courses give the topic short shrift. (If you want a really cool project, create a new programming language with easy-to-understand errors as the most important design criterion.) I'd like to incorporate things like What the What, explainshell, and similar tools into our teaching. (I'd also like to add something like Sumana Harihareswara's "Inessential Weirdnesses in Open Source", just so that learners know this stuff really is harder than it needs to be.)

  • Would anyone like to go through the latest recommendations for data management plans, compare what they're asking for with what we teach, and file issues to fill in the gaps? One useful goal of our workshops would be to give attendees the understanding needed to write and critique data management or software sustainability plans. DataONE's Data Management Guide for Public Participation in Scientific Research might also figure into this.

  • We'd like to rebuild our web site using the same "Feeling Responsive" theme that the Data Carpentry site uses. We've made good progress, but there's a lot left to do. If you have good CSS and web design skills, or if you're up for a bit of Python hacking, this is the job for you.
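
As a rough illustration of the auto-grading idea in the Parsons Problem item above, here is a minimal Python sketch. It does not use js-parsons; the function names (make_parsons_problem, grade) and the sample solution are hypothetical, and it assumes each problem has exactly one correct ordering, which real exercises may not.

```python
import random

# Hypothetical sample solution: one line of code per piece.
SOLUTION = [
    "total = 0",
    "for value in [1, 2, 3]:",
    "    total += value",
    "print(total)",
]


def make_parsons_problem(solution_lines, seed=None):
    """Return a shuffled copy of the solution's lines for the learner to reorder."""
    rng = random.Random(seed)
    pieces = list(solution_lines)
    rng.shuffle(pieces)
    return pieces


def grade(attempt, solution_lines):
    """Compare an attempt with the single expected ordering (an assumption).

    Returns (is_correct, index_of_first_wrong_piece_or_None).
    """
    for index, (got, want) in enumerate(zip(attempt, solution_lines)):
        if got != want:
            return False, index
    if len(attempt) != len(solution_lines):
        # Pieces are missing or extra past the common prefix.
        return False, min(len(attempt), len(solution_lines))
    return True, None
```

Grading here is just a position-by-position comparison, which hints at why such exercises might be easier to check automatically than free-form code: there is no need to run or parse the learner's answer.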

In the wake of posts about Shopify's support for white nationalists and DataCamp's attempts to cover up sexual harassment, I have had to disable comments on this blog. Please email me if you'd like to get in touch.