Computational Competence for Biologists

Posted 2013-07-16

On July 8 and 9, I had the pleasure of taking part in a two-day workshop at SESYNC to discuss what we ought to teach biologists about computing. It was a relatively small meeting, but the participants spanned the range from computer scientists and systems engineers to bioinformaticians, field biologists, and a few odd ducks like me.

One of our group exercises was to design a proficiency exam to determine whether a biologist was computationally competent. This was inspired by the driver's license exam we've helped put together for the DiRAC supercomputing consortium, but we weren't seriously proposing such a test. Instead, we wanted to use it to focus discussion of what we all actually mean by "computational competence" when it comes to biology.

The five groups' submissions are included below. The most interesting thing for me was how discussion was dominated by data rather than computation. I was also struck by how much agreement there was between the groups, though this might have been a result of the way the question was posed. Other common themes included:

Documenting process for others
Reproducibility of results
Knowing how to test results
Managing errors
Posting to places like GitHub, Bitbucket, and Figshare (the concept was more important than the brand) to make work sustainable even when students move on
Database management: not just how to write queries, but also how to create a sensible schema

If this is what computationally skilled biologists think their peers need to know, then we need to rethink and rewrite some of our material. On the bright side, there's clearly an enthusiastic audience for what we're doing, and they clearly think we're making their lives better.

Group 1

Data

Does your data have any metadata?
If someone looks at your metadata, what can they say about it?
Is your metadata machine-readable?
Here is a metadata example. Fix it so that it is machine readable.
Can you (programmatically) access a remote computer and retrieve a set of data? [provide the details to connect and the dataset which needs to be accessed]

Programming

Here is a set of 400 text files of data that should contain only digits. Check whether they really do contain only digits.
Format this messy dataset in a standard data format of your choice (JSON, XML, CSV, or SQL).
Can you write a script that will not change the original dataset (we have both the original/raw data as well as the processed data)?
Here's a messy dataset and a lookup table…
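
The digits-only check above could be approached along the lines of the following minimal sketch. It assumes that "digits only" still allows whitespace and line breaks between numbers, and the `*.txt` pattern and the function names are hypothetical. The files are only read, never modified, which also keeps the raw data intact.

```python
from pathlib import Path


def non_digit_lines(path):
    """Return (line_number, text) pairs for lines containing non-digit characters.

    Assumption: whitespace between runs of digits is allowed; blank lines pass.
    """
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            tokens = line.split()
            # str.isdigit() accepts some non-ASCII digits, so also require ASCII.
            if any(not (t.isascii() and t.isdigit()) for t in tokens):
                problems.append((number, line.rstrip("\n")))
    return problems


def check_directory(directory, pattern="*.txt"):
    """Map each offending file name to its list of bad lines."""
    report = {}
    for path in sorted(Path(directory).glob(pattern)):
        bad = non_digit_lines(path)
        if bad:
            report[path.name] = bad
    return report
```

An empty report means every file passed; otherwise the report says which file and which line to look at.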

Testing

Can a reviewer of your paper re-run a set of tests and see the output?
Do you have any unit tests in your analysis?
Do all of your tests pass?
Is your analysis programmatic? If not, we can't accept that [for publishing a paper].

Reproducibility

Can you reproduce your own results?
Is your analysis scripted?
If April (she's an impatient nerd) looked at your paper's supplement, could she reproduce the same results by running a script or series of scripts?
Some kind of question about version control (Git?). How do you keep track of your work?
Is your work under any form of formal version control?

Group 2

1. Version control: we have created a Git repo containing a spreadsheet. Clone the repo to get the spreadsheet.
2. Data management: looking at the spreadsheet, draw the relational database schema you'd use to store the equivalent data (or write it as SQL).
3. Metadata: create human-readable documentation explaining the schema.
4. Data management: write a query that joins two tables from that database to pull out some data.
5. Programming: find and install a package in [language of choice] to do a specified calculation on the result of the query.
6. Programming: write a (very short) piece of code to create a visualization of the output of #5. Will include looping and randomization; required to write and call a function.
7. QA: find and fix the bug in the routine being used from the package from #5 (revealed by #6).
8. Shell: log in to a remote machine and run the fixed function on all data sets found there.
9. Version control: commit code to a version control repository.
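
The two data management steps in this exam (writing a schema as SQL, then joining two tables) might look something like the sketch below, which uses Python's built-in `sqlite3` module. The table names, column names, and sample rows are all hypothetical, made up for illustration; they are not taken from the workshop's spreadsheet.

```python
import sqlite3

# Hypothetical schema: sites, and the observations recorded at each site.
SCHEMA = """
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE observation (
    obs_id  INTEGER PRIMARY KEY,
    site_id INTEGER NOT NULL REFERENCES site(site_id),
    species TEXT NOT NULL,
    n_seen  INTEGER NOT NULL
);
"""


def build_example_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "North"), (2, "South")])
    conn.executemany(
        "INSERT INTO observation (site_id, species, n_seen) VALUES (?, ?, ?)",
        [(1, "deer", 4), (1, "tick", 30), (2, "deer", 7)],
    )
    return conn


def counts_by_site(conn, species):
    """Join observation to site and return (site name, count) for one species."""
    query = """
        SELECT site.name, observation.n_seen
        FROM observation
        JOIN site ON site.site_id = observation.site_id
        WHERE observation.species = ?
        ORDER BY site.name
    """
    return conn.execute(query, (species,)).fetchall()
```

Keeping site names in their own table means each name is typed once, which is the kind of "sensible schema" decision the exam is probing.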

Group 3

Here's a data sample. What would you need to fix in order to make it so you and others could use it?
Name and organize 3 data files (i.e., .csv, .dat, .txt)
Run this program on one of these files.
How would you capture that process for someone else to use?
Suppose you change the program. How do you convey that information?
Suppose someone sends you a changed version of the file/program. How do you interact with it?
How would you know that your program is doing what you want it to do?
How would you make your files available to others?
What additional data would you want to include?

Group 4

Here is a sample table. How can it be improved?
- Don't have multiple data points in a data field
- Separate a value from its units
- Consistent values in an ENUM column
- No garbage or weird descriptors for NULL
- Consistent units in a column
- Separators in a data field
- Headers
- This tests whether they understand how to get their data into a shape where it is ready to analyse
Which of these things can be done by a computer? (multiple choice?)
- e.g. "give me all the correlations between all the variables"
- but not "which of these correlations is most interesting, which is causal"
- This tests their understanding of what computational tools can help them with, and where they can't
Here is some data consisting of repeated measurements of a mouse spinning on its wheel at different time intervals
- Can you search for and replace a particular error in the dataset in an automated way, for instance by splitting multiple activity measurements recorded in the same field?
- reshaping going from short and fat to tall and skinny, splitting to
- What day was the second least active day for each mouse, and what was its activity level?
- This tests that you can structure your data properly in an automated way (so you don't have to do it manually every time)
Here is an initial training dataset. Use it to predict a phenomenon that will be seen in the test dataset. We'll then give you another dataset.
- Analyse the training dataset, and identify what you think are significant phenomena.
- Compare data from the two sets (both training and new). With this additional information, which phenomena are statistically supported by the new dataset?
  - e.g. which pairwise comparison do you think will reappear, i.e., A can always be distinguished from B, what is the signal to noise ratio, etc.
- This tests:
  - I understand the concept of overfitting
  - I understand signal to noise ratio problems
  - I understand how not to overpredict, significance
  - that they have actually made progress towards automation, exposes hardwiring etc.
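
The mouse-wheel questions above (splitting several measurements recorded in one field, reshaping to one record per measurement, and finding each mouse's second least active day) could be handled roughly as in this sketch. The data layout is an assumption: each row is a hypothetical `(mouse, day, field)` tuple in which `field` may hold several measurements separated by `;`. A day's activity is taken to be the sum of its measurements, and ties are broken by day label.

```python
from collections import defaultdict


def to_long(rows, separator=";"):
    """Split fields holding several measurements into one record per measurement."""
    long_rows = []
    for mouse, day, field in rows:
        for value in field.split(separator):
            value = value.strip()
            if value:  # skip empty pieces such as a trailing separator
                long_rows.append((mouse, day, float(value)))
    return long_rows


def second_least_active_day(long_rows):
    """Return {mouse: (day, total_activity)} for each mouse's second least active day."""
    totals = defaultdict(lambda: defaultdict(float))
    for mouse, day, value in long_rows:
        totals[mouse][day] += value
    result = {}
    for mouse, per_day in totals.items():
        # Sort by total activity, then by day label so ties are deterministic.
        ranked = sorted(per_day.items(), key=lambda item: (item[1], item[0]))
        if len(ranked) >= 2:  # a mouse with a single day has no second least
            result[mouse] = ranked[1]
    return result
```

Because both steps are functions, the same cleanup can be rerun on the next dataset instead of being redone by hand.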

Group 5

Given:

CSV of columns (deer count, county)
tab-separated text file of columns (tick count, county, human count)

do the following:

Produce a third file, CSV, of columns (county, deer, tick, human) using blank when missing from one file or other. Any language or tool, but must be significantly faster the second time with two new datasets than the first.
Make plots of tick vs deer, human vs deer, tick/deer vs human (and as a bonus, visualize them on a map)
Find Lyme disease data (online) and integrate it into plot (and as a bonus, add linear regression)
Write very detailed instructions and examples and post them on a blog (or GitHub page, etc) for a very basic reproducing user.
Post the code and instructions on GitHub
Post data & figures on Figshare (or data repository of choice)
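
The first task in this list, producing the combined file, might be sketched as follows using only Python's standard `csv` module. It assumes both input files start with a header row naming the columns `deer`, `county`, `tick`, and `human`; those header names and the function names are hypothetical, since the exam only specifies the column order. Wrapping the merge in a function is one way to meet the "faster the second time" requirement, because new datasets only need a new function call.

```python
import csv


def read_by_county(path, delimiter):
    """Read a delimited file with a header row into {county: row_dict}."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {row["county"]: row for row in reader}


def merge_counts(deer_csv, tick_tsv, out_csv):
    """Write (county, deer, tick, human), leaving blanks where a file lacks a county."""
    deer = read_by_county(deer_csv, ",")
    tick = read_by_county(tick_tsv, "\t")
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["county", "deer", "tick", "human"])
        # Union of counties, so a county present in only one file is kept.
        for county in sorted(deer.keys() | tick.keys()):
            writer.writerow([
                county,
                deer.get(county, {}).get("deer", ""),
                tick.get(county, {}).get("tick", ""),
                tick.get(county, {}).get("human", ""),
            ])
```

Note that the input files are opened read-only and the result goes to a separate output path, so the original data is never overwritten.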

Given:

A Galaxy workflow

do the following:

Copy this workflow and rerun on a different dataset
Given a dataset with a slight difference, clean the dataset to prepare it for the workflow, then run it
Change the workflow at step 3 of 17 (instead of filtering out counties < 100000 people, do < 1M), or to produce a different output

Given:

A script (R?) to read and plot data (steps 1&2)

do the following:

The script crashes (or produces wrong output) when data is missing from one file. Modify it to work.

Categories: software-carpentry