Steps in Order

Our book Research Software Engineering with Python has finally gone to the publisher. We will undoubtedly still have to make revisions, but the ideas and examples are complete. Our intended readers are captured in these three learner personas:

  • Amira Khan completed a master’s in library science five years ago and has since worked for a small aid organization. She did some statistics during her degree, and has learned some R and Python by doing data science courses online, but has no formal training in programming. Amira would like to tidy up the scripts, data sets, and reports she has created in order to share them with her colleagues. This book will show her how to do this.

  • Jun Hsu completed an Insight Data Science fellowship last year after doing a PhD in Geology and now works for a company that does forensic audits. He uses a variety of machine learning and visualization packages, and would now like to turn some of his own work into an open source project. This book will show him how such a project should be organized and how to encourage people to contribute to it.

  • Sami Virtanen became a competent programmer during a bachelor’s degree in applied math and was then hired by the university’s research computing center. The kinds of applications they are being asked to support have shifted from fluid dynamics to data analysis; this guide will teach them how to build and run data pipelines so that they can pass those skills on to their users.

We organized the book around a running example: verifying Zipf’s Law using a set of classic English novels in an open, reproducible, and sustainable way. (People often conflate these three ideas, but they are distinct.) To do that, we teach readers to do these things:

  1. Use the Unix shell to efficiently manage data and code.
  2. Organize small and medium-sized data science projects.
  3. Write Python programs that can be used on the command line.
  4. Use Git and GitHub to track and share work.
  5. Work productively in a small team where everyone is welcome.
  6. Use Make to automate complex workflows.
  7. Enable users to configure software without modifying it directly.
  8. Test software and know which parts have not yet been tested.
  9. Find, handle, and fix errors in code.
  10. Publish code and research in open and reproducible ways.
  11. Create Python packages that can be installed in standard ways.
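
To give a flavor of the running example mentioned above: Zipf’s Law says that a word’s frequency in a large body of text is roughly inversely proportional to its rank, so rank times count should stay roughly constant. The sketch below is only an illustration, not code from the book; the function names and the toy sample text are made up for this post.

```python
import re
from collections import Counter


def count_words(text):
    """Count occurrences of each lowercase word in a string."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def rank_frequency(counts):
    """Return (rank, word, count) tuples, most frequent word first."""
    ordered = counts.most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_products(ranked):
    """Return rank * count per word; Zipf's Law predicts similar values."""
    return [rank * count for rank, _, count in ranked]


# Toy sample (hypothetical): "the" six times, "of" three times, "and" twice.
sample = "the " * 6 + "of " * 3 + "and " * 2
print(zipf_products(rank_frequency(count_words(sample))))  # [6, 6, 6]
```

Real novels will not fit this neatly, which is part of what makes the example worth returning to throughout the book.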

The order is important because later skills depend on earlier ones, but also because we want people to be able to stop part way through and still have a workable research process. If you only get through the fourth chapter, for example, you’ll be able to back up your work, share it with others, and re-run analyses with a single command. Another chapter and you’ll be ready to collaborate in a team; one more after that, and you’ll be able to capture your workflows in re-runnable ways.

We don’t believe any one book can serve everyone’s needs, but we hope this one will help people who already know how to write a bit of code figure out what to learn next and what “done” looks like. The HTML version of the book will stay online for free and forever; we’ll advertise the printed and e-book versions as soon as they become available. If you find it useful, please let us know (and please also let us know about any errors or murky wording you stumble over).

This book has been over twenty years in the making; many thanks to my co-authors Damien Irving, Kate Hertweck, Luke Johnston, Joel Ostblom, and Charlotte Wickham for finally making it a reality. All royalties from the book will be donated to The Carpentries, an organization that teaches foundational coding and data science skills to researchers worldwide.
