I have to fix two bugs in the examples,
draw 91 diagrams,
fill in an appendix on cognitive load,
and then revise all 52,000 words,
but the first draft is done.
Feedback would be greatly appreciated:
you can mail me
or file issues in the book’s GitHub repository.
Using callbacks to manipulate files and directories
Using promises to manage delayed computation
Testing software piece by piece
Archiving files with directory structure
Loading, saving, and manipulating tables efficiently
Using patterns to find things in data
Turning text into code
Generating HTML pages from templates
Updating files that depend on other files
Figuring out what goes where in a web page
Managing source files that have been broken into pieces
Loading source files as modules
Checking that code conforms to style guidelines
Modifying code to track coverage and execution times
Generating documentation from comments embedded in code
Turning many files into one
Getting and installing packages
Assembling and running low-level code
Running programs under the control of a breakpointing debugger
By popular request,
here is my spouse’s recipe for pickled carrots,
which is derived from the one in Topp and Howard’s excellent
The Complete Book of Small-Batch Preserving.
Note that this is a canning recipe, not a cold pickle.
Note also that the original recipe calls for only ¼ tsp of hot pepper flakes per jar,
to which we say, “Bah.”
finely chopped fresh oregano or 1 tbsp (15 mL) dried
hot pepper flakes (per jar)
small cloves garlic
peeled baby carrots
1 ½ cups
Remove the hot jars from the canner and add one garlic clove and your desired amount of hot pepper flakes to each jar.
Pack in the carrots (see picture), leaving 1 cm (about ½ inch) of head space.
Combine vinegar, sugar, water, and salt in a small saucepan and bring to a boil.
Pour the hot liquid over the carrots, filling to within 1 cm (½ inch) of the top.
Process for canning: 15 minutes for 500 mL jars.
A 2 lb (approx. 1 kg) bag of baby carrots will make 1750 mL of pickle.
We usually use 250 mL wide-mouth jars;
they make great gifts.
Amira completed a master’s in library science five years ago
and has since worked for a small aid organization.
She did some statistics during her degree,
and has learned some R and Python by doing data science courses online,
but has no formal training in programming.
Amira would like to tidy up the scripts, data sets, and reports she has created
in order to share them with her colleagues.
These lessons will show her how to do this.
completed an Insight Data Science fellowship last year after doing a PhD in geology
and now works for a company that does forensic audits.
He uses a variety of machine learning and visualization packages,
and would now like to turn some of his own work into an open source project.
This book will show him how such a project should be organized
and how to encourage people to contribute to it.
became a competent programmer during a bachelor’s degree in applied math
and was then hired by the university’s research computing center.
The kinds of applications they are being asked to support
have shifted from fluid dynamics to data analysis;
this guide will teach them how to build and run data pipelines
so that they can pass those skills on to their users.
We organized the book around a running example:
the verification of Zipf’s Law
using a set of classic English novels
in an open, reproducible, and sustainable way.
(People often conflate these three ideas,
but they are distinct.)
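As a rough sketch of what the running example involves, the snippet below tallies word frequencies and checks how closely they follow Zipf's Law; the function names here are invented for illustration and are not taken from the book's code.

```python
import re
from collections import Counter

def word_counts(text):
    """Count case-insensitive word frequencies in a block of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def zipf_table(counts, top_n=10):
    """Return (rank, count, rank * count) for the top_n most common words.

    If Zipf's Law roughly holds, rank * count stays roughly constant.
    """
    return [(rank, count, rank * count)
            for rank, (_word, count) in enumerate(counts.most_common(top_n), start=1)]
```

Run over the text of a full novel, the third column should stay within the same order of magnitude if the law approximately holds.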
To do that,
we teach readers to do these things:
Use the Unix shell to efficiently manage data and code.
Organize small and medium-sized data science projects.
Write Python programs that can be used on the command line.
Use Git and GitHub to track and share work.
Work productively in a small team where everyone is welcome.
Use Make to automate complex workflows.
Enable users to configure software without modifying it directly.
Test software and know which parts have not yet been tested.
Find, handle, and fix errors in code.
Publish code and research in open and reproducible ways.
Create Python packages that can be installed in standard ways.
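As an illustration of the command-line skill in the list above, here is a minimal argparse-based script; the word-counting task and every name in it are made up for this example rather than drawn from the book.

```python
import argparse
import sys
from collections import Counter

def count_words(text):
    """Count whitespace-separated words, ignoring case."""
    return Counter(text.lower().split())

def main():
    parser = argparse.ArgumentParser(description="Show the most common words in a file.")
    parser.add_argument("infile", type=argparse.FileType("r"),
                        help="text file to read")
    parser.add_argument("-n", "--num", type=int, default=10,
                        help="how many words to show")
    args = parser.parse_args()
    for word, count in count_words(args.infile.read()).most_common(args.num):
        print(f"{count}\t{word}")

# Only parse the command line when a file argument was actually given,
# so the functions above can be imported and tested on their own.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Invoked as, say, `python wordcount.py novel.txt -n 5` (a hypothetical filename), it prints the five most common words with their counts.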
The order is important because later skills depend on earlier ones,
but also because we want people to be able to stop part way through
and still have a workable research process.
If you only get through the fourth chapter,
you’ll be able to back up your work,
share it with others,
and re-run analyses with a single command.
Another chapter and you’ll be ready to collaborate in a team;
one more after that,
and you’ll be able to capture your workflows in re-runnable ways.
We don’t believe any one book can serve everyone’s needs,
but we hope this one will help people who already know how to write a bit of code
figure out what to learn next and what “done” looks like.
The HTML version of the book will stay online for free and forever;
we’ll advertise the printed and e-book versions as soon as they become available.
If you find it useful,
please let us know
(and please also let us know about any errors or murky wording you stumble over).
In the second half of last year (and doesn’t it feel so good to call 2020 “last year”?)
a group of us created and translated some concept maps for various topics in data science.
What we have is online at https://github.com/rstudio/concept-maps/ under a Creative Commons license;
we hope you find them useful.
If you’d like to contribute, there are instructions on the site.
And if you’d like to make contributing much easier,
please take a look at this issue for http://diagrams.net:
maintaining translations would be a lot easier if we could store alternate sets of text in drawings…