Replacing Make
I’d like to find a laptop-scale build/workflow tool to teach to data scientists instead of Make. I’ve looked at Snakemake, doit, invoke, Metaflow, and several others, but none quite meet my criteria:
- Workflow descriptions written in pure Python (not a syntactic superset and not YAML).
- Supports file dependencies and pattern rules (see the example below).
- Easy to see which files are required or produced by which rules.
- Can run shell commands as well as Python code. (Yes, I know that anything capable of running Python can use it to launch a shell command, but I’d like that out of the box.)
- Actively maintained.
Here’s the kind of example I’d like to translate from Make to whatever; if you know of something that’ll do the job, please give me a shout.
# Pattern-match on all CSV files in the 'raw' directory.
RAW_FILES := $(wildcard raw/*.csv)
# Translate the names of the raw files into names of cooked files.
DATA_FILES := $(patsubst raw/%.csv,data/%.csv,${RAW_FILES})
# Define a single value for a variable explicitly.
SUMMARY_FILE := data/summary.csv
# Tell Make that 'all' and 'clean' don't alter any files.
.PHONY: all clean
# Define a default target that depends on the summary file, so that
# 'make' or 'make all' will try to regenerate it.
all: ${SUMMARY_FILE}
# Regenerate the summary file from the cooked data files. This
# command only runs if the summary is out of date compared to any of
# the cooked data files.
${SUMMARY_FILE}: ${DATA_FILES} bin/summarize.py
	python bin/summarize.py ${DATA_FILES} > ${SUMMARY_FILE}
# Pattern rule to regenerate a cooked data file from a raw data
# file. The syntax of Make's automatic variables is horrible, but it's
# hard to argue with the simplicity of the rule itself.
data/%.csv: raw/%.csv bin/cook.py
	python bin/cook.py < $< > $@
# Clean up.
clean:
	rm -f ${SUMMARY_FILE} ${DATA_FILES}
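For comparison, here is a rough sketch of what the Makefile above does, written in plain Python with only the standard library. It is not the API of any existing tool: the helper names (`out_of_date`, `cooked_name`, `run_shell`, `build`) are made up for illustration, and it assumes the same `raw/`, `data/`, and `bin/` layout as the Makefile. It compares timestamps the way Make does, but it has no dependency graph, no parallelism, and no `clean` target.

```python
import shlex
import subprocess
from pathlib import Path


def out_of_date(target, deps):
    """Return True if target is missing or older than any dependency."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(d).stat().st_mtime > target_time for d in deps)


def cooked_name(raw_path, raw_dir="raw", data_dir="data"):
    """Map raw/NAME.csv to data/NAME.csv, much like Make's patsubst."""
    return Path(data_dir) / Path(raw_path).relative_to(raw_dir)


def quoted(path):
    """Quote a path so it is safe to paste into a shell command."""
    return shlex.quote(str(path))


def run_shell(command):
    """Echo a shell command and run it, stopping on failure as Make does."""
    print(command)
    subprocess.run(command, shell=True, check=True)


def build(raw_dir="raw", data_dir="data", bin_dir="bin"):
    """Rebuild cooked files and the summary if they are out of date."""
    raw_files = sorted(Path(raw_dir).glob("*.csv"))
    data_files = [cooked_name(r, raw_dir, data_dir) for r in raw_files]
    summary = Path(data_dir) / "summary.csv"
    Path(data_dir).mkdir(exist_ok=True)

    # Pattern rule: data/%.csv depends on raw/%.csv and bin/cook.py.
    cook = Path(bin_dir) / "cook.py"
    for raw, cooked in zip(raw_files, data_files):
        if out_of_date(cooked, [raw, cook]):
            run_shell(f"python {quoted(cook)} < {quoted(raw)} > {quoted(cooked)}")

    # Summary rule: depends on every cooked file and bin/summarize.py.
    summarize = Path(bin_dir) / "summarize.py"
    if out_of_date(summary, [*data_files, summarize]):
        names = " ".join(quoted(f) for f in data_files)
        run_shell(f"python {quoted(summarize)} {names} > {quoted(summary)}")
```

Calling `build()` from the project root plays the role of `make all`. Even at this size, the dependency information is buried inside control flow, which is exactly what I'd like a tool to make visible.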