class: slide-title
Software Design by Example
A Build Manager
chapter
--- ## The Problem - `plot.py` produces `result.svg` from `collated.csv` - `analyze.py` produces `collated.csv` from `samples.csv` and `controls.csv` - Both `samples.csv` and `controls.csv` depends on `normalize.py` and `raw.csv`, but `normalize.py` takes a long time to run - How can we regenerate the files we need, but only when we need them? --- ## Make -
Build managers
keep track of which files depend on which - First tool of this kind was [Make][gnu_make] - Many others now exist (e.g., [Snakemake][snakemake]) - If a
target
is
stale
with respect to any of its
dependencies
, run a
recipe
to refresh it - Run recipes in order - Run each recipe at most once --- ## Terminology - Targets and dependencies must form a
directed acyclic graph
- A
topological ordering
of a graph arranges nodes so that each one comes after everything it depends on
--- ## Representing Rules 1. Invent a special-purpose syntax - Fits the problem - But you need a parser, auto-complete, a debugger, etc. 2. Use existing syntax - Get tooling for free - But the semantics are invisible to those tools - We will use JSON ```json { "A": {"depends": ["B"], "rule": "build A"}, "B": {"depends": [], "rule": "build B"} } ``` --- ## Top-Down Design - Start with the big picture ```py class BuildBase: def build(self, config_file): config = self._configure(config_file) ordered = self._topo_sort(config) for node in ordered: self._refresh(config, node) def _refresh(self, config, node): assert node in config, f"Unknown node {node}" print(config[node]["rule"]) ``` -- 1. Get the configuration 2. Figure out the update order 3. Refresh each file (for now, just print action) --- ## Configuration ```py def _configure(self, config_file): with open(config_file, "r") as reader: config = json.load(reader) known = set(config.keys()) return { n: self._check(n, d, known) for n, d in config.items() } ``` - Use a dictionary comprehension - Key is node name (for lookup) - Value contains rule and dependencies --- ## Check and Build ```py def _check(self, name, details, known): assert "rule" in details, f"Missing rule for {name}" assert "depends" in details, f"Missing depends for {name}" depends = set(details["depends"]) assert depends.issubset(known), \ f"Unknown depends for {name}" return {"rule": details["rule"], "depends": depends} ```
--- ## Topological Sorting
--- ## Topological Sorting ```py def _topo_sort(self, config): graph = {n: config[n]["depends"] for n in config} result = [] while graph: available = {n for n in graph if not graph[n]} assert available, f"Circular graph {graph.keys()}" result.extend(available) graph = { n: graph[n] - available for n in graph if n not in available } return result ``` --- ## Testing ```json { "A": {"depends": ["B"], "rule": "build A"}, "B": {"depends": [], "rule": "build B"} } ``` ``` build B build A ``` --- ## Critique 1. Configuration might not come directly from a JSON file - So modify constructor to take config as input 2. Printing actions to the screen isn't very useful - So collect them and return an ordered list of commands 3. `assert` isn't a friendly way to handle user errors - Raise `ValueError` 4. Topological sort isn't
stable
- `dict` is ordered but `set` is not - So sort node names when appending 5. We might want to add other keys to rules - So put that check in a separate method we can override --- ## Revise the Big Picture ```py class BuildBetter: def build(self, config): config = self._configure(config) ordered = self._topo_sort(config) actions = [] for node in ordered: self._refresh(config, node, actions) return actions def _refresh(self, config, node, actions): assert node in config, f"Unknown node {node}" actions.append(config[node]["rule"]) def _must(self, condition, message): if not condition: raise ValueError(message) ``` --- ## Revise Configuration ```py def _configure(self, config): known = set(config.keys()) return {n: self._check(n, d, known) for n, d in config.items()} def _check(self, name, details, known): self._check_keys(name, details) depends = set(details["depends"]) self._must( depends.issubset(known), f"Unknown depends for {name}" ) result = details.copy() result["depends"] = depends return result def _check_keys(self, name, details): self._must("rule" in details, f"Missing rule for {name}") self._must( "depends" in details, f"Missing depends for {name}" ) ``` --- ## Revise Topological Sort ```py def _topo_sort(self, config): graph = {n: config[n]["depends"] for n in config} result = [] while graph: available = {n for n in graph if not graph[n]} self._must( available, f"Circular graph {list(graph.keys())}", ) result.extend(sorted(available)) graph = { n: graph[n] - available for n in graph if n not in available } return result ``` --- ## Now It's Testable ```py def test_circular(): action_A = "build A" action_B = "build B" config = { "A": {"depends": ["B"], "rule": action_A}, "B": {"depends": ["A"], "rule": action_B}, } try: Builder().build(config) assert False, "should have had exception" except ValueError: pass ``` --- ## Now It's Testable ```py def test_no_dep(): action_A = "build A" action_B = "build B" config = { "A": {"depends": [], "rule": action_A}, "B": {"depends": [], "rule": action_B}, } assert Builder().build(config) == [action_A, action_B] ``` --- ## And Extensible ```py class BuildTime(BuildBetter): def _check_keys(self, name, details): super()._check_keys(name, details) self._must("time" in details, f"No time for {name}") def _refresh(self, config, node, actions): assert node in config, f"Unknown node {node}" if self._needs_update(config, node): actions.append(config[node]["rule"]) def _needs_update(self, config, node): return any( config[node]["time"] < config[d]["time"] for d in config[node]["depends"] ) ``` --- ## More Testing ```py def test_diamond_dep(): action_A = "build A" action_B = "build B" action_C = "build C" action_D = "build D" config = { "A": {"depends": ["B", "C"], "rule": action_A, "time": 0}, "B": {"depends": ["D"], "rule": action_B, "time": 0}, "C": {"depends": ["D"], "rule": action_C, "time": 1}, "D": {"depends": [], "rule": action_D, "time": 1}, } assert Builder().build(config) == [action_B, action_A] ``` --- class: summary ## Summary
[gnu_make]: https://www.gnu.org/software/make/ [snakemake]: https://snakemake.readthedocs.io/