Software Design by Example: A Build Manager

class: slide-title

<p>
  <a href="https://third-bit.com/sdxpy/">Software Design by Example</a>
</p>
<h1>A Build Manager</h1>
<div class="bottom">
  <a href="../">chapter</a>
</div>

---

## The Problem

-   `plot.py` produces `result.svg` from `collated.csv`

-   `analyze.py` produces `collated.csv` from `samples.csv` and `controls.csv`

-   Both `samples.csv` and `controls.csv` depends on `normalize.py` and `raw.csv`,
    but `normalize.py` takes a long time to run

-   How can we regenerate the files we need, but only when we need them?

---

## Make

-   <a class="gl-ref" href="../../glossary/#gl:build_manager" title="A program that keeps track of how files depend on one another and runs commands to update any files that are out-of-date. Build managers were invented to compile only those parts of programs that had changed but are now often used to implement workflows in which plots depend on results files, which in turn depend on raw data files or configuration files." markdown="1">Build managers</a>
    keep track of which files depend on which

-   First tool of this kind was [Make][gnu_make]

-   Many others now exist (e.g., [Snakemake][snakemake])

-   If a <a class="gl-ref" href="../../glossary/#gl:build_target" title="The file(s) that a build rule will update if they are out-of-date compared to their dependencies." markdown="1">target</a> is <a class="gl-ref" href="../../glossary/#gl:build_stale" title="To be out-of-date compared to a prerequisite. A build manager finds and updates things that are stale." markdown="1">stale</a>
    with respect to any of its  <a class="gl-ref" href="../../glossary/#gl:dependency" title="Something that a build target depends on." markdown="1">dependencies</a>,
    run a <a class="gl-ref" href="../../glossary/#gl:build_recipe" title="The part of a build rule that describes how to update something that has fallen out-of-date." markdown="1">recipe</a> to refresh it

-   Run recipes in order

-   Run each recipe at most once

---

## Terminology

-   Targets and dependencies must form a <a class="gl-ref" href="../../glossary/#gl:dag" title="A directed graph which does not contain any cycles (i.e., it is not possible to reach a node from itself by following edges)." markdown="1">directed acyclic graph</a>

-   A <a class="gl-ref" href="../../glossary/#gl:topological_order" title="Any ordering of the nodes in a graph that respects the direction of its edges, i.e., if there is an edge from node A to node B, A comes before B in the ordering. There may be many topological orderings of a particular graph." markdown="1">topological ordering</a> of a graph
    arranges nodes so that each one comes after everything it depends on

---

## Representing Rules

1.  Invent a special-purpose syntax
    -   Fits the problem
    -   But you need a parser, auto-complete, a debugger, etc.
2.  Use existing syntax
    -   Get tooling for free
    -   But the semantics are invisible to those tools
-   We will use JSON

```json
{
  "A": {"depends": ["B"], "rule": "build A"},
  "B": {"depends": [], "rule": "build B"}
}
```

---

## Top-Down Design

-   Start with the big picture

```py
class BuildBase:
    def build(self, config_file):
        config = self._configure(config_file)
        ordered = self._topo_sort(config)
        for node in ordered:
            self._refresh(config, node)

def _refresh(self, config, node):
        assert node in config, f"Unknown node {node}"
        print(config[node]["rule"])
```

1.  Get the configuration
2.  Figure out the update order
3.  Refresh each file (for now, just print action)

---

## Configuration

```py
def _configure(self, config_file):
    with open(config_file, "r") as reader:
        config = json.load(reader)
        known = set(config.keys())
        return {
            n: self._check(n, d, known)
            for n, d in config.items()
        }
```

-   Use a dictionary comprehension
    -   Key is node name (for lookup)
    -   Value contains rule and dependencies

---

## Check and Build

```py
def _check(self, name, details, known):
    assert "rule" in details, f"Missing rule for {name}"
    assert "depends" in details, f"Missing depends for {name}"
    depends = set(details["depends"])
    assert depends.issubset(known), \
        f"Unknown depends for {name}"
    return {"rule": details["rule"], "depends": depends}
```

---

## Topological Sorting

---

## Topological Sorting

```py
def _topo_sort(self, config):
    graph = {n: config[n]["depends"] for n in config}
    result = []
    while graph:
        available = {n for n in graph if not graph[n]}
        assert available, f"Circular graph {graph.keys()}"
        result.extend(available)
        graph = {
            n: graph[n] - available
            for n in graph
            if n not in available
        }
    return result
```

---

## Testing

```json
{
  "A": {"depends": ["B"], "rule": "build A"},
  "B": {"depends": [], "rule": "build B"}
}
```

```
build B
build A
```

---

## Critique

1.  Configuration might not come directly from a JSON file
    -   So modify constructor to take config as input
2.  Printing actions to the screen isn't very useful
    -   So collect them and return an ordered list of commands
3.  `assert` isn't a friendly way to handle user errors
    -   Raise `ValueError`
4.  Topological sort isn't <a class="gl-ref" href="../../glossary/#gl:stable_sort" title="A sorting algorithm that preserves the original order of items that are considered equal." markdown="1">stable</a>
    -   `dict` is ordered but `set` is not
    -   So sort node names when appending
5.  We might want to add other keys to rules
    -   So put that check in a separate method we can override

---

## Revise the Big Picture

```py
class BuildBetter:
    def build(self, config):
        config = self._configure(config)
        ordered = self._topo_sort(config)
        actions = []
        for node in ordered:
            self._refresh(config, node, actions)
        return actions

def _refresh(self, config, node, actions):
        assert node in config, f"Unknown node {node}"
        actions.append(config[node]["rule"])

def _must(self, condition, message):
        if not condition:
            raise ValueError(message)
```

---

## Revise Configuration

```py
def _configure(self, config):
    known = set(config.keys())
    return {n: self._check(n, d, known)
            for n, d in config.items()}

def _check(self, name, details, known):
    self._check_keys(name, details)
    depends = set(details["depends"])
    self._must(
        depends.issubset(known), f"Unknown depends for {name}"
    )
    result = details.copy()
    result["depends"] = depends
    return result

def _check_keys(self, name, details):
    self._must("rule" in details, f"Missing rule for {name}")
    self._must(
        "depends" in details, f"Missing depends for {name}"
    )
```

---

## Revise Topological Sort

```py
def _topo_sort(self, config):
    graph = {n: config[n]["depends"] for n in config}
    result = []
    while graph:
        available = {n for n in graph if not graph[n]}
        self._must(
            available,
            f"Circular graph {list(graph.keys())}",
        )
        result.extend(sorted(available))
        graph = {
            n: graph[n] - available
            for n in graph
            if n not in available
        }
    return result
```

---

## Now It's Testable

```py
def test_circular():
    action_A = "build A"
    action_B = "build B"
    config = {
        "A": {"depends": ["B"], "rule": action_A},
        "B": {"depends": ["A"], "rule": action_B},
    }
    try:
        Builder().build(config)
        assert False, "should have had exception"
    except ValueError:
        pass
```

---

## Now It's Testable

```py
def test_no_dep():
    action_A = "build A"
    action_B = "build B"
    config = {
        "A": {"depends": [], "rule": action_A},
        "B": {"depends": [], "rule": action_B},
    }
    assert Builder().build(config) == [action_A, action_B]
```

---

## And Extensible

```py
class BuildTime(BuildBetter):
    def _check_keys(self, name, details):
        super()._check_keys(name, details)
        self._must("time" in details, f"No time for {name}")

def _refresh(self, config, node, actions):
        assert node in config, f"Unknown node {node}"
        if self._needs_update(config, node):
            actions.append(config[node]["rule"])

def _needs_update(self, config, node):
        return any(
            config[node]["time"] < config[d]["time"]
            for d in config[node]["depends"]
        )
```

---

## More Testing

```py
def test_diamond_dep():
    action_A = "build A"
    action_B = "build B"
    action_C = "build C"
    action_D = "build D"
    config = {
        "A": {"depends": ["B", "C"], "rule": action_A, "time": 0},
        "B": {"depends": ["D"], "rule": action_B, "time": 0},
        "C": {"depends": ["D"], "rule": action_C, "time": 1},
        "D": {"depends": [], "rule": action_D, "time": 1},
    }
    assert Builder().build(config) == [action_B, action_A]
```

---

class: summary

## Summary

[gnu_make]: https://www.gnu.org/software/make/
[snakemake]: https://snakemake.readthedocs.io/