class: slide-title

Software Design by Example

A Database
---

## The Problem

- Persisting objects (Chapter 16) lets us save and restore program state
- But we often want fast lookup *without* reloading all the data
- And interoperability across languages
- Create a simple log-structured database
---

## Starting Point

- A simple key-value store that lets us look things up
- User must provide a function that gets the key from a record

```py
class Database:
    def __init__(self, key_func):
        """Initialize with function to get key."""
        self._key_func = key_func

    def add(self, record):
        """Store the given record."""
        raise NotImplementedError("add")

    def get(self, key):
        """Return record associated with key or None."""
        raise NotImplementedError("get")
```

---

## Just a Dictionary

- Store in memory using a dictionary

```py
from interface_original import Database

class JustDict(Database):
    def __init__(self, key_func):
        super().__init__(key_func)
        self._data = {}

    def add(self, record):
        key = self._key_func(record)
        self._data[key] = record

    def get(self, key):
        return self._data.get(key, None)
```

- Lets us start writing tests

---

## Experimental Records

```py
class BasicRec:
    MAX_NAME_LEN = 6      # length of name in chars
    TIMESTAMP_LEN = 8     # length of timestamp in chars
    MAX_READING = 10      # maximum reading value
    MAX_READING_LEN = 2   # length of reading in chars
    MAX_READINGS_NUM = 2  # maximum number of readings

    @staticmethod
    def key(record):
        assert isinstance(record, BasicRec)
        return record._name

    def __init__(self, name, timestamp, readings):
        assert 0 < len(name) <= self.MAX_NAME_LEN
        assert 0 <= len(readings) <= self.MAX_READINGS_NUM
        assert all((0 <= r <= self.MAX_READING) for r in readings)
        self._name = name
        self._timestamp = timestamp
        self._readings = readings
```
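
- A minimal usage sketch (not in the original deck; assumes the classes above are in scope):

```py
# Hypothetical check that the pieces fit together
db = JustDict(BasicRec.key)
rec = BasicRec("ex01", 12345, [1, 2])
db.add(rec)
assert db.get("ex01") is rec   # look up by the record's name
assert db.get("ex99") is None  # missing keys return None
```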
---

## Test Fixtures

- Use the `pytest.fixture` decorator from Chapter 8

```py
import pytest

@pytest.fixture
def db():
    return JustDict(BasicRec.key)

@pytest.fixture
def ex01():
    return BasicRec("ex01", 12345, [1, 2])

@pytest.fixture
def ex02():
    return BasicRec("ex02", 67890, [3, 4])
```

---

## Tests

```py
def test_construct(db):
    assert db

def test_get_nothing_from_empty_db(db):
    assert db.get("something") is None

def test_add_then_get(db, ex01):
    db.add(ex01)
    assert db.get("ex01") == ex01

def test_add_two_then_get_both(db, ex01, ex02):
    db.add(ex01)
    db.add(ex02)
    assert db.get("ex01") == ex01
    assert db.get("ex02") == ex02

def test_add_then_overwrite(db, ex01):
    db.add(ex01)
    ex01._timestamp = 67890
    db.add(ex01)
    assert db.get("ex01") == ex01
```

---

## Refactor Interface

- We're going to need other record manipulation functions
- So save the record class instead of the key function

```py
class Database:
    def __init__(self, record_cls):
        """Initialize with data manipulation functions."""
        self._record_cls = record_cls

    def add(self, record):
        """Store the given record."""
        raise NotImplementedError("add")

    def get(self, key):
        """Return record associated with key or None."""
        raise NotImplementedError("get")
```
---

## Refactor Database

- Corresponding change to use a static method of the record class

```py
from interface import Database

class JustDictRefactored(Database):
    def __init__(self, record_cls):
        super().__init__(record_cls)
        self._data = {}

    def add(self, record):
        key = self._record_cls.key(record)
        self._data[key] = record

    def get(self, key):
        return self._data.get(key, None)
```

---

## Saving Records

- Records must know how to pack and unpack themselves
- Start by calculating the size of each

```py
class Experiment(BasicRec):
    RECORD_LEN = BasicRec.MAX_NAME_LEN + 1 \
        + BasicRec.TIMESTAMP_LEN + 1 \
        + (BasicRec.MAX_READING_LEN * BasicRec.MAX_READINGS_NUM) \
        + (BasicRec.MAX_READINGS_NUM - 1)

    @staticmethod
    def size():
        return Experiment.RECORD_LEN
```
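
- With the limits above that's 6 + 1 + 8 + 1 + (2 × 2) + 1 = 21 characters (a quick sanity check, not in the original deck):

```py
# name + separator + timestamp + separator + readings + inter-reading separator
assert Experiment.size() == 6 + 1 + 8 + 1 + 4 + 1 == 21
```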
---

## Packing

```py
@staticmethod
def pack(record):
    assert isinstance(record, Experiment)
    readings = "\0".join(str(r) for r in record._readings)
    result = f"{record._name}\0{record._timestamp}\0{readings}"
    if len(result) < Experiment.RECORD_LEN:
        result += "\0" * (Experiment.RECORD_LEN - len(result))
    return result
```

- Save as strings with null byte `\0` between them
- A real implementation would pack as binary (Chapter 17)
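
- A quick check of the fixed-length format (hypothetical, using the class above):

```py
rec = Experiment("ex01", 12345, [1, 2])
packed = Experiment.pack(rec)
assert len(packed) == Experiment.size()  # always padded to 21 characters
assert packed.startswith("ex01\x0012345\x001\x002")  # fields separated by null bytes
```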
---

## Unpacking

```py
@staticmethod
def unpack(raw):
    assert isinstance(raw, str)
    parts = raw.split("\0")
    name = parts[0]
    timestamp = int(parts[1])
    readings = [int(r) for r in parts[2:] if len(r)]
    return Experiment(name, timestamp, readings)
```

- Note: this doesn't handle strings that contain null bytes
- A real implementation would handle those cases

--

- Methods for packing and unpacking multiple records are straightforward (one possible shape below)
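
- The deck uses `pack_multi` and `unpack_multi` later without showing them; a plausible sketch relying on the fixed record length (an assumption, not the book's code):

```py
@staticmethod
def pack_multi(records):
    # Concatenate fixed-length packed records into one string
    return "".join(Experiment.pack(r) for r in records)

@staticmethod
def unpack_multi(raw):
    # Split on the fixed record length and unpack each chunk
    size = Experiment.size()
    chunks = [raw[i:i + size] for i in range(0, len(raw), size)]
    return [Experiment.unpack(c) for c in chunks]
```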
---

## A File-Backed Database

---

## A File-Backed Database

```py
from pathlib import Path

class FileBacked(Database):
    def __init__(self, record_cls, filename):
        super().__init__(record_cls)
        self._filename = Path(filename)
        if not self._filename.exists():
            self._filename.touch()
        self._load()

    def add(self, record):
        key = self._record_cls.key(record)
        self._data[key] = record
        self._save()

    def get(self, key):
        return self._data.get(key, None)
```

- Needs two helper methods
---

## A File-Backed Database

```py
def _save(self):
    packed = self._record_cls.pack_multi(self._data.values())
    with open(self._filename, "w") as writer:
        writer.write(packed)

def _load(self):
    assert self._filename.exists()
    with open(self._filename, "r") as reader:
        raw = reader.read()
    records = self._record_cls.unpack_multi(raw)
    self._data = {self._record_cls.key(r): r for r in records}
```

- Still saving and loading the entire database
- But look at all the infrastructure we've built

---

## Saving Blocks

- Save *N* records per block (see the check below)
- Keep the index in memory
- When writing, only modify one block (smaller and faster)
- When reading, only load one block (ditto)
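
- For example, with two records per block, integer division maps sequence IDs to blocks (a sanity check, not in the original deck):

```py
# seq_id // RECORDS_PER_BLOCK gives the block number
assert [s // 2 for s in range(6)] == [0, 0, 1, 1, 2, 2]
```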
---

## Allocating Blocks

---

## Store Blocks in Memory

```py
class Blocked(Database):
    RECORDS_PER_BLOCK = 2

    @staticmethod
    def size():
        return Blocked.RECORDS_PER_BLOCK

    def __init__(self, record_cls):
        super().__init__(record_cls)
        self._next = 0
        self._index = {}
        self._blocks = []

    def num_blocks(self):
        return len(self._blocks)

    def num_records(self):
        return len(self._index)
```

---

## Adding a Record

```py
def add(self, record):
    key = self._record_cls.key(record)
    seq_id = self._next_seq_id()
    self._index[key] = seq_id
    block_id = self._get_block_id(seq_id)
    block = self._get_block(block_id)
    block[seq_id] = record
```

- Get the sequence ID for this record
- Store the key-to-sequence mapping in the index
- Find or create the right block
- Add the record

---

## Getting a Record

```py
def get(self, key):
    if key not in self._index:
        return None
    seq_id = self._index[key]
    block_id = self._get_block_id(seq_id)
    block = self._get_block(block_id)
    return block[seq_id]
```

- Do we even know about this record?
- Find its current sequence ID
- Find the corresponding block
- Get the record

---

## Helper Methods

```py
def _next_seq_id(self):
    seq_id = self._next
    self._next += 1
    return seq_id

def _get_block_id(self, seq_id):
    return seq_id // Blocked.RECORDS_PER_BLOCK

def _get_block(self, block_id):
    while block_id >= len(self._blocks):
        self._blocks.append({})
    return self._blocks[block_id]
```

---

## Persisting Blocks

- Use inheritance to do everything described above while saving and loading blocks

```py
class BlockedFile(Blocked):
    def __init__(self, record_cls, db_dir):
        super().__init__(record_cls)
        self._db_dir = Path(db_dir)
        self._build_index()

    def add(self, record):
        super().add(record)
        self._save(record)

    def get(self, key):
        if key not in self._index:
            return None
        self._load(key)
        return super().get(key)
```

---

## Saving

```py
def _save(self, record):
    key = self._record_cls.key(record)
    seq_id = self._index[key]
    block_id = self._get_block_id(seq_id)
    block = self._get_block(block_id)
    packed = self._record_cls.pack_multi(block.values())
    filename = self._get_filename(block_id)
    with open(filename, "w") as writer:
        writer.write(''.join(packed))
```

- Have to pack and save all the records in the block

---

## Loading

```py
def _load(self, key):
    seq_id = self._index[key]
    block_id = self._get_block_id(seq_id)
    filename = self._get_filename(block_id)
    self._load_block(block_id, filename)

def _load_block(self, block_id, filename):
    with open(filename, "r") as reader:
        raw = reader.read()
    records = self._record_cls.unpack_multi(raw)
    base = self.size() * block_id
    block = self._get_block(block_id)
    for i, r in enumerate(records):
        block[base + i] = r
```

- Unpack all the records in the block

---

## Why Split Loading?

- Need to initialize the in-memory index when restarting the database

```py
def _build_index(self):
    seq_id = 0
    for (block_id, filename) in enumerate(
        sorted(self._db_dir.iterdir())
    ):
        self._load_block(block_id, filename)
        for record in self._get_block(block_id).values():
            key = self._record_cls.key(record)
            self._index[key] = seq_id
            seq_id += 1
```

- Obvious extension: save the index in another file (sketched below)
- Would have to profile (Chapter 15) to see if this was worthwhile
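
- A hypothetical version of that extension (names and format are assumptions, not the book's code):

```py
import json

INDEX_FILE = "index.json"  # hypothetical filename

def _save_index(self):
    # Persist the key-to-sequence mapping and the next sequence ID
    with open(self._db_dir / INDEX_FILE, "w") as writer:
        json.dump({"next": self._next, "index": self._index}, writer)

def _load_index(self):
    # Returns True if a saved index was found, so _build_index can be skipped
    path = self._db_dir / INDEX_FILE
    if not path.exists():
        return False
    with open(path, "r") as reader:
        saved = json.load(reader)
    self._next = saved["next"]
    self._index = saved["index"]
    return True
```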
---

## Next Steps

- Clean up unused files
- Compact storage periodically (sketched below)
- Use other data structures for indexing
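
- Overwritten records leave stale copies in old blocks; a minimal compaction sketch for the in-memory `Blocked` class (a file-backed version would also rewrite and delete block files):

```py
def compact(db):
    # Keep only records still reachable from the index, then re-add them
    live = [db.get(key) for key in list(db._index)]
    db._next = 0
    db._index = {}
    db._blocks = []
    for record in live:
        db.add(record)
```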
---

class: summary

## Summary