class: slide-title

Software Design by Example

A Database
---

## The Problem

- Persisting objects (Chapter 16) lets us save and restore program state
- But we often want fast lookup *without* reloading all the data
- And interoperability across languages
- Create a simple log-structured database
---

## Starting Point

- A simple key-value store that lets us look things up
- User must provide a function that gets the key from a record

```py
class Database:
    def __init__(self, key_func):
        """Initialize with function to get key."""
        self._key_func = key_func

    def add(self, record):
        """Store the given record."""
        raise NotImplementedError("add")

    def get(self, key):
        """Return record associated with key or None."""
        raise NotImplementedError("get")
```

---

## Just a Dictionary

- Store in memory using a dictionary

```py
from interface_original import Database

class JustDict(Database):
    def __init__(self, key_func):
        super().__init__(key_func)
        self._data = {}

    def add(self, record):
        key = self._key_func(record)
        self._data[key] = record

    def get(self, key):
        return self._data.get(key, None)
```

- Lets us start writing tests

---

## Experimental Records

```py
class BasicRec:
    MAX_NAME_LEN = 6      # length of name in chars
    TIMESTAMP_LEN = 8     # length of timestamp in chars
    MAX_READING = 10      # maximum reading value
    MAX_READING_LEN = 2   # length of reading in chars
    MAX_READINGS_NUM = 2  # maximum number of readings

    @staticmethod
    def key(record):
        assert isinstance(record, BasicRec)
        return record._name

    def __init__(self, name, timestamp, readings):
        assert 0 < len(name) <= self.MAX_NAME_LEN
        assert 0 <= len(readings) <= self.MAX_READINGS_NUM
        assert all((0 <= r <= self.MAX_READING) for r in readings)
        self._name = name
        self._timestamp = timestamp
        self._readings = readings
```
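
- A minimal usage sketch (not in the original deck; assumes the classes above are in scope):

```py
# Hypothetical check that the pieces fit together
db = JustDict(BasicRec.key)
rec = BasicRec("ex01", 12345, [1, 2])
db.add(rec)
assert db.get("ex01") is rec   # look up by the record's name
assert db.get("ex99") is None  # missing keys return None
```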
---

## Test Fixtures

- Use the `pytest.fixture` decorator from Chapter 8

```py
import pytest

@pytest.fixture
def db():
    return JustDict(BasicRec.key)

@pytest.fixture
def ex01():
    return BasicRec("ex01", 12345, [1, 2])

@pytest.fixture
def ex02():
    return BasicRec("ex02", 67890, [3, 4])
```

---

## Tests

```py
def test_construct(db):
    assert db

def test_get_nothing_from_empty_db(db):
    assert db.get("something") is None

def test_add_then_get(db, ex01):
    db.add(ex01)
    assert db.get("ex01") == ex01

def test_add_two_then_get_both(db, ex01, ex02):
    db.add(ex01)
    db.add(ex02)
    assert db.get("ex01") == ex01
    assert db.get("ex02") == ex02

def test_add_then_overwrite(db, ex01):
    db.add(ex01)
    ex01._timestamp = 67890
    db.add(ex01)
    assert db.get("ex01") == ex01
```

---

## Refactor Interface

- We're going to need other record manipulation functions
- So save the record class instead of the key function

```py
class Database:
    def __init__(self, record_cls):
        """Initialize with data manipulation functions."""
        self._record_cls = record_cls

    def add(self, record):
        """Store the given record."""
        raise NotImplementedError("add")

    def get(self, key):
        """Return record associated with key or None."""
        raise NotImplementedError("get")
```
---

## Refactor Database

- Corresponding change to use a static method of the record class

```py
from interface import Database

class JustDictRefactored(Database):
    def __init__(self, record_cls):
        super().__init__(record_cls)
        self._data = {}

    def add(self, record):
        key = self._record_cls.key(record)
        self._data[key] = record

    def get(self, key):
        return self._data.get(key, None)
```

---

## Saving Records

- Records must know how to pack and unpack themselves
- Start by calculating the size of each

```py
class Experiment(BasicRec):
    RECORD_LEN = BasicRec.MAX_NAME_LEN + 1 \
        + BasicRec.TIMESTAMP_LEN + 1 \
        + (BasicRec.MAX_READING_LEN * BasicRec.MAX_READINGS_NUM) \
        + (BasicRec.MAX_READINGS_NUM - 1)

    @staticmethod
    def size():
        return Experiment.RECORD_LEN
```
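
- With the limits above that's 6 + 1 + 8 + 1 + (2 × 2) + 1 = 21 characters (a quick sanity check, not in the original deck):

```py
# name + separator + timestamp + separator + readings + inter-reading separator
assert Experiment.size() == 6 + 1 + 8 + 1 + 4 + 1 == 21
```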
---

## Packing

```py
@staticmethod
def pack(record):
    assert isinstance(record, Experiment)
    readings = "\0".join(str(r) for r in record._readings)
    result = f"{record._name}\0{record._timestamp}\0{readings}"
    if len(result) < Experiment.RECORD_LEN:
        result += "\0" * (Experiment.RECORD_LEN - len(result))
    return result
```

- Save as strings with null byte `\0` between them
- A real implementation would pack as binary (Chapter 17)
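
- A quick check of the fixed-length format (hypothetical, using the class above):

```py
rec = Experiment("ex01", 12345, [1, 2])
packed = Experiment.pack(rec)
assert len(packed) == Experiment.size()  # always padded to 21 characters
assert packed.startswith("ex01\x0012345\x001\x002")  # fields separated by null bytes
```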
---

## Unpacking

```py
@staticmethod
def unpack(raw):
    assert isinstance(raw, str)
    parts = raw.split("\0")
    name = parts[0]
    timestamp = int(parts[1])
    readings = [int(r) for r in parts[2:] if len(r)]
    return Experiment(name, timestamp, readings)
```

- Note: this doesn't handle strings that contain null bytes
- A real implementation would handle those cases

--

- Methods for packing and unpacking multiple records are straightforward (one possible shape below)
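
- The deck uses `pack_multi` and `unpack_multi` later without showing them; a plausible sketch relying on the fixed record length (an assumption, not the book's code):

```py
@staticmethod
def pack_multi(records):
    # Concatenate fixed-length packed records into one string
    return "".join(Experiment.pack(r) for r in records)

@staticmethod
def unpack_multi(raw):
    # Split on the fixed record length and unpack each chunk
    size = Experiment.size()
    chunks = [raw[i:i + size] for i in range(0, len(raw), size)]
    return [Experiment.unpack(c) for c in chunks]
```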
---

## A File-Backed Database

---

## A File-Backed Database

```py
from pathlib import Path

class FileBacked(Database):
    def __init__(self, record_cls, filename):
        super().__init__(record_cls)
        self._filename = Path(filename)
        if not self._filename.exists():
            self._filename.touch()
        self._load()

    def add(self, record):
        key = self._record_cls.key(record)
        self._data[key] = record
        self._save()

    def get(self, key):
        return self._data.get(key, None)
```

- Needs two helper methods
---

## A File-Backed Database

```py
def _save(self):
    packed = self._record_cls.pack_multi(self._data.values())
    with open(self._filename, "w") as writer:
        writer.write(packed)

def _load(self):
    assert self._filename.exists()
    with open(self._filename, "r") as reader:
        raw = reader.read()
    records = self._record_cls.unpack_multi(raw)
    self._data = {self._record_cls.key(r): r for r in records}
```

- Still saving and loading the entire database
- But look at all the infrastructure we've built

---

## Saving Blocks

- Save *N* records per block (see the check below)
- Keep the index in memory
- When writing, only modify one block (smaller and faster)
- When reading, only load one block (ditto)
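
- For example, with two records per block, integer division maps sequence IDs to blocks (a sanity check, not in the original deck):

```py
# seq_id // RECORDS_PER_BLOCK gives the block number
assert [s // 2 for s in range(6)] == [0, 0, 1, 1, 2, 2]
```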
---

## Allocating Blocks

---

## Store Blocks in Memory

```py
class Blocked(Database):
    RECORDS_PER_BLOCK = 2

    @staticmethod
    def size():
        return Blocked.RECORDS_PER_BLOCK

    def __init__(self, record_cls):
        super().__init__(record_cls)
        self._next = 0
        self._index = {}
        self._blocks = []

    def num_blocks(self):
        return len(self._blocks)

    def num_records(self):
        return len(self._index)
```

---

## Adding a Record

```py
def add(self, record):
    key = self._record_cls.key(record)
    seq_id = self._next_seq_id()
    self._index[key] = seq_id
    block_id = self._get_block_id(seq_id)
    block = self._get_block(block_id)
    block[seq_id] = record
```

- Get the sequence ID for this record
- Store the key-to-sequence mapping in the index
- Find or create the right block
- Add the record

---

## Getting a Record

```py
def get(self, key):
    if key not in self._index:
        return None
    seq_id = self._index[key]
    block_id = self._get_block_id(seq_id)
    block = self._get_block(block_id)
    return block[seq_id]
```

- Do we even know about this record?
- Find its current sequence ID
- Find the corresponding block
- Get the record

---

## Helper Methods

```py
def _next_seq_id(self):
    seq_id = self._next
    self._next += 1
    return seq_id

def _get_block_id(self, seq_id):
    return seq_id // Blocked.RECORDS_PER_BLOCK

def _get_block(self, block_id):
    while block_id >= len(self._blocks):
        self._blocks.append({})
    return self._blocks[block_id]
```

---

## Persisting Blocks

- Use inheritance to do everything described above while saving and loading blocks

```py
class BlockedFile(Blocked):
    def __init__(self, record_cls, db_dir):
        super().__init__(record_cls)
        self._db_dir = Path(db_dir)
        self._build_index()

    def add(self, record):
        super().add(record)
        self._save(record)

    def get(self, key):
        if key not in self._index:
            return None
        self._load(key)
        return super().get(key)
```

---

## Saving

```py
def _save(self, record):
    key = self._record_cls.key(record)
    seq_id = self._index[key]
    block_id = self._get_block_id(seq_id)
    block = self._get_block(block_id)
    packed = self._record_cls.pack_multi(block.values())
    filename = self._get_filename(block_id)
    with open(filename, "w") as writer:
        writer.write(''.join(packed))
```

- Have to pack and save all the records in the block

---

## Loading

```py
def _load(self, key):
    seq_id = self._index[key]
    block_id = self._get_block_id(seq_id)
    filename = self._get_filename(block_id)
    self._load_block(block_id, filename)

def _load_block(self, block_id, filename):
    with open(filename, "r") as reader:
        raw = reader.read()
    records = self._record_cls.unpack_multi(raw)
    base = self.size() * block_id
    block = self._get_block(block_id)
    for i, r in enumerate(records):
        block[base + i] = r
```

- Unpack all the records in the block

---

## Why Split Loading?

- Need to initialize the in-memory index when restarting the database

```py
def _build_index(self):
    seq_id = 0
    for (block_id, filename) in enumerate(
        sorted(self._db_dir.iterdir())
    ):
        self._load_block(block_id, filename)
        for record in self._get_block(block_id).values():
            key = self._record_cls.key(record)
            self._index[key] = seq_id
            seq_id += 1
```

- Obvious extension: save the index in another file (sketched below)
- Would have to profile (Chapter 15) to see if this was worthwhile
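
- A hypothetical version of that extension (names and format are assumptions, not the book's code):

```py
import json

INDEX_FILE = "index.json"  # hypothetical filename

def _save_index(self):
    # Persist the key-to-sequence mapping and the next sequence ID
    with open(self._db_dir / INDEX_FILE, "w") as writer:
        json.dump({"next": self._next, "index": self._index}, writer)

def _load_index(self):
    # Returns True if a saved index was found, so _build_index can be skipped
    path = self._db_dir / INDEX_FILE
    if not path.exists():
        return False
    with open(path, "r") as reader:
        saved = json.load(reader)
    self._next = saved["next"]
    self._index = saved["index"]
    return True
```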
---

## Next Steps

- Clean up unused files
- Compact storage periodically (sketched below)
- Use other data structures for indexing
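
- Overwritten records leave stale copies in old blocks; a minimal compaction sketch for the in-memory `Blocked` class (a file-backed version would also rewrite and delete block files):

```py
def compact(db):
    # Keep only records still reachable from the index, then re-add them
    live = [db.get(key) for key in list(db._index)]
    db._next = 0
    db._index = {}
    db._blocks = []
    for record in live:
        db.add(record)
```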
---

class: summary

## Summary