class: slide-title
Software Design by Example
Parsing Text
chapter
--- ## The Problem - `"2023-*{pdf,txt}"` is easier to read and write than `Lit("2023-", Any(Either("pdf", "txt")))` - How can we translate the former into the latter? -- 1. Group characters into
tokens
2. Use tokens to create an
abstract syntax tree
--- ## Cases - Characters like `{` and `*` can be processed immediately - But "regular" characters need to be accumulated - `Lit("abc")` rather than `Lit("a", Lit("b", Lit("c")))` - When we encounter a special character or '}', we close the current literal token - The `,` character closes a literal but doesn't produce a token --- ## A Bit More Design - The result is the final (flat) list of tokens - We could pass around a list and append to it - But we also need to know the characters in each `Literal` and the options in each `Either` - So create a class rather than a function - Easier than carrying state around explicitly --- ## Tokenizer ```py def tok(self, text): self._setup() for ch in text: if ch == "*": self._add("Any") elif ch == "{": self._add("EitherStart") elif ch == ",": self._add(None) elif ch == "}": self._add("EitherEnd") elif ch in CHARS: self.current += ch else: raise NotImplementedError(f"what is '{ch}'?") self._add(None) return self.result ``` --- ## Tokenizer - Call `self._setup()` at the start so that the tokenizer can be re-used - *Don't* call `self._add()` for regular characters - Add literals when we see special characters - And after all the input has been parsed --- ## Adding Tokens ```py def _add(self, thing): if len(self.current) > 0: self.result.append(["Lit", self.current]) self.current = "" if thing is not None: self.result.append([thing]) ``` - `self._add(None)` means "add the literal but nothing else"
--- ## Testing ```py def test_tok_empty_string(): assert Tokenizer().tok("") == [] def test_tok_any_either(): assert Tokenizer().tok("*{abc,def}") == [ ["Any"], ["EitherStart"], ["Lit", "abc"], ["Lit", "def"], ["EitherEnd"], ] ``` - Each sub-list represents one token --- ## Parsing ```py def _parse(self, tokens): if not tokens: return Null() front, back = tokens[0], tokens[1:] if front[0] == "Any": handler = self._parse_Any elif front[0] == "EitherStart": handler = self._parse_EitherStart elif front[0] == "Lit": handler = self._parse_Lit else: assert False, f"Unknown token type {front}" return handler(front[1:], back) ``` - `front[0]` is the token's name, `front[1:]` is any other data - `back` is the remaining tokens - Look for a
\_parse\_
thing
method to handle each token --- class: aside ## Introspection and Dispatch -
Introspection
: having a program look up a function or method inside itself while it is running -
Dynamic dispatch
: using introspection to decide what to do next rather than a long chain of `if` statements - These are powerful techniques and we use them frequently --- ## Fill in the Simple Stuff ```py def _parse_Any(self, rest, back): return Any(self._parse(back)) def _parse_Lit(self, rest, back): return Lit(rest[0], self._parse(back)) ``` - Hardest part is making sure to name them properly so that `_parse` can find them --- ## `Either` is Messy ```py def _parse_EitherStart(self, rest, back): if ( len(back) < 3 or (back[0][0] != "Lit") or (back[1][0] != "Lit") or (back[2][0] != "EitherEnd") ): raise ValueError("badly-formatted Either") left = Lit(back[0][1]) right = Lit(back[1][1]) return Either([left, right], self._parse(back[3:])) ``` - Remember, we didn't save the commas -- - It really should pull things from `back` until it hits `EitherEnd` --- ## A Better Implementation ```py def _parse_EitherStart(self, rest, back): children = [] while back and (back[0][0] == "Lit"): children.append(Lit(back[0][1])) back = back[1:] if not children: raise ValueError("empty Either") if back[0][0] != "EitherEnd": raise ValueError("badly-formatted Either") return Either(children, self._parse(back[1:])) ``` --- ## Testing ```py def test_parse_either_two_lit(): assert Parser().parse("{abc,def}") == Either( [Lit("abc"), Lit("def")] ) ``` - But this assumes we can compare `Match` objects --- class: aside ## They're Just Methods - `a == b` is "just" `a.__eq__(b)` -
Operator overloading
: if our class has methods with the right names, Python calls them to perform "built-in" operations - Parent `Match` class does shared work - E.g., make sure objects have the same
concrete class
- Child method (if any) does details - E.g., make sure two `Lit` objects are checking for the same text --- ## Infrastructure ```py class Match: def __init__(self, rest): self.rest = rest if rest else Null() def __eq__(self, other): return (other is not None and self.__class__ == other.__class__ and self.rest == other.rest) class Lit(Match): def __init__(self, chars, rest=None): super().__init__(rest) self.chars = chars def __eq__(self, other): return super().__eq__(other) and ( self.chars == other.chars ) ``` --- class: summary ## Summary