Glob Pattern Matcher
- Shell-style glob patterns like
*.gleamanddata?.csvcan be represented as lists of typed elements. - We can parse the usual string representation to create such lists.
- Recursive pattern matching consumes one element at a time.
- The
Wildcardcase requires backtracking: try skipping the wildcard, and if that fails, consume one input character and try again. - Use
gleeunitto write unit tests for Gleam.
The Problem
- Command-line shells use patterns like
*.gleam,data?.csv, andreport_2024*to match groups of filenames- These patterns are called globs (short for "global")
-
Three rules cover everything a basic glob matcher needs:
-
*matches any sequence of characters, including the empty sequence ?matches any single character-
Everything else matches exactly itself
-
Writing a glob matcher uses custom types, recursion, and pattern matching in a single small program
Representing Pattern Elements
- First design decision is how to represent a parsed pattern
- A
Stringwould lose the structure: we could not easily tell a literal asterisk from a wildcard metacharacter. - A custom type makes each case explicit:
pub type Elem {
Literal(String)
AnyChar
Wildcard
}
Literal(String)holds a single character that must match exactlyAnyCharmatches any one character and holds no dataWildcardmatches zero or more characters and holds no data- Adding a new rule (e.g., character classes) means adding a new variant,
and the compiler will flag every
casethat does not handle it List(Elem)is the natural representation for a whole pattern[Wildcard, Literal(".gl")]represents*.gl
- Get matching to work, then build a parser to turn strings into these lists
Matching a Pattern Against Text
- Matching consumes elements and characters in parallel:
pub fn match_pattern(pattern: List(Elem), text: String) -> Bool {
match_pattern_chars(pattern, string_to_chars(text))
}
fn match_pattern_chars(pattern: List(Elem), chars: List(String)) -> Bool {
case pattern, chars {
[], [] -> True
[], _ -> False
[Literal(c), ..pat_rest], _ ->
case drop_prefix(string.to_graphemes(c), chars) {
Ok(remaining) -> match_pattern_chars(pat_rest, remaining)
Error(_) -> False
}
[AnyChar, ..pat_rest], [_, ..char_rest] ->
match_pattern_chars(pat_rest, char_rest)
[Wildcard, ..pat_rest], _ -> {
case match_pattern_chars(pat_rest, chars) {
True -> True
False -> {
case chars {
[] -> False
[_, ..char_rest] -> match_pattern_chars(pattern, char_rest)
}
}
}
}
_, _ -> False
}
}
match_patternconverts the input string to a list of graphemes and delegates tomatch_pattern_charsmatch_pattern_charsdispatches on the pair(pattern, chars):- Both empty: success
- Pattern empty but input remains: failure (we required more)
Literal(c)with a matching character: advance both and recurseAnyCharwith any character: advance both and recurseWildcard: the two-branch case described below- Anything else (literal mismatch, empty input when pattern not empty): failure
Handling Wildcards
- The
Wildcardbranch is the most interesting one - A wildcard can match zero characters (the pattern continues from the same position) or one or more characters (the input advances by one)
- The code tries the zero-match option first:
[Wildcard, ..pat_rest], _ -> {
case match_pattern_chars(pat_rest, chars) {
True -> True
False -> {
case chars {
[] -> False
[_, ..char_rest] -> match_pattern_chars(pattern, char_rest)
}
}
}
}
- Try
match_pattern_chars(prest, chars): does the rest of the pattern match the current input without consuming anything for the wildcard? - If yes, return
True - If no and input is exhausted, return
False -
If no and input remains, consume one character and retry with the same pattern
- The wildcard is still active
-
This is depth-first backtracking
- For most practical inputs it is fast enough
- Pathological patterns like
*a*a*a*a*a*bagainst a long string ofas can be slow, but that is a known trade-off in backtracking matchers
- Pathological patterns like
Parsing Patterns
src/match_demo.gleamconstructs pattern lists by hand- For real use we need a function that turns
"*.gleam"into[Wildcard, Literal(".gleam")]
fn parse_pattern(s: String) -> List(Elem) {
let chars = string.to_graphemes(s)
parse_chars(chars, [])
}
fn parse_chars(chars: List(String), acc: List(Elem)) -> List(Elem) {
case chars {
[] -> list.reverse(acc)
["*", ..rest] -> parse_chars(rest, [Wildcard, ..acc])
["?", ..rest] -> parse_chars(rest, [AnyChar, ..acc])
_ -> {
let #(literal, remaining) = take_literal(chars, [])
parse_chars(remaining, [Literal(literal), ..acc])
}
}
}
parse_patternconverts the string to graphemes and calls the recursiveparse_chars- This pattern of "entry point function calls recursive function" is used a lot
parse_charsaccumulates elements in reverse and reverses at the end"*"becomesWildcard,"?"becomesAnyChar- Consecutive non-special characters are collected into a single
Literalbytake_literal- Prevents
Literal("h"),Literal("e"),Literal("l"),Literal("l"),Literal("o")
- Prevents
- Notice that
Literalholds aString, not aUtfCodepoint- This lets one
Literalrepresent a multi-character run, which is more efficient and easier to read in debug output
- This lets one
Running the Examples
gleam run --module match_demo
[Wildcard, Literal(".gleam")]
[Literal("data"), AnyChar, Literal(".csv")]
[Literal("hello")]
- The first three test
*.gleamagainst:hello.gleam(match)hello.rs(no match).gleam(match, because*can match zero characters)
- The last three test
d?tagainst:dot(match)dat(match)dog(no match, thetis missing)
Writing Tests
- Create
test/glob_test.gleam:
pub fn literal_match_test() {
match_pattern([Literal("a")], "a")
|> should.be_true()
}
pub fn anychar_test() {
match_pattern([AnyChar], "x")
|> should.be_true()
}
pub fn wildcard_many_test() {
match_pattern([Wildcard, Literal(".gleam")], "mymodule.gleam")
|> should.be_true()
}
- Each test covers exactly one behavior
- Test names end in
_testso gleeunit discovers them automatically should.be_true()andshould.be_false()are clearer thanshould.equal(True)for Boolean results- Run with
gleam test
Filtering a List of Files
Combining the parser and matcher gives a useful utility:
pub fn filter_files(
pattern: String,
files: List(String),
) -> List(String) {
let elems = parse_pattern(pattern)
list.filter(files, match_pattern(elems, _))
}
match_pattern(elems, _)is a partial application: it creates a one-argument function that tests a single filenamelist.filterapplies it to every elementparse_patternruns once, whileList(Elem)is reused for every file- This is an example of using a function value as data
list.filterexpectsfn(String) -> Bool- Partial application produces exactly that without writing a named helper function
Check Understanding
Why does wildcard try zero characters first?
If the wildcard tried consuming a character first,
a pattern like *x against "x" would fail:
the wildcard would consume x,
leaving nothing for the literal x to match.
Trying zero first means a wildcard only uses up characters
when the rest of the pattern cannot match without them.
Why does parse_chars prepend to acc and call list.reverse at the end rather than appending each element?
Gleam lists are singly-linked, so prepending ([elem, ..acc]) is O(1):
it just creates a new node pointing at the existing list.
Appending to the end, by contrast, requires walking the whole list to find the last node,
making each append O(n) and the full parse O(n²).
Building the list in reverse with prepend and reversing once at the end keeps the total work O(n).
Exercises
Parse integration (15 minutes)
Wire parse_pattern to filter_files
so the function accepts a String pattern rather than a List(Elem).
Write gleeunit tests for filter_files that cover:
- an empty file list
- a pattern that matches no files
- a pattern that matches all files
- a mixed result
Character sets (20 minutes)
Add a CharSet(List(String)) variant to Elem.
A CharSet matches any single character that appears in its list,
so CharSet(["a", "e", "i", "o", "u"]) matches any vowel.
Update match_pattern_chars to handle the new variant.
Write at least four tests.
Negated character sets (15 minutes)
Add NegCharSet(List(String)) that matches any character not in the list.
Update the matcher and write tests confirming that
it does not match characters in the list
but does match characters outside it.
Anchor to start and end (10 minutes)
Shell globs implicitly match anywhere unless anchored.
Add Start and End variants that require the match to begin at the first character
and end at the last character respectively.
Update match_pattern (not just match_pattern_chars) to interpret them.