A File Archiver
- Build a content-addressable file archiver like Git's object store
- Hash files, deduplicate by content, and store blobs under their hash
- Combine everything from previous lessons: types, pipelines, IO, and testing with
#guard
- A content-addressable store identifies files by their hash
rather than their name
- Two files with the same content get the same address
- Like Git's
.git/objectsdirectory
- We'll build it in layers: types first, then pure logic, then I/O
- See the JavaScript and Python versions for comparison
Type Aliases
i
-- A content address is just a hex string of the hash
abbrev Hash := String
abbrevcreates a type alias (a shorter name for an existing type)HashandStringare interchangeable
- A content address is just a hex string of the file's hash
i
-- A manifest maps each original file path to its content hash
abbrev Manifest := List (String × Hash)
- A manifest maps original file paths to their content hashes
List (String × Hash)is a list of(path, hash)pairsabbrevmakes the intent clearer than writingList (String × Hash)everywhere
abbrev vs. def for Type Aliases
- Lean has two ways to create a type alias:
abbrevanddef abbrevis transparent: the type checker sees through it and treatsHashexactly asString- Type class instances for
String(likeToStringorBEq) automatically apply toHash - You can pass a
Hashanywhere aStringis expected without any conversion
- Type class instances for
defis opaque: the type checker treats the alias as a distinct type- Instances do not transfer automatically; you must re-derive or re-declare them
- Passing a
def-aliased type where the original is expected is a type error
- Use
abbrevwhen the alias is purely for readability and you want full interchangeability - Use
defwhen you want the type system to enforce that the aliased type is not accidentally mixed with the original
Hashing Bytes
i
-- Hash a ByteArray to a 16-character hex string
def hashBytes (ba : ByteArray) : Hash :=
let h : UInt64 := hash ba
let hex := "0123456789abcdef".toList
let digits := (List.range 16).map fun i =>
let nibble := (h >>> (60 - i * 4).toUInt64) &&& 0xF
hex[nibble.toNat]!
String.ofList digits
hashBytestakes aByteArrayand returns a 16-character hexHashhashis Lean's built-in structural hash function- Returns a
UInt64(unsigned 64-bit integer) - A real archiver would use SHA-256, but this is good enough for teaching
- Returns a
- The body formats the
UInt64into hex by extracting each 4-bit nibbleh >>> nshifts the hash right bynbits&&& 0xFmasks off the lowest 4 bits to get one hex digit
String.ofList digitsconverts the list ofCharto aStringString.mkwould also work, but Lean now prefersString.ofList
Building the Manifest
i
-- Pure: deduplicate a list of (path, bytes) pairs into (Manifest, blob store)
def buildManifest (files : List (String × ByteArray)) :
Manifest × List (Hash × ByteArray) :=
let entries := files.map fun (path, bytes) => (path, hashBytes bytes, bytes)
let manifest := entries.map fun (path, h, _) => (path, h)
let blobs := entries.foldl (init := []) fun acc (_, h, bytes) =>
if acc.any (·.1 == h) then acc else acc ++ [(h, bytes)]
(manifest, blobs)
buildManifestis the pure core of the archiver- No I/O: takes data in, returns data out
- Hence the name "pure"
- Step 1: hash every file's bytes and create
(path, hash, bytes)triples - Step 2: extract just the
(path, hash)pairs into the manifest - Step 3: deduplicate blobs by checking if a hash already exists in the accumulator
acc.any (·.1 == h): "does any entry inacchave.1equal toh?"(·.1 == h)uses the·placeholder for the tuple's first field
- Returns a pair
(manifest, blobs)manifestmaps every path to its hash (even duplicates)blobsstores each unique(hash, bytes)pair only once
Snapshot: Writing Files
i
-- IO: write each unique blob to a file named <hash>.bck
def snapshot (files : List (String × ByteArray)) (archiveDir : String) :
IO Manifest := do
let (manifest, blobs) := buildManifest files
for (h, bytes) in blobs do
IO.FS.writeBinFile s!"{archiveDir}/{h}.bck" bytes
return manifest
snapshotis an I/O function: note theIO Manifestreturn type- Calls the pure
buildManifest, then writes each unique blob to disk for (h, bytes) in blobs doiterates over the blob list (instead of recursing)- Destructures each
(hash, bytes)pair in the loop header
- Destructures each
IO.FS.writeBinFilewrites raw bytes to a file- Like Python's
open(path, "wb").write(data)
- Like Python's
- Each blob is saved as
archiveDir/<hash>.bck- The
.bckextension is just a convention — like Git's loose object format
- The
- Returns the manifest so the caller can save or transmit it
Restore: Reading Files Back
i
-- IO: read blobs from the archive and write them back to their original paths
def restore (manifest : Manifest) (archiveDir : String) : IO Unit := do
for (path, h) in manifest do
let bytes ← IO.FS.readBinFile s!"{archiveDir}/{h}.bck"
IO.FS.writeBinFile path bytes
restorereverses the process- Read blobs from the archive
- Write them to original paths
IO Unitmeans "does I/O but returns nothing useful"- Like a Python function that returns
None
- Like a Python function that returns
←captures the result ofIO.FS.readBinFile- Reading from disk is I/O, so we need the left arrow
s!"{archiveDir}/{h}.bck"constructs the blob path using string interpolation- Then
IO.FS.writeBinFilewrites the bytes back to the original file path
Listing a Directory
i
-- Collect all regular files in a directory as (path, bytes) pairs
def listFiles (dir : String) : IO (List (String × ByteArray)) := do
let entries ← (dir : System.FilePath).readDir
let mut files : List (String × ByteArray) := []
for entry in entries do
let info ← entry.path.metadata
if info.type == IO.FS.FileType.file then
let bytes ← IO.FS.readBinFile entry.path
files := files ++ [(entry.path.toString, bytes)]
return files
dir.readDirlists all entries in a directory and returns anArray IO.FS.DirEntrydirmust be aSystem.FilePath; aStringis coerced automatically- Like Python's
os.scandir(dir)— returns both files and subdirectories
entry.path.metadatainspects the filesystem entry- Returns an
IO.FS.Metadatarecord with a.typefield IO.FS.FileType.filemeans a regular file;.diris a subdirectory
- Returns an
let mut files := []accumulates results using a mutable variable in thedoblock (see io)- The result is a list of
(path, bytes)pairs — exactly whatsnapshotalready expects
Archiving a Directory
i
-- Archive every regular file in srcDir to archiveDir
def snapshotDir (srcDir archiveDir : String) : IO Manifest := do
IO.FS.createDirAll archiveDir
let files ← listFiles srcDir
snapshot files archiveDir
snapshotDiris four lines: create the archive directory, list files, callsnapshotsnapshot(fromarchive.lean) is unchanged — it accepts anyList (String × ByteArray)- This is the payoff of pure-core design: the I/O boundary is thin and swappable
- Swapping hand-written sample files for real disk files required only
listFiles
- Swapping hand-written sample files for real disk files required only
i
-- Entry point: archive a directory and report results
-- Run with: lean --run archive/archive_dir.lean
def main : IO Unit := do
let archiveDir := "archive-dir"
IO.println s!"Scanning current directory..."
let manifest ← snapshotDir "." archiveDir
IO.println s!"Archived {manifest.length} file(s) to {archiveDir}/"
let blobs := (buildManifest (← listFiles ".")).2.length
IO.println s!"Unique content blobs: {blobs}"
IO.println "Done."
- To compile and type-check:
lake env lean archive/archive_dir.lean
- To run and create a real archive:
lean --run archive/archive_dir.lean
- The distinction between these two commands is important; see build for a full explanation
Testing the Pure Core
i
-- Tests (pure core) ---------------------------------------------------
def sampleFiles : List (String × ByteArray) := [
("a.txt", "hello".toUTF8),
("b.txt", "hello".toUTF8), -- same content as a.txt
("c.txt", "world".toUTF8)
]
-- Two files with identical content should produce only two blobs
#guard (buildManifest sampleFiles).2.length == 2
-- Both a.txt and b.txt map to the same hash
#guard
let (manifest, _) := buildManifest sampleFiles
(manifest.find? (·.1 == "a.txt")).map (·.2) ==
(manifest.find? (·.1 == "b.txt")).map (·.2)
- We can test the pure logic without touching the filesystem
- Because
buildManifestis pure, tests are fast and deterministic
- Because
sampleFilescreates three files:a.txtandb.txthave identical content"hello".toUTF8converts a string to aByteArray- Like Python's
"hello".encode("utf-8")
- Like Python's
- First
#guard: two identical files should produce only two blobs, not three.2.lengthaccesses the blob list (second element of the pair) and counts it
- Second
#guard:a.txtandb.txtshould map to the same hashmanifest.find? (·.1 == "a.txt")finds the entry fora.txt.map (·.2)extracts the hash from the found entry
- Both
#guardexpressions evaluate totrue, so compilation succeeds silently
The Full Picture
- Forty lines of code implement a content-addressable archive
- The design separates pure computation from I/O
buildManifestcan be tested without touching disksnapshotandrestoredo the actual reading and writing
- This is the functional style in practice:
- Types describe the data (
Hash,Manifest) - Pure functions transform it (
buildManifest) - A thin I/O layer connects to the outside world
- Types describe the data (
Running the Program
i
-- Entry point: snapshot sample files to ./archive-dir, then restore them
-- Run with: lean --run archive/archive.lean
def main : IO Unit := do
let archiveDir := "archive-dir"
-- create the archive directory (ok if it already exists)
IO.FS.createDirAll archiveDir
IO.println s!"Snapshotting to {archiveDir}/"
let manifest ← snapshot sampleFiles archiveDir
IO.println s!"Created {manifest.length} manifest entries"
IO.println "Restoring files..."
restore manifest archiveDir
IO.println "Done."
mainis the entry point when you run a Lean file as a program- Like Python's
if __name__ == "__main__":block
- Like Python's
IO.FS.createDirAll archiveDircreates the archive directory- Won't fail if the directory already exists
- The steps match our earlier sections exactly: snapshot, then restore
IO.printlnoutputs a line of text to stdoutmanifest.lengthis the number of entries in the manifest
Compiling and Running
- To compile (check types, run
#evaland#guard):lake env lean archive/archive.lean- This is what we've been doing all along
- The
#guardtests run at compile time mainis not executed: it's just type-checked
- To run the program (execute
main):lean --run archive/archive.lean- This compiles the file and calls the
mainfunction
- After running, check the results:
ls archive-dir/shows the.bckblob files- Two files, because
a.txtandb.txtare deduplicated cat a.txt,cat b.txt,cat c.txtshow the restored content in the current directory
- The two commands serve different purposes:
lake env leanfor development: type-check and testlean --runfor execution: actually do the I/O
Exercises
Write: Total File Size
Write a pure function totalSize : Manifest × List (Hash × ByteArray) → Nat that
returns the sum of all blob sizes (in bytes). Use ByteArray.size to get the size of
each blob.
hint
- Extract the blob list from the pair with
.2 - Use
List.foldlwith(· + ·)and an initial value of0 - Use
ByteArray.sizeinside the map to convert bytes to sizes
Write: Count Duplicates
Write a pure function countDuplicates (files : List (String × ByteArray)) : Nat that
returns how many files have content that also appears under another name. For the sample
data, this should return 2 (both a.txt and b.txt are duplicates of each other).
hint
- Build the manifest with
buildManifest - Group files by hash: for each hash, count how many paths map to it
- Sum the counts for hashes that appear more than once
Write: Manifest to String
Write manifestToString : Manifest → String that formats each entry as "path:hash"
with one entry per line. Use String.intercalate "\n" to join the lines.
hint
- Use
List.mapto convert each(path, hash)pair tos!"{path}:{hash}" - Then
String.intercalate "\n"to join with newlines
Fix: Hash Digits in Wrong Order
i
-- hashBytes should produce correct hex digits, but the loop extracts
-- nibbles in the wrong order. The result is reversed. Fix it.
def hashBytes (ba : ByteArray) : String :=
let h : UInt64 := hash ba
let hex := "0123456789abcdef".toList
let digits := (List.range 16).map fun i =>
let nibble := (h >>> (i * 4).toUInt64) &&& 0xF
hex[nibble.toNat]!
String.ofList digits
#guard hashBytes "hello".toUTF8 == hashBytes "hello".toUTF8
#guard hashBytes "hello".toUTF8 != hashBytes "world".toUTF8
hint
List.range 16generates[0, 1, …, 15], andi * 4shifts by 0, 4, 8, … bits each iteration- This extracts the nibbles from least-significant to most-significant, but prints them in that same order
- The most-significant nibble should come first in the output
- Reverse the shift:
60 - i * 4to start from the top nibble
Fix: Blobs Not Deduplicated
i
-- buildManifest should deduplicate blobs by hash, but it keeps every
-- blob entry instead. The blob store has duplicates. Fix the dedup check.
abbrev Hash := String
def hashBytes (ba : ByteArray) : Hash :=
toString (hash ba)
def buildManifest (files : List (String × ByteArray)) :
List (String × Hash) × List (Hash × ByteArray) :=
let entries := files.map fun (path, bytes) => (path, hashBytes bytes, bytes)
let manifest := entries.map fun (path, h, _) => (path, h)
let blobs := entries.map fun (_, h, bytes) => (h, bytes)
(manifest, blobs)
def sample : List (String × ByteArray) := [
("a.txt", "hello".toUTF8),
("b.txt", "hello".toUTF8), -- same content as a.txt
("c.txt", "world".toUTF8)
]
#guard (buildManifest sample).2.length == 2
hint
entries.mapcopies every blob, including duplicates- The original
buildManifestusesfoldlwith a check:if acc.any (·.1 == h) then acc else acc ++ [(h, bytes)] - Add a similar deduplication step before collecting blobs
Fix: Snapshot Missing Directory
i
-- snapshot should create the archive directory before writing blobs,
-- but it doesn't. The first file write will fail. Add the missing directory
-- creation step.
abbrev Hash := String
def snapshot (files : List (String × ByteArray)) (archiveDir : String) :
IO Unit := do
for (path, bytes) in files do
let h : Hash := toString (hash bytes)
IO.FS.writeBinFile s!"{archiveDir}/{h}.bck" bytes
#eval snapshot [("a.txt", "hello".toUTF8)] "test-archive"
hint
IO.FS.writeBinFilewill fail if the parent directory doesn't exist- Add
IO.FS.createDirAll archiveDirbefore theforloop createDirAllwon't fail if the directory already exists (like Python'sos.makedirs(exist_ok=True))
Fix: Restore Uses Wrong Path
i
-- restore constructs the blob path wrong: it uses the file path instead
-- of the hash to read the blob. Fix the file path construction.
abbrev Hash := String
abbrev Manifest := List (String × Hash)
def restore (manifest : Manifest) (archiveDir : String) : IO Unit := do
for (path, _h) in manifest do
let bytes ← IO.FS.readBinFile s!"{archiveDir}/{path}.bck"
IO.FS.writeBinFile path bytes
-- To test: first run snapshot on sample data, then restore with this.
-- The bug path uses the file name instead of the hash for the blob file.
hint
- The blob file is named
archiveDir/<hash>.bck, notarchiveDir/<path>.bck - The code uses
path(the original filename) as the hash — that's wrong - Fix the destructuring:
for (path, h) in manifest doand usehin the blob path
Write: Find File in Manifest
i
-- Look up a file path in the manifest and return its hash (if present).
-- You may use the existing definitions from archive.lean.
abbrev Hash := String
abbrev Manifest := List (String × Hash)
def findFile (manifest : Manifest) (path : String) : Option Hash :=
none
#guard findFile [("a.txt", "abc123"), ("b.txt", "def456")] "a.txt" == some "abc123"
#guard findFile [("a.txt", "abc123")] "c.txt" == none
#guard findFile [] "x.txt" == none
hint
- Use
List.find?to search for the path in the manifest - The predicate is
fun (p, _) => p == path - If found, extract the hash with
.map (·.2) List.find?already returnsOption, so the result type matches
Write: Count Unique Blobs
i
-- Count how many unique blobs are in the blob store.
-- A blob store is a list of (hash, bytes) pairs.
abbrev Hash := String
def countUniqueBlobs (blobs : List (Hash × ByteArray)) : Nat :=
0
def sampleBlobs : List (Hash × ByteArray) := [
("abc", "hello".toUTF8),
("def", "world".toUTF8),
("abc", "hello".toUTF8) -- duplicate hash, same content
]
#guard countUniqueBlobs sampleBlobs == 2
#guard countUniqueBlobs [] == 0
#guard countUniqueBlobs [("x", "a".toUTF8)] == 1
hint
- Extract all hashes from the blobs list, then deduplicate
- Use
List.mapto get the hash list, then keep only the first occurrence of each - Same pattern as
dbCompactfrom the database lesson: track seen hashes in afoldl - Return
List.lengthof the deduplicated list
Write: Find Files by Hash
i
-- Find all file paths in the manifest that have the given hash.
abbrev Hash := String
abbrev Manifest := List (String × Hash)
def findByHash (manifest : Manifest) (target : Hash) : List String :=
[]
def m : Manifest := [("a.txt", "abc"), ("b.txt", "def"), ("c.txt", "abc")]
#guard findByHash m "abc" == ["a.txt", "c.txt"]
#guard findByHash m "xyz" == []
#guard findByHash [] "abc" == []
hint
- Use
List.filterto keep only entries where the hash matches - The predicate is
fun (_, h) => h == target - Then use
List.map (·.1)to extract just the file paths - Like finding all files that share the same content (deduplicated files)
Write: Diff Two Manifests
i
-- Compare two manifests. Return the paths that appear in the first
-- manifest but not the second (new or renamed files).
abbrev Hash := String
abbrev Manifest := List (String × Hash)
def diffManifests (old new : Manifest) : List String :=
[]
def oldManifest : Manifest := [("a.txt", "abc"), ("b.txt", "def")]
def newManifest : Manifest := [("a.txt", "abc"), ("c.txt", "xyz")]
#guard diffManifests oldManifest newManifest == ["b.txt"]
#guard diffManifests newManifest oldManifest == ["c.txt"]
#guard diffManifests [] [] == []
hint
- Find paths that are in
oldbut not innew - Use
List.filteron the old manifest with a predicate that checks if the path exists in new new.any (fun (p, _) => p == path)checks if a path exists in the new manifest- This is like comparing two snapshots to find renamed or deleted files
Fix: Manifest Contains Wrong Hash
i
-- buildManifest constructs the manifest, but the path and hash are mispaired.
-- The entry for each file should contain (path, hash). Fix the pairing.
abbrev Hash := String
abbrev Manifest := List (String × Hash)
def hashBytes (ba : ByteArray) : Hash :=
let h : UInt64 := hash ba
let hex := "0123456789abcdef".toList
let digits := (List.range 16).map fun i =>
let nibble := (h >>> (60 - i * 4).toUInt64) &&& 0xF
hex[nibble.toNat]!
String.ofList digits
def buildManifest (files : List (String × ByteArray)) :
Manifest × List (Hash × ByteArray) :=
let entries := files.map fun (path, bytes) => (path, hashBytes bytes, bytes)
-- BUG: manifest pairs each path with the wrong file's hash
-- The manifest is built by zipping paths with the *reversed* hash list
let hashes := entries.map fun (_, h, _) => h
let manifest := List.zip
(entries.map fun (p, _, _) => p)
hashes.reverse
let blobs := entries.foldl (init := []) fun acc (_, h, bytes) =>
if acc.any (·.1 == h) then acc else acc ++ [(h, bytes)]
(manifest, blobs)
#guard
let files := [("a.txt", "hello".toUTF8), ("b.txt", "world".toUTF8)]
let (manifest, _) := buildManifest files
(manifest.find? (·.1 == "a.txt")).map (·.2) == some (hashBytes "hello".toUTF8)
#guard
let files := [("a.txt", "hello".toUTF8), ("b.txt", "hello".toUTF8)]
let (manifest, _) := buildManifest files
(manifest.find? (·.1 == "a.txt")).map (·.2) ==
(manifest.find? (·.1 == "b.txt")).map (·.2)
hint
- The manifest should pair each path with that file's own hash
- The bug reverses the hash list before pairing, so
a.txtgetsb.txt's hash and vice versa - Remove
.reverseso hashes stay in the same order as the paths