Binary Data
- Programs usually store integers as fixed-width binary values rather than decimal strings
- Binary packing converts heterogeneous values into a compact byte sequence for storage or transmission
- A format descriptor tells the unpacker how to reconstruct each value from raw bytes
- Variable-length data like strings needs a fixed-width length header so the reader knows where each value ends
- Networks and files speak bytes, not strings
Packable Values
/-- A value that can be packed into binary form. -/
inductive Value where
| int32 : UInt32 → Value
| str : String → Value
deriving Repr, DecidableEq
/-- Format descriptor for unpacking: one tag per field in the byte stream. -/
inductive Fmt where
| int32 : Fmt
| str : Fmt
deriving Repr, DecidableEq
Valueis a sum type: each value is either a 32-bit unsigned integer or a UTF-8 stringFmtmirrorsValuebut carries no data — it is the blueprint used during unpackingderiving Reprlets Lean print values in#eval;deriving DecidableEqlets==work in#guardtests- Like Python's
structformat strings ("I"for unsigned 32-bit,"s"for bytes), but checked at compile time
Packing Integers
/-- Pack a 32-bit unsigned integer as 4 bytes, least-significant byte first (little-endian),
matching Python's struct.pack("<I", n). -/
def packInt32 (n : UInt32) : ByteArray :=
ByteArray.mk #[
(n &&& 0xFF).toUInt8,
((n >>> 8) &&& 0xFF).toUInt8,
((n >>> 16) &&& 0xFF).toUInt8,
((n >>> 24) &&& 0xFF).toUInt8
]
#guard packInt32 31 == ByteArray.mk #[0x1F, 0x00, 0x00, 0x00]
#guard packInt32 65 == ByteArray.mk #[0x41, 0x00, 0x00, 0x00]
- Little-endian byte order: least-significant byte first, matching most modern hardware
n &&& 0xFFmasks the lowest 8 bits;n >>> 8shifts right by one byte- Applying those two operations four times extracts each byte in turn
- Like Python's
struct.pack("<I", 31), which producesb'\x1f\x00\x00\x00' - The
#guardchecks confirm that 31 (0x1F) and 65 (0x41) each land in the first byte, with the remaining three bytes zero
Unpacking Integers
/-- Read a little-endian 32-bit unsigned integer from data at offset.
Returns the value and the offset of the next byte. -/
def unpackInt32 (data : ByteArray) (offset : Nat) : Option (UInt32 × Nat) :=
if offset + 4 ≤ data.size then
let b0 := data[offset]!.toUInt32
let b1 := data[offset + 1]!.toUInt32
let b2 := data[offset + 2]!.toUInt32
let b3 := data[offset + 3]!.toUInt32
some (b0 ||| (b1 <<< 8) ||| (b2 <<< 16) ||| (b3 <<< 24), offset + 4)
else
none
#guard unpackInt32 (packInt32 31) 0 == some (31, 4)
#guard unpackInt32 (packInt32 0xDEADBEEF) 0 == some (0xDEADBEEF, 4)
data[offset]!reads one byte, panicking if out of bounds (the bounds check comes first).toUInt32widens each byte before shifting to avoid truncation- OR-ing the shifted bytes reassembles the original 32-bit value
offset + 4advances past the four bytes just consumed, ready for the next field- The second
#guarduses0xDEADBEEFto verify that all four bytes survive the round trip, not just the low byte
Packing Strings
/-- Pack a String as a 4-byte little-endian length followed by its UTF-8 bytes,
matching Python's struct.pack("i", len(s)) + s.encode("utf-8"). -/
def packStr (s : String) : ByteArray :=
let bytes := s.toUTF8
packInt32 bytes.size.toUInt32 ++ bytes
#guard packStr "hi" == ByteArray.mk #[0x02, 0x00, 0x00, 0x00, 0x68, 0x69]
s.toUTF8converts theStringto aByteArrayusing UTF-8 encodingbytes.size.toUInt32converts the byte count fromNattoUInt32for the header- The 4-byte length header lets the reader know how many bytes of content follow
- Compatible with Python's
pack("i", len(s)) + s.encode("utf-8")fromvariable_packing.pyin the Python version of this lesson
Unpacking Strings
/-- Read a length-prefixed string from data at offset. -/
def unpackStr (data : ByteArray) (offset : Nat) : Option (String × Nat) :=
match unpackInt32 data offset with
| none => none
| some (len, next) =>
let n := len.toNat
if next + n ≤ data.size then
some (String.fromUTF8! (data.extract next (next + n)), next + n)
else
none
#guard unpackStr (packStr "hello") 0 == some ("hello", 9)
- Read the length first with
unpackInt32, then slice exactly that many bytes data.extract start stopreturns bytes fromstartup to but not includingstopString.fromUTF8!decodes UTF-8 bytes back into aString; it panics on invalid UTF-8 (DEBT: show the safeString.fromUTF8?variant)- The function can fail at two points: missing length header, or truncated body;
both return
none
Packing a List
/-- Pack a single Value into bytes. -/
def packValue : Value → ByteArray
| .int32 n => packInt32 n
| .str s => packStr s
/-- Pack a list of heterogeneous values into a ByteArray. -/
def pack (values : List Value) : ByteArray :=
values.foldl (init := ByteArray.empty) fun acc v => acc ++ packValue v
packValuedispatches on theValueconstructor and calls the right helperpackfolds over the list, appending each packed value to the accumulator- Order is preserved: the first value in the list becomes the first bytes in the array
- Like Python's
struct.pack(fmt, *values), but the format is inferred from the values themselves rather than written as a separate string
Unpacking a List
/-- Unpack values from a ByteArray given a list of format descriptors.
Returns none if the buffer is too short or malformed. -/
def unpack (fmts : List Fmt) (data : ByteArray) : Option (List Value) :=
let rec go : List Fmt → Nat → List Value → Option (List Value)
| [], _, acc => some acc.reverse
| .int32 :: rest, off, acc =>
match unpackInt32 data off with
| none => none
| some (n, off') => go rest off' (.int32 n :: acc)
| .str :: rest, off, acc =>
match unpackStr data off with
| none => none
| some (s, off') => go rest off' (.str s :: acc)
go fmts 0 []
gorecurses on the format list, consuming bytes starting atoff- Results accumulate in reverse order in
acc;acc.reversecorrects that at the end - If
unpackInt32orunpackStrreturnsnone,gopropagatesnoneimmediately - The format list must exactly match the structure of the packed data — there is no self-describing header (unlike JSON or pickle)
Round-Trip Tests
#guard
let values := [Value.int32 31, .int32 65]
unpack [.int32, .int32] (pack values) == some values
#guard
let values := [Value.str "hello", .str "Python"]
unpack [.str, .str] (pack values) == some values
#guard
let values := [Value.int32 31, .str "hello", .int32 65, .str "Python"]
unpack [.int32, .str, .int32, .str] (pack values) == some values
- Three tests: two integers only, two strings only, then a mixed list
#guardruns at compile time; a passing guard produces no output- If any
#guardfails, the file does not compile — there is no separate test runner
Example Program
def main : IO Unit := do
let values : List Value := [
.int32 31,
.int32 65,
.str "hello",
.str "Python"
]
let packed := pack values
IO.println s!"Packed {packed.size} bytes"
let fmts : List Fmt := [.int32, .int32, .str, .str]
match unpack fmts packed with
| none => IO.println "Error: unpack failed"
| some vs =>
for v in vs do
match v with
| .int32 n => IO.println s!" int32: {n}"
| .str s => IO.println s!" str: {s}"
import codebrings in the library from the same Lake package (DEBT: explain Lake)- Pack two integers followed by two strings; the total is 27 bytes (4 + 4 + 4+5 + 4+6)
- The format list
[.int32, .int32, .str, .str]must match the order used inpack - Run with
lake exe examplefrom thebinary/directory
Exercises
Pack a Bool (15 min)
Add a | bool : Bool → Value constructor to Value and a matching | bool : Fmt
constructor to Fmt. Represent true as the byte 1 and false as 0.
Implement packBool and unpackBool, then extend packValue and unpack to handle
the new constructor. Add #guard tests for round-trips of both true and false.
Detect Format Mismatch (15 min)
What does unpack [.str] (pack [Value.int32 42]) return, and why? Write a #guard
that confirms the answer, then explain in a comment why that behavior could be
dangerous in a real system.
Count Bytes (10 min)
Write a pure function packedSize (fmts : List Fmt) (values : List Value) : Nat
that computes the number of bytes pack values would produce without actually
calling pack. For int32, the size is always 4. For str, it is 4 plus the
byte length of the string's UTF-8 encoding. Add a #guard that checks the result
against (pack someValues).size for a mixed list.
Fix: Unpack Reads Big-Endian
-- unpackInt32 reads bytes in big-endian order, but packInt32 writes
-- little-endian. The values don't round-trip. Fix the shift order.
def packInt32 (n : UInt32) : ByteArray :=
ByteArray.mk #[
(n &&& 0xFF).toUInt8,
((n >>> 8) &&& 0xFF).toUInt8,
((n >>> 16) &&& 0xFF).toUInt8,
((n >>> 24) &&& 0xFF).toUInt8
]
def unpackInt32 (data : ByteArray) (offset : Nat) : Option (UInt32 × Nat) :=
if offset + 4 ≤ data.size then
let b0 := data[offset]!.toUInt32
let b1 := data[offset + 1]!.toUInt32
let b2 := data[offset + 2]!.toUInt32
let b3 := data[offset + 3]!.toUInt32
some ((b0 <<< 24) ||| (b1 <<< 16) ||| (b2 <<< 8) ||| b3, offset + 4)
else
none
#guard unpackInt32 (packInt32 31) 0 == some (31, 4)
hint
packInt32writes the least-significant byte first (little-endian):b0=n&0xFF, b1=(n>>>8)&0xFF, …unpackInt32reads the first byte as the most-significant (b0 <<< 24), which is big-endian- To match, unpack should read
b0as least-significant: no shift, thenb1 <<< 8,b2 <<< 16,b3 <<< 24
Fix: Pack Writes Big-Endian
-- packInt32 writes bytes in big-endian order (MSB first) instead of
-- the expected little-endian. Fix the byte order.
def packInt32 (n : UInt32) : ByteArray :=
ByteArray.mk #[
((n >>> 24) &&& 0xFF).toUInt8,
((n >>> 16) &&& 0xFF).toUInt8,
((n >>> 8) &&& 0xFF).toUInt8,
(n &&& 0xFF).toUInt8
]
def unpackInt32 (data : ByteArray) (offset : Nat) : Option (UInt32 × Nat) :=
if offset + 4 ≤ data.size then
let b0 := data[offset]!.toUInt32
let b1 := data[offset + 1]!.toUInt32
let b2 := data[offset + 2]!.toUInt32
let b3 := data[offset + 3]!.toUInt32
some (b0 ||| (b1 <<< 8) ||| (b2 <<< 16) ||| (b3 <<< 24), offset + 4)
else
none
#guard unpackInt32 (packInt32 31) 0 == some (31, 4)
hint
- The byte array puts
n>>>24first andn&0xFFlast, which is big-endian (most-significant byte first) - Little-endian means least-significant byte first: swap the array order
Fix: Empty Format List Returns None
-- unpack of an empty format list should return some [] but it returns
-- none. Fix the base case.
inductive Value where | int32 : UInt32 → Value | str : String → Value
inductive Fmt where | int32 : Fmt | str : Fmt
def unpack (fmts : List Fmt) (_data : ByteArray) : Option (List Value) :=
match fmts with
| [] => none
| _ :: _ => none -- stub; real implementation would recurse
#guard unpack [] (ByteArray.mk #[1, 2, 3]) == some []
hint
- Unpacking nothing from any byte array should succeed and return an empty list
- The base case
[] => noneshould be[] => some [] - Like reading zero fields from a buffer: there's nothing to fail
Fix: Unpack String Without Bounds Check
-- unpackStr doesn't check if the string body extends past the end of
-- the byte array. Add a bounds check so it returns none for truncated data.
def unpackInt32 (data : ByteArray) (offset : Nat) : Option (UInt32 × Nat) :=
if offset + 4 ≤ data.size then
let b0 := data[offset]!.toUInt32
let b1 := data[offset + 1]!.toUInt32
let b2 := data[offset + 2]!.toUInt32
let b3 := data[offset + 3]!.toUInt32
some (b0 ||| (b1 <<< 8) ||| (b2 <<< 16) ||| (b3 <<< 24), offset + 4)
else
none
def unpackStr (data : ByteArray) (offset : Nat) : Option (String × Nat) :=
match unpackInt32 data offset with
| none => none
| some (len, next) =>
let n := len.toNat
some (String.fromUTF8! (data.extract next (next + n)), next + n)
-- This should return none because the data is shorter than the length header claims.
#guard unpackStr (ByteArray.mk #[0xFF, 0x00, 0x00, 0x00]) 0 == none
hint
unpackInt32returns a length, but that length might point past the end of the buffer- After reading
len, check thatnext + n ≤ data.sizebefore callingdata.extract - If the bounds check fails, return
noneto signal corrupted data
Fix: Pack Value Truncates Length
-- packValue truncates the string length to 16 bits instead of 32 bits.
-- The length header is wrong for strings longer than 65535 bytes. Fix the
-- length type so it uses the full 32-bit range.
inductive Value where | int32 : UInt32 → Value | str : String → Value
def packInt32 (n : UInt32) : ByteArray :=
ByteArray.mk #[
(n &&& 0xFF).toUInt8,
((n >>> 8) &&& 0xFF).toUInt8,
((n >>> 16) &&& 0xFF).toUInt8,
((n >>> 24) &&& 0xFF).toUInt8
]
def packValue : Value → ByteArray
| .int32 n => packInt32 n
| .str s =>
let bytes := s.toUTF8
packInt32 (bytes.size.toUInt16.toUInt32) ++ bytes
-- A 4-character string should encode length=4, not truncated.
#guard packValue (.str "test") ==
ByteArray.mk #[0x04, 0x00, 0x00, 0x00, 0x74, 0x65, 0x73, 0x74]
hint
bytes.size.toUInt16converts the byte count to a 16-bit integer before widening back to 32 bits- This truncates any length larger than 65535
- Remove the
.toUInt16step and usebytes.size.toUInt32directly
Write: Hex Dump
-- Write a function that converts a ByteArray to a hex string.
-- Each byte becomes two hex characters (e.g., 0x1F -> "1f").
def hexDump (data : ByteArray) : String :=
""
def hexDigit (n : UInt8) : String :=
let chars := "0123456789abcdef"
s!"{chars[(n >>> 4).toNat]!}{chars[(n &&& 0xF).toNat]!}"
#guard hexDump (ByteArray.mk #[0x1F, 0xA0]) == "1fa0"
#guard hexDump ByteArray.empty == ""
#guard hexDump (ByteArray.mk #[0x00, 0xFF]) == "00ff"
hint
- Iterate over each byte in the
ByteArrayusing afoldlor recursion - Convert each byte to two hex characters using the provided
hexDigithelper - Concatenate results: start with
"", append each byte's hex representation
Write: Generate Format Descriptors
-- Write a function that takes a list of Values and produces
-- the matching list of Fmt descriptors.
inductive Value where | int32 : UInt32 → Value | str : String → Value
inductive Fmt where | int32 : Fmt | str : Fmt
def fmtsOf (vals : List Value) : List Fmt :=
[]
#guard fmtsOf [Value.int32 0, Value.str ""] == [Fmt.int32, Fmt.str]
#guard fmtsOf [Value.str "x", Value.int32 42, Value.str "y"] == [Fmt.str, Fmt.int32, Fmt.str]
#guard fmtsOf [] == []
hint
- Map each
Valueconstructor to its correspondingFmtconstructor Value.int32maps toFmt.int32,Value.strmaps toFmt.str- Use
List.mapwith a function that pattern-matches on theValue
Write: Compute Packed Size
-- Write a function that computes the number of bytes a list of Values
-- would occupy when packed, without actually calling 'pack'.
inductive Value where | int32 : UInt32 → Value | str : String → Value
def packedSize (vals : List Value) : Nat :=
0
#guard packedSize [Value.int32 42, Value.int32 65] == 8
#guard packedSize [Value.str "hi"] == 6 -- 4 (length) + 2 (bytes)
#guard packedSize [] == 0
hint
- Each
int32is always 4 bytes - Each
stris 4 bytes (length header) plus the UTF-8 byte length of the string - Use
List.foldlto accumulate the total, or map then sum - For string length:
s.toUTF8.size
Write: Unpack One Value
-- Write a convenience function that unpacks exactly one value.
-- It should wrap the 'unpack' function to only accept single-element
-- format lists and return the value directly.
-- Assume these definitions exist (from code.lean):
-- unpack : List Fmt → ByteArray → Option (List Value)
-- Write a wrapper that expects exactly one Fmt and returns Option Value.
inductive Value where | int32 : UInt32 → Value | str : String → Value
def unpackOne (fmt : Fmt) (data : ByteArray) : Option Value :=
none
inductive Fmt where | int32 : Fmt | str : Fmt
hint
- Create a single-element format list
[fmt]and callunpack - Match on the result:
some [v]returnssome v,some _ornonereturnsnone - Since
unpackis defined incode.lean, you'll need to write a self-contained version or assume it's available