Software Design by Example in Python 3: Finding Duplicate Files

Posted 2024-04-03

The first chapter of Software Design by Example looks inward at Python itself to show how objects and classes are really just fancy dictionaries (for some large value of “just”). Chapter 3: Finding Duplicate Files looks outward at the problems Python is used to solve, and in doing so, introduces some ideas about hashing, systems programming, and the conventions that Unix command-line tools are expected to respect.

The first and third of these ideas are well-defined, but “systems programming” turned out to be a slippery concept. For me, it means “doing things with a program that you’d normally do in the shell”, like making directories and renaming files or changing their permissions, but I discovered that other people reserve the term for more advanced problems like authentication and network configuration. I’m now working sporadically on a new tutorial to figure out where I think the topic’s boundaries are, which is proof yet again that every lesson leads to three others.

Concept map for finding duplicate files

Terms defined: big-oh notation, binary mode, bucket, collision (in hashing), cryptographic hash function, hash code, hash function, hexadecimal, SHA-256 (hash function), streaming API, time complexity.