Chapter 5: File Backup
Terms defined: Application Programming Interface, collision, comma-separated values, Coordinated Universal Time, cryptographic hash function, data migration, handler, hash code, hash function, JavaScript Object Notation, mock object, pipe, race condition, SHA-1 hash, stream, streaming API, Time of check/time of use, timestamp, version control system
Now that we can test software, we have something worth saving. A version control system like Git keeps track of changes to files so that we can recover old versions if we want to. Its core is a way to archive files that:
- records which versions of which files existed at the same time (so that we can go back to a consistent previous state), and
- stores any particular version of a file only once, so that we don’t waste disk space.
In this chapter we will build a tool for doing both tasks. It won’t do everything Git does: in particular, it won’t let us create and merge branches. If you would like to know how that works, please see Mary Rose Cook’s excellent Gitlet project.
Section 5.1: How can we uniquely identify files?
To avoid storing redundant copies of files, we need a way to tell when two files contain the same data. We can’t rely on names because files can be renamed or moved over time; we could compare the files byte-by-byte, but a quicker way is to use a hash function that turns arbitrary data into a fixed-length string of bits (Figure 5.1).
A hash function always produces the same hash code for a given input. A cryptographic hash function has two extra properties:
- The output depends on the entire input: changing even a single byte results in a different hash code.
- The outputs look like random numbers: they are unpredictable and evenly distributed (i.e., the odds of getting any specific hash code are the same).
It’s easy to write a bad hash function, but very hard to write one that qualifies as cryptographic. We will therefore use a library to calculate 160-bit SHA-1 hashes for our files. These are not random enough to keep data secret from a patient, well-funded attacker, but that’s not what we’re using them for: we just want hashes that are random enough to make collisions extremely unlikely.
The Birthday Problem
The odds that two people share a birthday are 1/365 (ignoring February 29). The odds that they don’t are therefore 364/365. When we add a third person, the odds that they don’t share a birthday with either of the preceding two people are 363/365, so the overall odds that nobody shares a birthday are (365/365)×(364/365)×(363/365). If we keep calculating, there’s a 50% chance of two people sharing a birthday in a group of just 23 people, and a 99.9% chance with 70 people.
We can use the same math to calculate how many files we need to hash before there’s a 50% chance of a collision. Instead of 365, we use \(2^{160}\) (the number of values that are 160 bits long), and after checking Wikipedia and doing a few calculations with Wolfram Alpha, we calculate that we would need to have approximately \(10^{24}\) files in order to have a 50% chance of a collision. We’re willing to take that risk.
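To sanity-check that figure, here is a minimal sketch (not part of our tool) using the standard approximation \(n \approx \sqrt{2N\ln 2}\) for the number of draws from \(N\) possible values needed for a 50% chance of at least one collision:

// approximate the number of files needed for a 50% chance of a SHA-1 collision
const N = 2 ** 160 // number of distinct 160-bit hash codes
const n = Math.sqrt(2 * N * Math.log(2))
console.log(n) // prints approximately 1.42e+24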
Node’s crypto module provides tools to create a SHA-1 hash. To use them, we create an object that keeps track of the current state of the hashing calculations, tell it how we want to encode (or represent) the hash value, and then feed it some bytes. When we are done, we call its .end method and then use its .read method to get the final result:
import crypto from 'crypto'
// create a SHA1 hasher
const hash = crypto.createHash('sha1')
// encode as hex (rather than binary)
hash.setEncoding('hex')
// send it some text
const text = process.argv[2]
hash.write(text)
// signal end of text
hash.end()
// display the result
const sha1sum = hash.read()
console.log(`SHA1 of "${text}" is ${sha1sum}`)
node hash-text.js something
SHA1 of "something" is 1af17e73721dbe0c40011b82ed4bb1a7dbe3ce29
Hashing a file instead of a fixed string is straightforward: we just read the file’s contents and pass those characters to the hashing object:
import fs from 'fs'
import crypto from 'crypto'
const filename = process.argv[2]
const data = fs.readFileSync(filename, 'utf-8')
const hash = crypto.createHash('sha1').setEncoding('hex')
hash.write(data)
hash.end()
const sha1sum = hash.read()
console.log(`SHA1 of "${filename}" is ${sha1sum}`)
node hash-file.js hash-file.js
SHA1 of "hash-file.js" is c54c8ee3e576770d29ae2d0d73568e5a5c49eac0
However, it is more efficient to process the file as a stream:
import fs from 'fs'
import crypto from 'crypto'
const filename = process.argv[2]
const hash = crypto.createHash('sha1').setEncoding('hex')
// pipe the file's contents into the hashing object chunk by chunk
fs.createReadStream(filename).pipe(hash)
// runs only after the whole stream has been processed
hash.on('finish', () => {
  const final = hash.read()
  console.log('final', final)
})
console.log('program ends')
node hash-stream.js hash-stream.js
program ends
final dc9e6c231e243860dace2dbf52845b121062b60e
This kind of interface is called a streaming API because it is designed to process a stream of data one chunk at a time rather than requiring all of the data to be in memory at once. Many applications use streams so that programs don’t have to read entire (possibly large) files into memory.
To start, this program asks the fs library to create a reading stream for a file and to pipe the data from that stream to the hashing object (Figure 5.2). It then tells the hashing object what to do when there is no more data by providing a handler for the “finish” event. This is called asynchronously: as the output shows, the main program ends before the task handling the end of data is scheduled and run. Most programs also provide a handler for “data” events to do something with each block of data as it comes in; the hash object in our program does that for us.
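If we wanted to process the chunks ourselves, we could attach our own handler. This small sketch (hypothetical, not part of our tool) counts the bytes in a file as they arrive:

import fs from 'fs'
const stream = fs.createReadStream(process.argv[2])
let total = 0
// runs once for every chunk the stream produces
stream.on('data', (chunk) => {
  total += chunk.length
})
// runs after the last chunk has been delivered
stream.on('end', () => {
  console.log(`read ${total} bytes`)
})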
Section 5.2: How can we back up files?
Many files only change occasionally after they’re created, or not at all. It would be wasteful for a version control system to make copies each time the user wanted to save a snapshot of a project, so instead our tool will copy each unique file to something like abcd1234.bck, where abcd1234 is a hash of the file’s contents. It will then store a data structure that records the filenames and hash keys for each snapshot. The hash keys tell it which unique files are part of the snapshot, while the filenames tell us what each file’s contents were called when the snapshot was made (since files can be moved or renamed). To restore a particular snapshot, all we have to do is copy the saved .bck files back to where they were (Figure 5.3).
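For example, a snapshot containing three files might be recorded like this (with shortened hashes for readability):

alpha.js,abcd1234
gamma.png,3456cdef
subdir/renamed.txt,bcde2345

No matter how many snapshots list the hash abcd1234, the file abcd1234.bck is only stored once.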
We can build the tools we need to do this using promises (Chapter 3). The main function creates a promise that uses the asynchronous version of glob to find files and then:

- checks that entries in the list are actually files;
- reads each file into memory; and
- calculates hashes for those files.
import fs from 'fs-extra-promise'
import glob from 'glob-promise'
import crypto from 'crypto'
const hashExisting = (rootDir) => {
const pattern = `${rootDir}/**/*`
return new Promise((resolve, reject) => {
glob(pattern, {})
.then(matches => Promise.all(
matches.map(path => statPath(path))))
.then(pairs => pairs.filter(
([path, stat]) => stat.isFile()))
.then(pairs => Promise.all(
pairs.map(([path, stat]) => readPath(path))))
.then(pairs => Promise.all(
pairs.map(([path, content]) => hashPath(path, content))))
.then(pairs => resolve(pairs))
.catch(err => reject(err))
})
}
This function uses Promise.all to wait for the operations on all of the files in the list to complete before going on to the next step. A different design would combine stat, read, and hash into a single step so that each file would be handled independently, and use one Promise.all at the end to bring them all together, as sketched below.
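A sketch of that alternative design, assuming the statPath, readPath, and hashPath helpers defined below (hashOne and hashExistingAlt are hypothetical names):

const hashOne = (path) => statPath(path)
  .then(([path, stat]) => stat.isFile()
    ? readPath(path).then(([path, content]) => hashPath(path, content))
    : null)
const hashExistingAlt = (rootDir) => glob(`${rootDir}/**/*`, {})
  .then(matches => Promise.all(matches.map(path => hashOne(path))))
  .then(pairs => pairs.filter(pair => pair !== null))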
The first two helper functions that hashExisting relies on wrap asynchronous operations in promises:
const statPath = (path) => {
return new Promise((resolve, reject) => {
fs.statAsync(path)
.then(stat => resolve([path, stat]))
.catch(err => reject(err))
})
}
const readPath = (path) => {
return new Promise((resolve, reject) => {
fs.readFileAsync(path, 'utf-8')
.then(content => resolve([path, content]))
.catch(err => reject(err))
})
}
The final helper function calculates the hash synchronously, but we can use Promise.all to wait on those operations finishing anyway:
const hashPath = (path, content) => {
const hasher = crypto.createHash('sha1').setEncoding('hex')
hasher.write(content)
hasher.end()
return [path, hasher.read()]
}
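This works because Promise.all treats any non-promise values in its input as already-settled promises and passes them through unchanged:

Promise.all([1, Promise.resolve(2), 'three'])
  .then(values => console.log(values)) // prints [ 1, 2, 'three' ]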
Let’s try running it:
import hashExisting from './hash-existing-promise.js'
const root = process.argv[2]
hashExisting(root).then(pairs => pairs.forEach(
([path, hash]) => console.log(path, hash)
))
node run-hash-existing-promise.js . | fgrep -v test/ | fgrep -v '~'
./backup.js 11422489e11be3d8ff76278503457665f6152ebe
./check-existing-files.js 66b933cf9e792e9a9204171d04e0f8b530ec3f4f
./figures/hash-function.pdf 0eb82de379a95ee2be3f00b38c0102e2f2f8170e
./figures/hash-function.svg 563996575d581f2a08e3e954d7faba4d189d0773
./figures/mock-fs.pdf 0b3bba44e69122ee53bcc9d777c186c84b7c2ff2
...
./x-from-to.md f0f63b3576042dfc0050029ddfcccc3c42fe275d
./x-io-streams.md 1fb4d8b7785c5e7b2f1e29588e2ba28d101ced1a
./x-json-manifests.md 223e0e4167acc6d4d81b76ba1287b90234c95e22
./x-mock-hashes.md 580edfc0cb8eaca4f3700307002ae10ee97af8d2
./x-pre-commit.md b7d945af4554fc0f64b708fe735417bee8b33eef
The code we have written is clearer than it would be with callbacks (try rewriting it if you don’t believe this), but the layer of promises around everything still obscures its meaning. The same operations are easier to read when written using async and await:
const statPath = async (path) => {
const stat = await fs.statAsync(path)
return [path, stat]
}
const readPath = async (path) => {
const content = await fs.readFileAsync(path, 'utf-8')
return [path, content]
}
const hashPath = (path, content) => {
const hasher = crypto.createHash('sha1').setEncoding('hex')
hasher.write(content)
hasher.end()
return [path, hasher.read()]
}
const hashExisting = async (rootDir) => {
const pattern = `${rootDir}/**/*`
const options = {}
const matches = await glob(pattern, options)
const stats = await Promise.all(matches.map(path => statPath(path)))
const files = stats.filter(([path, stat]) => stat.isFile())
const contents = await Promise.all(
files.map(([path, stat]) => readPath(path)))
const hashes = contents.map(
([path, content]) => hashPath(path, content))
return hashes
}
This version creates and resolves exactly the same promises as the previous one, but those promises are created for us automatically by Node. To check that it works, let’s run it for the same input files:
import hashExisting from './hash-existing-async.js'
const root = process.argv[2]
hashExisting(root).then(
pairs => pairs.forEach(([path, hash]) => console.log(path, hash)))
node run-hash-existing-async.js . | fgrep -v test/ | fgrep -v '~'
./backup.js 11422489e11be3d8ff76278503457665f6152ebe
./check-existing-files.js 66b933cf9e792e9a9204171d04e0f8b530ec3f4f
./figures/hash-function.pdf 0eb82de379a95ee2be3f00b38c0102e2f2f8170e
./figures/hash-function.svg 563996575d581f2a08e3e954d7faba4d189d0773
./figures/mock-fs.pdf 0b3bba44e69122ee53bcc9d777c186c84b7c2ff2
...
./x-from-to.md f0f63b3576042dfc0050029ddfcccc3c42fe275d
./x-io-streams.md 1fb4d8b7785c5e7b2f1e29588e2ba28d101ced1a
./x-json-manifests.md 223e0e4167acc6d4d81b76ba1287b90234c95e22
./x-mock-hashes.md 580edfc0cb8eaca4f3700307002ae10ee97af8d2
./x-pre-commit.md b7d945af4554fc0f64b708fe735417bee8b33eef
Section 5.3: How can we track which files have already been backed up?
The second part of our backup tool keeps track of which files have and haven’t been backed up already. It stores backups in a directory that contains backup files like abcd1234.bck and files describing the contents of particular snapshots. The latter are named ssssssssss.csv, where ssssssssss is the UTC timestamp of the backup’s creation and the .csv extension indicates that the file is formatted as comma-separated values. (We could store these files as JSON, but CSV is easier for people to read.)
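A sketch of how such a name can be generated (this mirrors what the backup function later in this chapter does):

// seconds since the Unix epoch, zero-padded to ten digits
const timestamp = String(Math.round(Date.now() / 1000)).padStart(10, '0')
const manifestName = `${timestamp}.csv`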
Time of check/time of use
Our naming convention for index files will fail if we try to create more than one backup per second. This might seem very unlikely, but many faults and security holes are the result of programmers assuming things weren’t going to happen.
We could try to avoid this problem by using a two-part naming scheme ssssssss-a.csv, ssssssss-b.csv, and so on, but this leads to a race condition called time of check/time of use. If two users run the backup tool at the same time, they will both see that there isn’t a file (yet) with the current timestamp, so they will both try to create the first one.
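In outline, the race looks like this (fileExists, nextSuffix, and createFile are hypothetical helpers, not real library calls):

// both users can pass the existence check before either creates the file
let suffix = 'a'
while (fileExists(`${timestamp}-${suffix}.csv`)) { // time of check
  suffix = nextSuffix(suffix)
}
createFile(`${timestamp}-${suffix}.csv`) // time of use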
The findNew function builds a map from hash codes to filenames, removes the entries whose hashes already have a .bck file in the backup directory, and returns what is left, i.e., the files that need to be backed up:

import glob from 'glob-promise'
import path from 'path'
const findNew = async (rootDir, pathHashPairs) => {
const hashToPath = pathHashPairs.reduce((obj, [path, hash]) => {
obj[hash] = path
return obj
}, {})
const pattern = `${rootDir}/*.bck`
const options = {}
const existingFiles = await glob(pattern, options)
existingFiles.forEach(filename => {
const stripped = path.basename(filename).replace(/\.bck$/, '')
delete hashToPath[stripped]
})
return hashToPath
}
export default findNew
To test our program, let’s manually create testing directories with manufactured (shortened) hashes:
tree test
test
├── bck-0-csv-0
├── bck-1-csv-1
│ ├── 0001.csv
│ └── abcd1234.bck
├── bck-4-csv-2
│ ├── 0001.csv
│ ├── 3028.csv
│ ├── 3456cdef.bck
│ ├── abcd1234.bck
│ └── bcde2345.bck
├── test-backup.js
├── test-find-mock.js
└── test-find.js
3 directories, 10 files
We use Mocha to manage our tests. Every test is an async function; Mocha automatically waits for them all to complete before reporting results. To run them, we add the line:

"test": "mocha */test/test-*.js"

in the scripts section of our project’s package.json file, so that when we run npm run test, Mocha looks for files in test sub-directories of the directories holding our lessons.
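In context, the relevant part of package.json looks something like this:

{
  "scripts": {
    "test": "mocha */test/test-*.js"
  }
}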
Here are our first few tests:
import assert from 'assert'
import findNew from '../check-existing-files.js'
describe('pre-existing hashes and actual filesystem', () => {
it('finds no pre-existing files when none given or exist', async () => {
const expected = {}
const actual = await findNew('file-backup/test/bck-0-csv-0', [])
assert.deepStrictEqual(expected, actual,
'Expected no files')
})
it('finds some files when one is given and none exist', async () => {
const check = [['somefile.txt', '9876fedc']]
const expected = { '9876fedc': 'somefile.txt' }
const actual = await findNew('file-backup/test/bck-0-csv-0', check)
assert.deepStrictEqual(expected, actual,
'Expected one file')
})
it('finds nothing needs backup when there is a match', async () => {
const check = [['alpha.js', 'abcd1234']]
const expected = {}
const actual = await findNew('file-backup/test/bck-1-csv-1', check)
assert.deepStrictEqual(expected, actual,
'Expected no files')
})
it('finds something needs backup when there is a mismatch', async () => {
const check = [['alpha.js', 'a1b2c3d4']]
const expected = { a1b2c3d4: 'alpha.js' }
const actual = await findNew('file-backup/test/bck-1-csv-1', check)
assert.deepStrictEqual(expected, actual,
'Expected one file')
})
it('finds mixed matches', async () => {
const check = [
['matches.js', '3456cdef'],
['matches.txt', 'abcd1234'],
['mismatch.txt', '12345678']
]
const expected = { 12345678: 'mismatch.txt' }
const actual = await findNew('file-backup/test/bck-4-csv-2', check)
assert.deepStrictEqual(expected, actual,
'Expected one file')
})
})
and here is Mocha’s report:
> stjs@1.0.0 test
> mocha */test/test-*.js "-g" "pre-existing hashes"
pre-existing hashes and actual filesystem
✓ finds no pre-existing files when none given or exist
✓ finds some files when one is given and none exist
✓ finds nothing needs backup when there is a match
✓ finds something needs backup when there is a mismatch
✓ finds mixed matches
5 passing (16ms)
Section 5.4: How can we test code that modifies files?
The final thing our tool needs to do is copy the files that need copying and create a new index file. The code itself will be relatively simple, but testing will be complicated by the fact that our tests will need to create directories and files before they run and then delete them afterward (so that they don’t contaminate subsequent tests).
A better approach is to use a mock object instead of the real filesystem. A mock object has the same interface as the function, object, class, or library that it replaces, but is designed to be used solely for testing. Node’s mock-fs library provides the same functions as the fs library, but stores everything in memory (Figure 5.4). This prevents our tests from accidentally disturbing the filesystem, and also makes tests much faster (since in-memory operations are thousands of times faster than operations that touch the disk).
We can create a mock filesystem by giving the library a JSON description of the files and what they should contain:
import assert from 'assert'
import mock from 'mock-fs'
import findNew from '../check-existing-files.js'
describe('checks for pre-existing hashes using mock filesystem', () => {
beforeEach(() => {
mock({
'bck-0-csv-0': {},
'bck-1-csv-1': {
'0001.csv': 'alpha.js,abcd1234',
'abcd1234.bck': 'alpha.js content'
},
'bck-4-csv-2': {
'0001.csv': ['alpha.js,abcd1234',
'beta.txt,bcde2345'].join('\n'),
'3024.csv': ['alpha.js,abcd1234',
'gamma.png,3456cdef',
'subdir/renamed.txt,bcde2345'].join('\n'),
'3456cdef.bck': 'gamma.png content',
'abcd1234.bck': 'alpha content',
'bcde2345.bck': 'beta.txt became subdir/renamed.txt'
}
})
})
afterEach(() => {
mock.restore()
})
})
Mocha automatically calls beforeEach before running each test, and afterEach after each test completes (which is yet another protocol). All of the tests stay exactly the same, and since mock-fs replaces the functions in the standard fs library with its own, nothing in our application needs to change either.
We are finally ready to write the program that actually backs up files:
import fs from 'fs-extra-promise'
import hashExisting from './hash-existing-async.js'
import findNew from './check-existing-files.js'
const backup = async (src, dst, timestamp = null) => {
if (timestamp === null) {
timestamp = Math.round((new Date()).getTime() / 1000)
}
timestamp = String(timestamp).padStart(10, '0')
const existing = await hashExisting(src)
const needToCopy = await findNew(dst, existing)
await copyFiles(dst, needToCopy)
await saveManifest(dst, timestamp, existing)
}
const copyFiles = async (dst, needToCopy) => {
  const promises = Object.keys(needToCopy).map(hash => {
    const srcPath = needToCopy[hash]
    const dstPath = `${dst}/${hash}.bck`
    // return the promise so that Promise.all waits for the copy
    return fs.copyFileAsync(srcPath, dstPath)
  })
  return Promise.all(promises)
}
const saveManifest = async (dst, timestamp, pathHash) => {
  pathHash = pathHash.sort()
  const content = pathHash.map(
    ([path, hash]) => `${path},${hash}`).join('\n')
  const manifest = `${dst}/${timestamp}.csv`
  // return the promise so that callers wait for the write to finish
  return fs.writeFileAsync(manifest, content, 'utf-8')
}
export default backup
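A hypothetical driver script shows how backup might be invoked from the command line:

import backup from './backup.js'
// usage: node run-backup.js <sourceDir> <backupDir>
const [src, dst] = process.argv.slice(2)
backup(src, dst).then(() => console.log('backup complete'))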
The tests for this are more complicated than tests we have written previously because we want to check the results against actual file hashes. Let’s set up some fixtures to run tests on:

import assert from 'assert'
import crypto from 'crypto'
import glob from 'glob-promise'
import fs from 'fs-extra-promise'
import mock from 'mock-fs'
import backup from '../backup.js'
const hashString = (data) => {
const hasher = crypto.createHash('sha1').setEncoding('hex')
hasher.write(data)
hasher.end()
return hasher.read()
}
const Contents = {
aaa: 'AAA',
bbb: 'BBB',
ccc: 'CCC'
}
const Hashes = Object.keys(Contents).reduce((obj, key) => {
obj[key] = hashString(Contents[key])
return obj
}, {})
const Fixture = {
source: {
'alpha.txt': Contents.aaa,
'beta.txt': Contents.bbb,
gamma: {
'delta.txt': Contents.ccc
}
},
backup: {}
}
const InitialBackups = Object.keys(Hashes).reduce((set, filename) => {
set.add(`backup/${Hashes[filename]}.bck`)
return set
}, new Set())
and then run some tests:
describe('check entire backup process', () => {
beforeEach(() => {
mock(Fixture)
})
afterEach(() => {
mock.restore()
})
it('creates an initial CSV manifest', async () => {
await backup('source', 'backup', 0)
assert.strictEqual((await glob('backup/*')).length, 4,
'Expected 4 files')
const actualBackups = new Set(await glob('backup/*.bck'))
assert.deepStrictEqual(actualBackups, InitialBackups,
'Expected 3 backup files')
const actualManifests = await glob('backup/*.csv')
assert.deepStrictEqual(actualManifests, ['backup/0000000000.csv'],
'Expected one manifest')
})
it('does not duplicate files unnecessarily', async () => {
await backup('source', 'backup', 0)
assert.strictEqual((await glob('backup/*')).length, 4,
'Expected 4 files after first backup')
await backup('source', 'backup', 1)
assert.strictEqual((await glob('backup/*')).length, 5,
'Expected 5 files after second backup')
const actualBackups = new Set(await glob('backup/*.bck'))
assert.deepStrictEqual(actualBackups, InitialBackups,
'Expected 3 backup files after second backup')
const actualManifests = (await glob('backup/*.csv')).sort()
assert.deepStrictEqual(actualManifests,
['backup/0000000000.csv', 'backup/0000000001.csv'],
'Expected two manifests')
})
it('adds a file as needed', async () => {
await backup('source', 'backup', 0)
assert.strictEqual((await glob('backup/*')).length, 4,
'Expected 4 files after first backup')
await fs.writeFileAsync('source/newfile.txt', 'NNN')
const hashOfNewFile = hashString('NNN')
await backup('source', 'backup', 1)
assert.strictEqual((await glob('backup/*')).length, 6,
'Expected 6 files after second backup')
const expected = new Set(InitialBackups)
.add(`backup/${hashOfNewFile}.bck`)
const actualBackups = new Set(await glob('backup/*.bck'))
assert.deepStrictEqual(actualBackups, expected,
'Expected 4 backup files after second backup')
const actualManifests = (await glob('backup/*.csv')).sort()
assert.deepStrictEqual(actualManifests,
['backup/0000000000.csv', 'backup/0000000001.csv'],
'Expected two manifests')
})
})
> stjs@1.0.0 test
> mocha */test/test-*.js "-g" "check entire backup process"
check entire backup process
✓ creates an initial CSV manifest
✓ does not duplicate files unnecessarily
✓ adds a file as needed
3 passing (18ms)
Design for test
One of the best ways—maybe the best way—to evaluate software design is by thinking about testability [Feathers2004]. We were able to use a mock filesystem instead of a real one because the filesystem has a well-defined API that is provided to us in a single library, so replacing it is a matter of changing one thing in one place. If you have to change several parts of your code in order to test it, the code is telling you to consolidate those parts into one component.
Section 5.5: Exercises
Odds of collision
If hashes were only 2 bits long, then the chances of collision with each successive file assuming no previous collision are:
| Number of Files | Odds of Collision |
| --- | --- |
| 1 | 0% |
| 2 | 25% |
| 3 | 50% |
| 4 | 75% |
| 5 | 100% |
A colleague of yours says this means that if we hash four files, there’s only a 75% chance of any collision occurring. What are the actual odds?
Streaming I/O
Write a small program using fs.createReadStream and fs.createWriteStream that copies a file piece-by-piece instead of reading it into memory and then writing it out again.
Sequencing backups
Modify the backup program so that manifests are numbered sequentially as 00000001.csv, 00000002.csv, and so on rather than being timestamped. Why doesn’t this solve the time of check/time of use race condition mentioned earlier?
JSON manifests
- Modify backup.js so that it can save JSON manifests as well as CSV manifests based on a command-line flag.
- Write another program called migrate.js that converts a set of manifests from CSV to JSON. (The program’s name comes from the term data migration.)
- Modify backup.js so that each manifest stores the user name of the person who created it along with file hashes, and then modify migrate.js to transform old files into the new format.
Mock hashes
- Modify the file backup program so that it uses a function called ourHash to hash files.
- Create a replacement that returns some predictable value, such as the first few characters of the data.
- Rewrite the tests to use this function.
How did you modify the main program so that the tests could control which hashing function is used?
Comparing manifests
Write a program compare-manifests.js that reads two manifest files and reports:

- Which files have the same names but different hashes (i.e., their contents have changed).
- Which files have the same hashes but different names (i.e., they have been renamed).
- Which files are in the first manifest but neither their names nor their hashes are in the second (i.e., they have been deleted).
- Which files are in the second manifest but neither their names nor their hashes are in the first (i.e., they have been added).
From one state to another
- Write a program called from-to.js that takes the name of a directory and the name of a manifest file as its command-line arguments, then adds, removes, and/or renames files in the directory to restore the state described in the manifest. The program should only perform file operations when it needs to, e.g., it should not delete a file and re-add it if the contents have not changed.
- Write some tests for from-to.js using Mocha and mock-fs.
File history
- Write a program called file-history.js that takes the name of a file as a command-line argument and displays the history of that file by tracing it back in time through the available manifests.
- Write tests for your program using Mocha and mock-fs.
Pre-commit hooks
Modify backup.js to load and run a function called preCommit from a file called pre-commit.js stored in the root directory of the files being backed up. If preCommit returns true, the backup proceeds; if it returns false or throws an exception, no backup is created.