Chapter 6: Data Tables
Terms defined: character encoding, column-major storage, data frame, fixed-width (of strings), garbage collection, heterogeneous, homogeneous, immutable, index (in a database), JavaScript Object Notation, join, pad (a string), row-major storage, sparse matrix, SQL, tagged data, test harness
Chapter 2 said that operations in memory are thousands of times faster than operations that touch the filesystem, but what about different in-memory operations—how do they compare with each other? Putting it another way, how can we tell which of several designs is going to be the most efficient?
The best answer is to conduct some experiments. To see how to do this, we will take a look at several ways to implement data tables with one or more named columns and zero or more rows. Each row has one value for each column, and all the values in a column have the same type (Figure 6.1). Data tables appear over and over again in programming, from spreadsheets and databases to the data frames in R’s tidyverse packages, Python’s Pandas library, or the DataForge library for JavaScript [Davis2018].
The key operations on data tables are those provided by SQL: filter, select, summarize, and join. These can be implemented in about 500 lines of code, but their performance depends on how the data table is stored.
Section 6.1: How can we implement data tables?
One way to store a table is row-major order, in which the values in each row are stored together in memory. This is sometimes also called heterogeneous storage because each “unit” of storage can contain values of different types. We can implement this design in JavaScript using an array of objects, each of which has the same keys (Figure 6.2).
Another option is column-major or homogeneous order, in which all the values in a column are stored together. In JavaScript, this could be implemented using an object whose members are all arrays of the same length.
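For example, a small table with two hypothetical columns (the names left and right are made up for this illustration) could be laid out either way:

// Row-major: an array of objects, one object per row.
const rowMajor = [
  { left: 1, right: 10 },
  { left: 2, right: 20 },
  { left: 3, right: 30 }
]

// Column-major: an object of arrays, one array per column.
const colMajor = {
  left: [1, 2, 3],
  right: [10, 20, 30]
}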
To find out which is better we will construct one of each, try some operations, record their execution times and memory use, and then compare them. Crucially, the answer will depend on both the implementations themselves and on what mix of operations we measure. For example, if one strategy works better for filter and another for select, the ratio of filters to selects may determine which is “best”.
Immutability
All of our implementations will treat each data table as immutable: once we have created it, we will not modify its contents. This doesn’t actually have much impact on performance and makes the programming easier and safer, since shared data structures are a rich source of bugs.
For our first experiment, let’s build a row-major table with some number of columns. To keep it simple, we will use the row indexes to fill the table:
export const buildRows = (nRows, labels) => {
const result = []
for (let iR = 0; iR < nRows; iR += 1) {
const row = {}
labels.forEach(label => {
row[label] = iR
})
result.push(row)
}
return result
}
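As a quick check, building a two-row table with two columns produces rows whose values are just the row indexes:

console.log(buildRows(2, ['label_1', 'label_2']))
// [ { label_1: 0, label_2: 0 }, { label_1: 1, label_2: 1 } ]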
Next, we write filter and select for tables laid out this way. We need to provide a callback function to filter to determine which rows to keep, like the callback for Array.filter; for selecting columns, we provide a list of the keys that identify the columns we want to keep. We expect filtering to be relatively fast, since it recycles existing rows, while selecting should be relatively slow because we have to construct a new object for every row (Figure 6.3).
const rowFilter = (table, func) => {
return table.filter(row => func(row))
}
const rowSelect = (table, toKeep) => {
return table.map(row => {
const newRow = {}
toKeep.forEach(label => {
newRow[label] = row[label]
})
return newRow
})
}
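To see these in action, here is a small usage sketch (not part of the chapter’s test harness) that keeps the even-valued rows and then keeps a single column:

const table = buildRows(3, ['label_1', 'label_2'])

// Keep rows whose label_1 value is even.
const filtered = rowFilter(table, row => (row.label_1 % 2) === 0)
// [ { label_1: 0, label_2: 0 }, { label_1: 2, label_2: 2 } ]

// Keep only the label_1 column.
const selected = rowSelect(table, ['label_1'])
// [ { label_1: 0 }, { label_1: 1 }, { label_1: 2 } ]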
Now let’s do the same for column-major storage. Building the object that holds the columns is straightforward:
export const buildCols = (nRows, labels) => {
const result = {}
labels.forEach(label => {
result[label] = []
for (let iR = 0; iR < nRows; iR += 1) {
result[label].push(iR)
}
})
return result
}
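Again as a quick check, the same two-row, two-column table now comes out as one array per column:

console.log(buildCols(2, ['label_1', 'label_2']))
// { label_1: [ 0, 1 ], label_2: [ 0, 1 ] }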
Filtering is more complex because the values in each row are scattered across several arrays, but selecting is just a matter of recycling the arrays we want in the new table. We expect selecting to be relatively fast, since only the references to the columns need to be copied, but filtering will be relatively slow since we are constructing multiple new arrays (Figure 6.4).

Note that this code assumes there is a column called label_1 that it can look at to find out how many rows there are in the table. This would be a horrible thing to put into production, but is good enough for some simple performance testing.
const colFilter = (table, func) => {
const result = {}
const labels = Object.keys(table)
labels.forEach(label => {
result[label] = []
})
for (let iR = 0; iR < table.label_1.length; iR += 1) {
if (func(table, iR)) {
labels.forEach(label => {
result[label].push(table[label][iR])
})
}
}
return result
}
const colSelect = (table, toKeep) => {
const result = {}
toKeep.forEach(label => {
result[label] = table[label]
})
return result
}
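A usage sketch mirroring the row-major one above (again, not part of the test harness):

const table = buildCols(3, ['label_1', 'label_2'])

// Keep rows whose label_1 value is even; note that the callback receives the
// whole table and a row index rather than a row object.
const filtered = colFilter(table, (t, iR) => (t.label_1[iR] % 2) === 0)
// { label_1: [ 0, 2 ], label_2: [ 0, 2 ] }

// Keep only the label_1 column; the existing array is reused, not copied.
const selected = colSelect(table, ['label_1'])
// { label_1: [ 0, 1, 2 ] }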
Not quite polymorphic
Our tests would be simpler to write if the two versions of filter and select took exactly the same parameters, but the row-testing functions for filter are different because of the differences in the ways the tables are stored. We could force them to be the same by (for example) packing the values for each row in the column-major implementation into a temporary object and passing that to the same filtering function we used for the row-major implementation, but that extra work would bias the performance comparison in row-major’s favor.
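For the record, the adapter we decided not to use might look something like this sketch (the name colFilterAsRows is ours, not the chapter’s):

// Sketch of the rejected approach: build a temporary object for each row so
// that the row-major callback can be reused unchanged on a column-major table.
const colFilterAsRows = (table, rowFunc) => {
  const labels = Object.keys(table)
  const result = {}
  labels.forEach(label => {
    result[label] = []
  })
  // Like colFilter, this assumes a column called label_1 exists.
  for (let iR = 0; iR < table.label_1.length; iR += 1) {
    const tempRow = {}
    labels.forEach(label => {
      tempRow[label] = table[label][iR]
    })
    if (rowFunc(tempRow)) {
      labels.forEach(label => {
        result[label].push(table[label][iR])
      })
    }
  }
  return result
}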
Section 6.2: How can we test the performance of our implementations?
Now that we have our tables and operations, we can build a test harness to run those operations on data tables of varying sizes. We arbitrarily decide to keep half of the columns and one-third of the rows; these ratios will affect our decision about which is better, so if we were doing this for a real application we would test what happens as these fractions vary. And as we said earlier, the relative performance will also depend on how many filters we do for each select; our balance should be based on data from whatever application we intend to support.
Our performance measurement program looks like this:
const RANGE = 3
const main = () => {
const nRows = parseInt(process.argv[2])
const nCols = parseInt(process.argv[3])
const filterPerSelect = parseFloat(process.argv[4])
const labels = [...Array(nCols).keys()].map(i => `label_${i + 1}`)
const someLabels = labels.slice(0, Math.floor(labels.length / 2))
assert(someLabels.length > 0,
'Must have some labels for select (array too short)')
const [rowTable, rowSize, rowHeap] = memory(buildRows, nRows, labels)
const [colTable, colSize, colHeap] = memory(buildCols, nRows, labels)
const rowFilterTime =
time(rowFilter, rowTable,
row => ((row.label_1 % RANGE) === 0))
const rowSelectTime =
time(rowSelect, rowTable, someLabels)
const colFilterTime =
time(colFilter, colTable,
(table, iR) => ((table.label_1[iR] % RANGE) === 0))
const colSelectTime =
time(colSelect, colTable, someLabels)
const ratio = calculateRatio(filterPerSelect,
rowFilterTime, rowSelectTime,
colFilterTime, colSelectTime)
const result = {
nRows,
nCols,
filterPerSelect,
rowSize,
rowHeap,
colSize,
colHeap,
rowFilterTime,
rowSelectTime,
colFilterTime,
colSelectTime,
ratio
}
console.log(yaml.safeDump(result))
}
The functions that actually do the measurements use the microtime library to get microsecond-level timing, because JavaScript’s Date only gives us millisecond-level resolution. We use object-sizeof to estimate how much memory our structures require; we also call process.memoryUsage() and look at the heapUsed value to see how much memory Node is using while the program runs, but that may be affected by garbage collection and a host of other factors outside our control. Finally, calculateRatio weights the filter and select times by the filter-to-select ratio and divides the row-major total by the column-major total, so a value greater than 1 means that column-major storage is faster for that workload.
const memory = (func, ...params) => {
const before = process.memoryUsage()
const result = func(...params)
const after = process.memoryUsage()
const heap = after.heapUsed - before.heapUsed
const size = sizeof(result)
return [result, size, heap]
}
const time = (func, ...params) => {
const before = microtime.now()
func(...params)
const after = microtime.now()
return after - before
}
const calculateRatio = (f2S, rFilterT, rSelectT, cFilterT, cSelectT) => {
return ((f2S * rFilterT) + rSelectT) / ((f2S * cFilterT) + cSelectT)
}
Let’s run our program for a table with 100 rows and 3 columns and a 3:1 ratio of filter to select:
node table-performance.js 100 3 3
nRows: 100
nCols: 3
filterPerSelect: 3
rowSize: 6600
rowHeap: 26512
colSize: 2442
colHeap: 8536
rowFilterTime: 75
rowSelectTime: 111
colFilterTime: 137
colSelectTime: 48
ratio: 0.7320261437908496
What if we increase the table size to 10,000 rows by 30 columns with the same 3:1 filter/select ratio?
nRows: 10000
nCols: 30
filterPerSelect: 3
rowSize: 7020000
rowHeap: 18392064
colSize: 2400462
colHeap: -3473800
rowFilterTime: 2929
rowSelectTime: 15863
colFilterTime: 4529
colSelectTime: 104
ratio: 1.8004528522386969
And what if we keep the table size the same but use a 10:1 filter/select ratio?
nRows: 10000
nCols: 30
filterPerSelect: 10
rowSize: 7020000
rowHeap: 18287160
colSize: 2400462
colHeap: -3645056
rowFilterTime: 2376
rowSelectTime: 15566
colFilterTime: 4380
colSelectTime: 90
ratio: 0.8960127591706539
Table 6.1: Relative performance of row-major and column-major storage. Column headings give rows × columns and the filter:select ratio for each run; times are in microseconds.

value | 100×3, 3:1 | 10,000×30, 3:1 | 10,000×30, 10:1 |
---|---|---|---|
nRows | 100 | 10000 | 10000 |
nCols | 3 | 30 | 30 |
filterPerSelect | 3 | 3 | 10 |
rowFilterTime | 75 | 2929 | 2376 |
rowSelectTime | 111 | 15863 | 15566 |
colFilterTime | 137 | 4529 | 4380 |
colSelectTime | 48 | 104 | 90 |
The results in Table 6.1 show that column-major storage is better. It uses less memory (presumably because column labels aren’t duplicated once per row), and the time required to construct new objects when doing select with row-major storage outweighs the cost of appending to arrays when doing filter with column-major storage. Unfortunately, the code for column-major storage is a little more complicated to write, which is a cost that doesn’t show up in experiments.
Section 6.3: What is the most efficient way to save a table?
Data is valuable, so we are going to store data tables in files of some kind. If one storage scheme is much more efficient than another and we are reading or writing frequently, that could change our mind about which implementation to pick.
Two simple text-based schemes are row-oriented and column-oriented JSON—basically, just printing the data structures we have. Let’s run the 10,000×30 test:
nRows: 10000
nCols: 30
rowStringTime: 57342
rowStringSize: 9393402
colStringTime: 13267
colStringSize: 2934164
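The conversion being timed here is nothing more than a call to JSON.stringify on each table. A minimal sketch of the measurement, reusing the time helper from Section 6.2 (the chapter’s actual measurement code is not shown):

const asJson = (table) => JSON.stringify(table)

const rowStringTime = time(asJson, rowTable)   // row-major table from earlier
const colStringTime = time(asJson, colTable)   // column-major table from earlier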
The time needed for the row-major version is more than four times greater than that needed for the column-major version; we assume that the redundant printing of the labels is mostly to blame, just as redundant storage of the labels was to blame for row-major’s greater memory requirements.
If that diagnosis is correct, then a packed version of row-major storage ought to be faster. We save the column headers once, then copy the data values into an array of arrays and save that:
const asPackedJson = (table) => {
const temp = {}
temp.keys = Object.keys(table[0])
temp.values = table.map(row => temp.keys.map(k => row[k]))
return JSON.stringify(temp)
}
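For the tiny two-row table used as an illustration earlier, the packed form looks like this:

console.log(asPackedJson(buildRows(2, ['label_1', 'label_2'])))
// {"keys":["label_1","label_2"],"values":[[0,0],[1,1]]}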
nRows: 10000
nCols: 30
packedRowStringTime: 29659
packedRowStringSize: 2974084
These results show that changing the data’s layout before saving it is faster than turning the structure we already have into a string. Again, we assume this is because copying data takes less time than turning labels into strings over and over, but column-major storage is still the best approach.
Section 6.4: Does binary storage improve performance?
Let’s try one more strategy for storing our tables. JavaScript stores values in tagged data structures: some bits define the value’s type while other bits store the value itself in a type-dependent way (Figure 6.5).
We can save space by keeping track of the types ourselves and just storing the bits that represent the values. JavaScript has an ArrayBuffer class for exactly this purpose. It stores any value we want as a set of bits; we then access those bits through a view that presents the data as a particular type, such as Boolean (one byte per value) or number (64 bits per number). As Figure 6.6 shows, we can mix different types of data in a single ArrayBuffer, but it’s up to us to keep track of which bytes belong to which values.
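As a small illustration (not taken from the chapter’s code), a single 16-byte buffer can hold two 32-bit integers followed by one 64-bit float, each accessed through its own view:

const buffer = new ArrayBuffer(16)
const counts = new Uint32Array(buffer, 0, 2)   // bytes 0-7: two 32-bit integers
const values = new Float64Array(buffer, 8, 1)  // bytes 8-15: one 64-bit float
counts[0] = 3
counts[1] = 4
values[0] = 2.5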
To store a column-major table we will fill an ArrayBuffer with:

- Two integers that hold the table’s size (number of rows and number of columns).
- A string with the column labels joined by newline characters. (We use newlines as a separator because we assume column labels can’t contain them.)
- The numbers themselves.
const asBinary = (table) => {
const labels = Object.keys(table)
const nCols = labels.length
const nRows = table[labels[0]].length
const dimensions = new Uint32Array([nCols, nRows])
const allLabels = labels.join('\n')
const encoder = new TextEncoder()
const encodedLabels = encoder.encode(allLabels)
const dataSize = sizeof(0) * nCols * nRows
const totalSize =
dimensions.byteLength + encodedLabels.byteLength + dataSize
const buffer = new ArrayBuffer(totalSize)
const result = new Uint8Array(buffer)
// Copy raw bytes: calling set() with a typed-array source converts each
// element to a single byte instead of copying its underlying bytes.
result.set(new Uint8Array(dimensions.buffer), 0)
result.set(encodedLabels, dimensions.byteLength)
let current = dimensions.byteLength + encodedLabels.byteLength
labels.forEach(label => {
const temp = new Float64Array(table[label])
result.set(new Uint8Array(temp.buffer), current)
current += temp.byteLength
})
return result
}
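The time and size reported below can be gathered with the same helpers as before; a sketch (the chapter’s actual measurement code is not shown):

const packedColBinaryTime = time(asBinary, colTable)
const packedColBinarySize = asBinary(colTable).byteLength   // size in bytes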
nRows: 10000
nCols: 30
packedColBinaryTime: 6074
packedColBinarySize: 2400268
Packing the data table saves time because copying bits is faster than turning numbers into characters, but it doesn’t save as much space as expected. The reason is that double-precision numbers are 8 bytes long, but because we have chosen simple integer values for our tests, each can be represented by just 5 characters, which is 10 bytes since JavaScript uses two bytes per character. If we had “real” numbers the storage benefit would probably be more pronounced; once again, the result of our experiment depends on the test cases we choose.
Engineering
If science is the use of the experimental method to investigate the world, engineering is the use of the experimental method to investigate and improve the things that people build. Good software designers collect and analyze data all the time to find out whether one website design works better than another [Kohavi2020] or to improve the performance of CPUs [Patterson2017]; a few simple experiments like these can sometimes save weeks or months of effort.
Section 6.5: Exercises
Varying filter behavior
How does our decision about which storage format is better change if we keep 1% of rows when filtering instead of one-third? What if we keep 90% of rows?
Filtering by strings
Modify the comparison of filter and select to work with tables that contain columns of strings instead of columns of numbers and see how that changes performance. For testing, create random 4-letter strings using the characters A-Z and then filter by:
- an exact match,
- strings starting with a specific character, and
- strings that contain a specific character.
Join performance
A join combines data from two tables based on matching keys. For example, if the two tables are:
Key | Left |
---|---|
A | a1 |
B | b1 |
C | c1 |
and:
Key | Right |
---|---|
A | a2 |
A | a3 |
B | b2 |
then the join is:
Key | Left | Right |
---|---|---|
A | a1 | a2 |
A | a1 | a3 |
B | b1 | b2 |
Write a test to compare the performance of row-wise vs. column-wise storage when joining two tables based on matching numeric keys. Does the answer depend on the fraction of keys that match?
Join optimization
The simplest way to join two tables is to look for matching keys using a double loop. An alternative is to build an index for each table and then use it to construct matches. For example, suppose the tables are:
Key | Left |
---|---|
A | a1 |
B | b1 |
C | c1 |
and:
Key | Right |
---|---|
A | a2 |
A | a3 |
B | b2 |
The first step is to create a Map showing where each key is found in the first table:
{A: [0], B: [1], C: [2]}
The second step is to create a similar Map for the second table:
{A: [0, 1], B: [2]}
We can then loop over the keys in one of the maps, look up values in the second map, and construct all of the matches.
Write a function that joins two tables this way. Is it faster or slower than using a double loop? How does the answer depend on the number of keys and the fraction that match?
Flipping storage
Our tests showed that storing row-oriented tables as JSON is much slower than storing column-oriented tables. Write a test to determine whether converting a row-oriented table to a column-oriented table and then saving the latter is faster than saving the row-oriented table directly.
Sparse storage
A sparse matrix is one in which most of the values are zero. Instead of storing them all, a program can use a map to store non-zero values and a lookup function to return zero for anything that isn’t stored explicitly:
// Sketch: 'matrix' is any container with two-key contains() and get() methods
// that stores only the non-zero values.
const sparseMatrixGet = (matrix, row, col) => {
  return matrix.contains(row, col)
    ? matrix.get(row, col)
    : 0
}
The same technique can be used if most of the entries in a data table are missing. Write a function that creates a sparse table in which a random 5% of the values are non-zero and the other 95% are zero, then compare the memory requirements and performance of filter and select for this implementation versus those of row-wise and column-wise storage.
Loading time
Modify the programs in this section to measure the time required to convert a data table from JSON or binary form back to a data structure.
Saving fixed-width strings
To improve performance, databases often store fixed-width strings, i.e., they limit the length of the strings in a column to some fixed size and pad strings that are shorter than that.
- Write a function that takes an array of strings and an integer width and creates an ArrayBuffer containing the strings padded to that width. The function should throw an exception if any of the strings are longer than the specified width.
- Write another function that takes an ArrayBuffer as input and returns an array of strings. This function should remove the padding so that strings shorter than the fixed width are restored to their original form.
Saving variable-width strings
Fixed-width storage is inefficient for large blocks of text such as contracts, novels, and resumés, since padding every document to the length of the longest will probably waste a lot of space. An alternative way to store these in binary is to save each entry as a (length, text) pair.
- Write a function that takes a list of strings as input and returns an ArrayBuffer containing (length, text) pairs.
- Write another function that takes such an ArrayBuffer and returns an array containing the original text.
- Write tests with Mocha to confirm that your functions work correctly.
ASCII storage
The original ASCII standard specified a 7-bit character encoding for letters commonly used in English, and many data files still only use characters whose numeric codes are in the range 0–127.
- Write a function that takes an array of single-letter strings and returns an ArrayBuffer that stores them using one byte per character if all of the characters will fit into 7 bits, and multiple bytes per character if any of the characters require more than 7 bits.
- Write another function that takes an ArrayBuffer generated by the first function and re-creates the array of characters. The function must only take the ArrayBuffer as an argument, so the first element of the ArrayBuffer should indicate how to interpret the rest of its contents.
- Write tests with Mocha to check that your functions work correctly.