Data Dictionaries
I was helping some friends analyze some data today,
and discovered that the ./data directory in the project they had inherited
contained a file called manifest.csv
that was loaded and echoed at the top of their analysis notebook.
I can’t show you what it contained—their data isn’t public—but
the equivalent for Allison Horst’s Palmer Penguins dataset
would look something like this:
table,column,type,unit,na,meaning
penguins,species,text,NA,false,common name of species
penguins,island,text,NA,false,island where data collected
penguins,bill_length,number,mm,true,bill length (Figure 1)
penguins,bill_depth,number,mm,true,bill depth (Figure 1)
penguins,flipper_length,number,mm,true,flipper length (Figure 2)
penguins,body_mass_g,number,g,true,bird weight
penguins,sex,text,NA,true,bird sex
It’s easier to see and appreciate laid out like this:
table | column | type | unit | na | meaning |
---|---|---|---|---|---|
penguins | species | text | NA | false | common name of species |
penguins | island | text | NA | false | island where data collected |
penguins | bill_length | number | mm | true | bill length (Figure 1) |
penguins | bill_depth | number | mm | true | bill depth (Figure 1) |
penguins | flipper_length | number | mm | true | flipper length (Figure 2) |
penguins | body_mass_g | number | g | true | bird weight |
penguins | sex | text | NA | true | bird sex |
The table name is included because
the manifest.csv I’m imitating described several related data files;
one of the column descriptions even said,
“Foreign key into other_table/other_name”.
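Loading and echoing a manifest like this takes only a few lines.
What follows is a minimal sketch in Python, not the code from my friends' notebook:
the inline MANIFEST string stands in for reading data/manifest.csv (and is abridged),
and the function name load_manifest is made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the contents of data/manifest.csv (abridged penguins example).
MANIFEST = """\
table,column,type,unit,na,meaning
penguins,species,text,NA,false,common name of species
penguins,bill_length,number,mm,true,bill length (Figure 1)
penguins,sex,text,NA,true,bird sex
"""


def load_manifest(stream):
    """Group manifest rows by table name: {table: {column: row_dict}}."""
    tables = defaultdict(dict)
    for row in csv.DictReader(stream):
        tables[row["table"]][row["column"]] = row
    return dict(tables)


manifest = load_manifest(io.StringIO(MANIFEST))

# Echo the manifest so that readers of the notebook see it first.
for table, columns in manifest.items():
    for name, spec in columns.items():
        print(f"{table}.{name}: {spec['type']} [{spec['unit']}] {spec['meaning']}")
```

Grouping by table first means a manifest that describes several related files
can be looked up one table at a time.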
This doesn’t include everything—for example,
it doesn’t specify which text fields are enumerations (or factors, if you’re a statistician)—and
the figures referred to in the original manifest.csv
aren’t anywhere in the project repository—but
wouldn’t life be better if every project you worked with came with something like this?
I once spent several days trying to figure out
which temperature measurements in a dataset were °C and which were °F,
so finding SI units somewhere discoverable
was enough to make me swoon.
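A manifest like this can also be checked against the data it describes.
Here is a rough sketch under several assumptions of mine:
the function name check_table and the sample DATA are invented for illustration,
empty strings and "NA" are both treated as missing values,
and only the "number" type is verified.
It does not check units or enumerations, since the manifest doesn't give enough information for that.

```python
import csv
import io

# Hypothetical two-column manifest and a deliberately flawed data file.
MANIFEST = """\
table,column,type,unit,na,meaning
penguins,species,text,NA,false,common name of species
penguins,bill_length,number,mm,true,bill length (Figure 1)
"""

DATA = """\
species,bill_length
Adelie,39.1
,40.3
Gentoo,long
"""


def check_table(table, manifest_stream, data_stream):
    """Return a list of human-readable problems found in one data table."""
    specs = {
        row["column"]: row
        for row in csv.DictReader(manifest_stream)
        if row["table"] == table
    }
    problems = []
    reader = csv.DictReader(data_stream)
    # Report columns that appear in only one of the two files.
    for name in sorted(set(reader.fieldnames or []) ^ set(specs)):
        problems.append(f"column {name!r} is not in both data and manifest")
    for line_no, record in enumerate(reader, start=2):  # line 1 is the header
        for name, spec in specs.items():
            value = record.get(name)
            if value in (None, "", "NA"):
                # Missing values are only a problem if the manifest forbids them.
                if spec["na"] != "true":
                    problems.append(f"line {line_no}: {name} is missing")
            elif spec["type"] == "number":
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"line {line_no}: {name}={value!r} is not a number"
                    )
    return problems


problems = check_table("penguins", io.StringIO(MANIFEST), io.StringIO(DATA))
for problem in problems:
    print(problem)
```

With the sample data above, this reports the missing species on line 3
and the non-numeric bill length on line 4.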