G Key Points
G.1 Simple Beginnings
- Use `print(expression)` to print the value of a single expression.
- Variable names may include letters, digits, `.`, and `_`, but `.` should be avoided, as it sometimes has special meaning.
- R’s atomic data types include logical, integer, double (also called numeric), and character.
- R stores collections in homogeneous vectors of atomic types, or in heterogeneous lists.
- ‘Scalars’ in R are actually vectors of length 1.
- Vectors and lists are created using the function `c(...)`.
- Vector indices from 1 to `length(vector)` select single elements.
- Negative indices to vectors deselect elements from the result.
- The index 0 on its own selects no elements, creating a vector or list of length 0.
- The expression `low:high` creates the vector of integers from `low` to `high` inclusive.
- Subscripting a vector with a vector of numbers selects the elements at those locations (possibly with repeats).
- Subscripting a vector with a vector of logicals selects elements where the indexing vector is `TRUE`.
- Values from short vectors (such as ‘scalars’) are repeated to match the lengths of longer vectors.
- The special value `NA` represents missing values, and (almost all) operations involving `NA` produce `NA`.
- The special value `NULL` represents a nonexistent vector, which is not the same as a vector of length 0.
- A list is a heterogeneous vector capable of storing values of any type (including other lists).
- Indexing with `[` returns a structure of the same type as the structure being indexed (e.g., returns a list when applied to a list).
- Indexing with `[[` strips away one level of structure (i.e., returns the indicated element without any wrapping).
- Use `list('name' = value, ...)` to name the elements of a list.
- Use either `L['name']` or `L$name` to access elements by name.
- Use back-quotes around the name with `$` notation if the name is not a legal R variable name.
- Use `matrix(values, nrow = N)` to create a matrix with `N` rows containing the given values.
- Use `m[i, j]` to get the value at the i’th row and j’th column of a matrix.
- Use `m[i,]` to get a vector containing the values in the i’th row of a matrix.
- Use `m[,j]` to get a vector containing the values in the j’th column of a matrix.
- Use `for (loop_variable in collection) { ...body... }` to create a loop.
- Use `if (expression) { ...body... } else if (expression) { ...body... } else { ...body... }` to create conditionals.
- Conditions must have length 1; use `any(...)` and `all(...)` to collapse logical vectors to single values.
- Use `function(...arguments...) { ...body... }` to create a function.
- Use `variable <- function(...arguments...) { ...body... }` to create a function and give it a name.
- The body of a function can be a single expression or a block in curly braces.
- The last expression evaluated in a function is returned as its result.
- Use `return(expression)` to return a result early from a function.
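
Several of the points above can be seen together in a few lines of base R. The following is a minimal sketch; the variable names and the `describe` function are invented for illustration.

```r
# Vectors: 1-based, negative, repeated, and logical indexing.
v <- c(10, 20, 30, 40)
second <- v[2]             # a single element
without_first <- v[-1]     # a negative index deselects an element
repeated <- v[c(1, 1, 3)]  # repeats are allowed
large <- v[v > 15]         # keeps elements where the test is TRUE
nothing <- v[0]            # index 0 on its own selects no elements

# Recycling: the 'scalar' 1 is repeated to match the longer vector.
shifted <- v + 1

# NA propagates through (almost all) operations.
total <- sum(c(1, NA, 3))

# Lists: `[` keeps the list wrapper, `[[` strips one level away.
person <- list(name = "Ada", scores = c(90, 85))
wrapped <- person["name"]
bare <- person[["name"]]

# Matrices: values fill column by column.
m <- matrix(1:6, nrow = 2)
first_row <- m[1, ]

# Conditions must have length 1, so collapse with all() or any().
describe <- function(values) {
  if (all(values > 0)) {
    return("all positive")  # early return
  }
  "not all positive"        # the last expression evaluated is the result
}
```

Calling `describe(v)` takes the early return, while `describe(c(-1, 2))` falls through to the final expression.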
G.2 The Tidyverse
- `install.packages('name')` installs packages.
- `library(name)` (without quoting the name) loads a package.
- `library(tidyverse)` loads the entire collection of tidyverse libraries at once.
- `read_csv(filename)` reads CSV files that use the string ‘NA’ to represent missing values.
- `read_csv` infers each column’s data types based on the first thousand values it reads.
- A tibble is the tidyverse’s version of a data frame, which represents tabular data.
- `head(tibble)` and `tail(tibble)` inspect the first and last few rows of a tibble.
- `summary(tibble)` displays a summary of a tibble’s structure and values.
- `tibble$column` selects a column from a tibble, returning a vector as a result.
- `tibble['column']` selects a column from a tibble, returning a tibble as a result.
- `tibble[,c]` selects column `c` from a tibble, returning a tibble as a result.
- `tibble[r,]` selects row `r` from a tibble, returning a tibble as a result.
- Use ranges and logical vectors as indices to select multiple rows/columns or specific rows/columns from a tibble.
- `tibble[[c]]` selects column `c` from a tibble, returning a vector as a result.
- `min(...)`, `mean(...)`, `max(...)`, and `sd(...)` calculate the minimum, mean, maximum, and standard deviation of data.
- These aggregate functions include `NA`s in their calculations, and so will produce `NA` if the input data contains any.
- Use `func(data, na.rm = TRUE)` to remove `NA`s from data before calculations are done (but make sure this is statistically justified).
- `filter(tibble, condition)` selects rows from a tibble that pass a logical test on their values.
- `arrange(tibble, column)` or `arrange(tibble, desc(column))` arranges rows according to values in a column (the latter in descending order).
- `select(tibble, column, column, ...)` selects columns from a tibble.
- `select(tibble, -column)` excludes a column from a tibble.
- `mutate(tibble, name = expression, name = expression, ...)` adds new columns to a tibble using values from existing columns.
- `group_by(tibble, column, column, ...)` groups rows that have the same values in the specified columns.
- `summarize(tibble, name = expression, name = expression)` aggregates tibble values (by groups if the rows have been grouped).
- `tibble %>% function(arguments)` performs the same operation as `function(tibble, arguments)`.
- Use `%>%` to create pipelines in which the left side of each `%>%` becomes the first argument of the next stage.
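
As a rough illustration of how these verbs combine, the sketch below builds a small tibble and pipes it through several stages. The `measurements` data is invented, and only `dplyr` is loaded (it re-exports `tibble()` and `%>%`) rather than the whole tidyverse.

```r
library(dplyr)  # one tidyverse package; library(tidyverse) would also work

# Invented data for illustration only.
measurements <- tibble(
  site  = c("A", "A", "B", "B", "B"),
  value = c(1.5, NA, 2.0, 4.0, 3.0)
)

as_tibble_column <- measurements["value"]    # still a tibble
as_vector_column <- measurements[["value"]]  # a plain vector

with_na <- mean(measurements$value)                    # NA: one value is missing
without_na <- mean(measurements$value, na.rm = TRUE)   # drops the NA first

by_site <- measurements %>%
  filter(!is.na(value)) %>%                  # keep rows that pass the test
  mutate(doubled = value * 2) %>%            # new column from an existing one
  group_by(site) %>%                         # group rows sharing a site
  summarize(n = n(), avg = mean(value)) %>%  # one row per group
  arrange(desc(avg))                         # highest average first
```

Each stage receives the result of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations happen.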
G.3 Creating Packages
- Develop data-cleaning scripts one step at a time, checking intermediate results carefully.
- Use `read_csv` to read CSV-formatted tabular data into a tibble.
- Use the `skip` and `na` parameters of `read_csv` to skip rows and interpret certain values as `NA`.
- Use `str_replace` to replace portions of strings that match patterns with new strings.
- Use `is.numeric` to test if a value is a number and `as.numeric` to convert it to a number.
- Use `map` to apply a function to every element of a vector in turn.
- Use `map_dfc` and `map_dfr` to map functions across the columns and rows of a tibble.
- Pre-allocate storage in a list for each result from a loop and fill it in rather than repeatedly extending the list.
- An R package can contain code, data, and documentation.
- R code is distributed as compiled bytecode in packages, not as source.
- R packages are almost always distributed through CRAN, the Comprehensive R Archive Network.
- Most of a project’s metadata goes in a file called `DESCRIPTION`.
- Metadata related to imports and exports goes in a file called `NAMESPACE`.
- Add patterns to a file called `.Rbuildignore` to ignore files or directories when building a project.
- All source code for a package must go in the `R` sub-directory.
- `library` calls in a package’s source code will not be executed as the package is loaded after distribution.
- Data can be included in a package by putting it in the `data` sub-directory.
- Data must be in `.rda` format in order to be loaded as part of a package.
- Data in other formats can be put in the `inst/extdata` directory, and will be installed when the package is installed.
- Add comments starting with `#'` to an R file to document functions.
- Use roxygen2 to extract these comments to create manual pages in the `man` directory.
- Use `@export` directives in roxygen2 comment blocks to make functions visible outside a package.
- Add required libraries to the `Imports` section of the `DESCRIPTION` file to indicate that your package depends on them.
- Use `package::function` to access externally-defined functions inside a package.
- Alternatively, add `@import` directives to roxygen2 comment blocks to make external functions available inside the package.
- Import `.data` from `rlang` and use `.data$column` to refer to columns instead of using bare column names.
- Create a file called `R/package.R` and document `NULL` to document the package as a whole.
- Create a file called `R/dataset.R` and document the string `'dataset'` to document a dataset.
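
The sketch below shows what a documented, exported function in a package's `R` directory might look like, followed by the pre-allocation pattern from the data-cleaning points. The function `clean_percent` and its inputs are hypothetical, and the example assumes `stringr` is installed and listed under `Imports`.

```r
#' Convert percentage strings to numbers.
#'
#' `clean_percent` is a hypothetical helper, not part of any real package.
#'
#' @param text A character vector such as `c("12%", "7.5%")`.
#' @return A numeric vector with the percent signs removed.
#' @export
clean_percent <- function(text) {
  # package::function is used instead of a library() call inside package code.
  stripped <- stringr::str_replace(text, "%", "")
  as.numeric(stripped)
}

# Pre-allocate a list and fill it in, rather than growing it in the loop.
inputs <- c("12%", "7.5%", "100%")
results <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
  results[[i]] <- clean_percent(inputs[i])
}
cleaned <- unlist(results)
```

Running roxygen2 over a file like this would turn the `#'` block into a manual page and add the function to the package's exports.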
G.4 Non-Standard Evaluation
- R uses lazy evaluation: expressions are evaluated when their values are needed, not before.
- Use `expr` to create an expression without evaluating it.
- Use `eval` to evaluate an expression in the context of some data.
- Use `enquo` to create a quosure containing an unevaluated expression and its environment.
- Use `quo_get_expr` to get the expression out of a quosure.
- Use `!!` to splice the expression in a quosure into a function call.
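
A small sketch of these tools working together, assuming the `rlang` and `dplyr` packages are installed. The helper functions `show_arg` and `filter_rows` are made up for illustration.

```r
library(rlang)

# expr() captures the expression; eval() runs it against some data.
captured <- expr(a + b)
evaluated <- eval(captured, list(a = 1, b = 2))

# enquo() captures a function argument (and its environment) unevaluated.
show_arg <- function(x) {
  quosure <- enquo(x)
  quo_get_expr(quosure)  # the bare expression, without the environment
}
argument_text <- deparse(show_arg(height * 2))

# !! splices the captured expression into another call.
filter_rows <- function(data, condition) {
  condition <- enquo(condition)
  dplyr::filter(data, !!condition)
}
kept <- filter_rows(dplyr::tibble(x = 1:5), x > 3)
```

Note that `height` never has to exist: `show_arg` only inspects the expression and never evaluates it.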
G.5 Intellectual Debt
- Don’t use `setwd`.
- The formula operator `~` delays evaluation of its operand or operands.
- `~` was created to allow users to pass formulas into functions, but is used more generally to delay evaluation.
- Some tidyverse functions define `.` to be the whole data, `.x` and `.y` to be the first and second arguments, and `..N` to be the N’th argument.
- These convenience parameters are primarily used when the data being passed to a pipelined function needs to go somewhere other than in the first parameter’s slot.
- ‘Copy-on-modify’ means that data is aliased until something attempts to modify it, at which point it is duplicated, so that data always appears to be unchanged.
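
The formula shorthand, the `.` placeholder, and copy-on-modify can each be seen in a line or two. This is a minimal sketch that assumes the `purrr` package is installed (it re-exports `%>%`); the values are arbitrary.

```r
library(purrr)

# ~ delays evaluation; .x and .y stand for the first and second arguments.
doubled <- map_dbl(1:3, ~ .x * 2)
sums <- map2_dbl(1:3, 4:6, ~ .x + .y)

# . sends the piped value somewhere other than the first argument's slot.
counted <- 4 %>% seq(1, .)

# Copy-on-modify: the two names share data until one of them is modified.
original <- c(1, 2, 3)
alias <- original
alias[1] <- 99  # the modification triggers a copy; original is untouched
```

After the last line `original` still holds its initial values, which is why data in R appears never to change behind the caller's back.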
G.6 Testing and Error Handling
- Operations signal conditions in R when errors occur.
- The three built-in levels of conditions are messages, warnings, and errors.
- Programs can signal these themselves using the functions `message`, `warning`, and `stop`.
- Operations can be placed in a call to the function `try` to suppress errors, but this is a bad idea.
- Operations can be placed in a call to the function `tryCatch` to handle errors.
- Use testthat to write unit tests for R.
- Put unit tests for an R package in the `tests/testthat` directory.
- Put tests in files called `test_group.R` and call them `test_something`.
- Use `test_dir` to run tests from a particular directory that match a pattern.
- Write tests for data transformation steps as well as library functions.
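
One way these pieces fit together is sketched below: a function signals conditions with `stop` and `warning`, handles them with `tryCatch`, and is then checked with testthat. The function `safe_log` is hypothetical, and the example assumes the `testthat` package is installed.

```r
# safe_log is a hypothetical helper used only for illustration.
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) {
        stop("x must be numeric")                  # signal an error
      }
      if (any(x <= 0)) {
        warning("x contains non-positive values")  # signal a warning
      }
      log(x)
    },
    warning = function(w) {
      message("recovered from warning: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      message("recovered from error: ", conditionMessage(e))
      NA_real_
    }
  )
}

library(testthat)

test_that("safe_log recovers from bad input", {
  expect_equal(safe_log(1), 0)
  expect_true(is.na(suppressMessages(safe_log("text"))))
  expect_true(is.na(suppressMessages(safe_log(-1))))
})
```

In a package, the `test_that` block would live in a file under `tests/testthat` rather than next to the function it tests.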
G.7 Advanced Topics
- The `reticulate` library allows R programs to access data in Python programs and vice versa.
- Use `py$whatever` to access a top-level Python variable from R.
- Use `r.whatever` to access a top-level R definition from Python.
- R is always indexed from 1 (even in Python) and Python is always indexed from 0 (even in R).
- Numbers in R are floating point by default, so use a trailing ‘L’ to force a value to be an integer.
- A Python script run from an R session believes it is the main script, i.e., `__name__` is `'__main__'` inside the Python script.
- S3 is the most commonly used object-oriented programming system in R.
- Every object can store metadata about itself in attributes, which are set and queried with `attr`.
- The `dim` attribute stores the dimensions of a matrix (which is physically stored as a vector).
- The `class` attribute of an object defines its class or classes (it may have several character entries).
- When `F(X, ...)` is called, and `X` has class `C`, R looks for a function called `F.C` (the `.` is just a naming convention).
- If an object has multiple classes in its `class` attribute, R looks for a corresponding method for each in turn.
- Every user-defined class `C` should have functions `new_C` (to create it), `validate_C` (to validate its integrity), and `C` (to create and validate).
- Use the `DBI` package to work with relational databases.
- Use `DBI::dbConnect(...)` with database-specific parameters to connect to a specific database.
- Use `dbGetQuery(connection, "query")` to send an SQL query string to a database and get a data frame of results.
- Parameterize queries using `:name` as a placeholder in the query and `params = list(name = value)` as a third parameter to `dbGetQuery` to specify actual values.
- Use `dbFetch` in a `while` loop to page results.
- Use `dbWriteTable` to write an entire data frame to a table, and `dbExecute` to execute a single insertion statement.
- Dates… why did it have to be dates?
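
The S3 points above can be sketched in base R as follows. The `temperature` class, its `units` attribute, and its validation rule are invented for illustration and are not taken from any package.

```r
# The dim attribute turns a plain vector into a matrix.
m <- 1:6
attr(m, "dim") <- c(2L, 3L)

# new_C: build the object without checking it.
new_temperature <- function(value = double(), units = "C") {
  stopifnot(is.double(value))
  structure(value, class = "temperature", units = units)
}

# validate_C: check the object's integrity and return it unchanged.
validate_temperature <- function(x) {
  if (!(attr(x, "units") %in% c("C", "K"))) {
    stop("units must be 'C' or 'K'")
  }
  x
}

# C: the user-facing constructor that creates and validates.
temperature <- function(value, units = "C") {
  validate_temperature(new_temperature(as.double(value), units))
}

# format(X) on an object of class temperature dispatches to format.temperature.
format.temperature <- function(x, ...) {
  paste(unclass(x), attr(x, "units"))
}

reading <- temperature(21.5)
label <- format(reading)
```

Keeping the low-level constructor separate from the validator means internal code that already trusts its inputs can skip the checks, while users always go through `temperature`.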