C Key Points
This appendix lists the key points for each chapter.
C.1 Getting Started
- Make tidiness a habit, rather than cleaning up your project files later.
- Include a few standard files in all your projects, such as README, LICENSE, CONTRIBUTING, CONDUCT and CITATION.
- Put runnable code in a `bin/` directory.
- Put raw/original data in a `data/` directory and never modify it.
- Put results in a `results/` directory. This includes cleaned-up data and figures (i.e., everything created using what’s in `bin` and `data`).
- Put documentation and manuscripts in a `docs/` directory.
- Refer to The Carpentries software installation guide if you’re having trouble.
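The layout described above can be sketched in a few lines of Python. This is only an illustration: the project name `myproject` and the helper `create_project` are hypothetical, the standard files are created empty and without extensions, and the project is built in a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path

# Directory and file names taken from the key points above.
STANDARD_DIRS = ["bin", "data", "docs", "results"]
STANDARD_FILES = ["README", "LICENSE", "CONTRIBUTING", "CONDUCT", "CITATION"]


def create_project(root):
    """Create the standard directories and empty standard files under root."""
    root = Path(root)
    for name in STANDARD_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in STANDARD_FILES:
        (root / name).touch(exist_ok=True)
    return sorted(entry.name for entry in root.iterdir())


# Build a throwaway project (hypothetical name) and list what was created.
with tempfile.TemporaryDirectory() as scratch:
    created = create_project(Path(scratch) / "myproject")
print(created)
```

Because `mkdir` and `touch` are called with `exist_ok=True`, running the function again on an existing project leaves existing directories and file contents untouched.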
C.2 The Basics of the Unix Shell
- A shell is a program that reads commands and runs other programs.
- The filesystem manages information stored on disk.
- Information is stored in files, which are located in directories (folders).
- Directories can also store other directories, which forms a directory tree.
- `pwd` prints the user’s current working directory.
- `/` on its own is the root directory of the whole filesystem.
- `ls` prints a list of files and directories.
- An absolute path specifies a location from the root of the filesystem.
- A relative path specifies a location in the filesystem starting from the current directory.
- `cd` changes the current working directory.
- `..` means the parent directory.
- `.` on its own means the current directory.
- `mkdir` creates a new directory.
- `cp` copies a file.
- `rm` removes (deletes) a file.
- `mv` moves (renames) a file or directory.
- `*` matches zero or more characters in a filename.
- `?` matches any single character in a filename.
- `wc` counts lines, words, and characters in its inputs.
- `man` displays the manual page for a given command; some commands also have a `--help` option.
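The difference between the two wildcards can be seen with Python's `fnmatch` module, which applies shell-style patterns to strings. This is a rough approximation rather than the shell's own expansion (for example, the shell treats names starting with a dot specially), and the filenames are invented for illustration.

```python
import fnmatch

# Hypothetical filenames used only to show how the patterns behave.
names = ["notes.txt", "data.csv", "data1.csv", "data2.csv", "data10.csv"]

# '*' matches zero or more characters, so every data file is selected.
star_matches = [n for n in names if fnmatch.fnmatchcase(n, "data*.csv")]

# '?' matches exactly one character, so only single-digit files are selected.
question_matches = [n for n in names if fnmatch.fnmatchcase(n, "data?.csv")]

print(star_matches)
print(question_matches)
```

Note that `data.csv` matches `data*.csv` (the `*` matches nothing) but not `data?.csv`, and `data10.csv` fails the second pattern because two characters sit where `?` allows one.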
C.3 Building Tools with the Unix Shell
- `cat` displays the contents of its inputs.
- `head` displays the first few lines of its input.
- `tail` displays the last few lines of its input.
- `sort` sorts its inputs.
- Use the up-arrow key to scroll up through previous commands to edit and repeat them.
- Use `history` to display recent commands and `!number` to repeat a command by number.
- Every process in Unix has an input channel called standard input and an output channel called standard output.
- `>` redirects a command’s output to a file, overwriting any existing content.
- `>>` appends a command’s output to a file.
- `<` redirects input to a command.
- A pipe `|` sends the output of the command on the left to the input of the command on the right.
- A `for` loop repeats commands once for every thing in a list.
- Every `for` loop must have a variable to refer to the thing it is currently operating on and a body containing commands to execute.
- Use `$name` or `${name}` to get the value of a variable.
C.4 Going Further with the Unix Shell
- Save commands in files (usually called shell scripts) for re-use.
- `bash filename` runs the commands saved in a file.
- `$@` refers to all of a shell script’s command-line arguments.
- `$1`, `$2`, etc., refer to the first command-line argument, the second command-line argument, etc.
- Place variables in quotes if the values might have spaces or other special characters in them.
- `find` prints a list of files with specific properties or whose names match patterns.
- `$(command)` inserts a command’s output in place.
- `grep` selects lines in files that match patterns.
- Use the `.bashrc` file in your home directory to set shell variables each time the shell runs.
- Use `alias` to create shortcuts for things you type frequently.
C.5 Building Command-Line Programs in Python
- Write command-line Python programs that can be run in the Unix shell like other command-line tools.
- If the user does not specify any input files, read from standard input.
- If the user does not specify any output files, write to standard output.
- Place all `import` statements at the start of a module.
- Use the value of `__name__` to determine if a file is being run directly or being loaded as a module.
- Use `argparse` to handle command-line arguments in standard ways.
- Use short options for common controls and long options for less common or more complicated ones.
- Use docstrings to document functions and scripts.
- Place functions that are used across multiple scripts in a separate file that those scripts can import.
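Several of the points above fit together in one small program. The sketch below is a minimal word counter, not a definitive design: the function names and the `-o`/`--outfile` option are assumptions made for illustration. It reads standard input when no files are named and writes to standard output when no output file is given. The streams are passed in as parameters so the behaviour can be shown with in-memory text.

```python
import argparse
import io
import sys


def count_words(stream):
    """Return the number of whitespace-separated words in a text stream."""
    return sum(len(line.split()) for line in stream)


def parse_args(argv):
    """Parse command-line arguments (the option names are illustrative)."""
    parser = argparse.ArgumentParser(description="Count words in text files.")
    parser.add_argument("infiles", nargs="*",
                        help="input files (default: standard input)")
    parser.add_argument("-o", "--outfile", default=None,
                        help="output file (default: standard output)")
    return parser.parse_args(argv)


def main(argv=None, stdin=sys.stdin, stdout=sys.stdout):
    """Count words in the named files, or in stdin if no files are given."""
    args = parse_args(argv)
    if args.infiles:
        total = 0
        for name in args.infiles:
            with open(name, "r") as reader:
                total += count_words(reader)
    else:
        total = count_words(stdin)
    if args.outfile:
        with open(args.outfile, "w") as writer:
            writer.write(f"{total}\n")
    else:
        stdout.write(f"{total}\n")
    return total


# Demonstration: in-memory streams stand in for standard input and output.
fake_out = io.StringIO()
main([], stdin=io.StringIO("a b c\nd e\n"), stdout=fake_out)
print(fake_out.getvalue().strip())
```

In a real script the demonstration lines would typically be replaced by a check that `__name__` equals `"__main__"` followed by a call to `main()`, so the file can be run directly or imported as a module without side effects.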
C.6 Using Git at the Command Line
- Use `git config` with the `--global` option to configure your username, email address, and other preferences once per machine.
- `git init` initializes a repository.
- Git stores all repository management data in the `.git` subdirectory of the repository’s root directory.
- `git status` shows the status of a repository.
- `git add` puts files in the repository’s staging area.
- `git commit` saves the staged content as a new commit in the local repository.
- `git log` lists previous commits.
- `git diff` shows the difference between two versions of the repository.
- Synchronize your local repository with a remote repository on a forge such as GitHub.
- `git remote` manages bookmarks pointing at remote repositories.
- `git push` copies changes from a local repository to a remote repository.
- `git pull` copies changes from a remote repository to a local repository.
- `git restore` and `git checkout` recover old versions of files.
- The `.gitignore` file tells Git what files to ignore.
C.7 Going Further with Git
- Use a branch-per-feature workflow to develop new features while leaving the master branch in working order.
- `git branch` creates a new branch.
- `git checkout` switches between branches.
- `git merge` merges changes from another branch into the current branch.
- Conflicts occur when files or parts of files are changed in different ways on different branches.
- Version control systems do not allow people to overwrite changes silently; instead, they highlight conflicts that need to be resolved.
- Forking a repository makes a copy of it on a server.
- Cloning a repository with `git clone` creates a local copy of a remote repository.
- Create a remote called `upstream` to point to the repository a fork was derived from.
- Create pull requests to submit changes from your fork to the upstream repository.
C.8 Working in Teams
- Welcome and nurture community members proactively.
- Create an explicit Code of Conduct for your project modeled on the Contributor Covenant.
- Include a license in your project so that it’s clear who can do what with the material.
- Create issues for bugs, enhancement requests, and discussions.
- Label issues to identify their purpose.
- Triage issues regularly and group them into milestones to track progress.
- Include contribution guidelines in your project that specify its workflow and its expectations of participants.
- Make rules about governance explicit.
- Use common-sense rules to make project meetings fair and productive.
- Manage conflict between participants rather than hoping it will take care of itself.
C.9 Automating Analyses with Make
- Make is a widely used build manager.
- A build manager re-runs commands to update files that are out of date.
- A build rule has targets, prerequisites, and a recipe.
- A target can be a file or a phony target that simply triggers an action.
- When a target is out of date with respect to its prerequisites, Make executes the recipe associated with its rule.
- Make executes as many rules as it needs to when updating files, but always respects prerequisite order.
- Make defines automatic variables such as `$@` (target), `$^` (all prerequisites), and `$<` (first prerequisite).
- Pattern rules can use `%` as a placeholder for parts of filenames.
- Makefiles can define variables using `NAME=value`.
- Make also has functions such as `$(wildcard ...)` and `$(patsubst ...)`.
- Use specially formatted comments to create self-documenting Makefiles.
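The idea of a target being out of date with respect to its prerequisites can be sketched in Python by comparing file modification times. This is a simplified illustration of the rule, not how Make itself is implemented; the function `is_out_of_date` and the filenames are hypothetical, and the timestamps are set by hand so the result does not depend on timing.

```python
import os
import tempfile
from pathlib import Path


def is_out_of_date(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(p).stat().st_mtime > target_time for p in prerequisites)


with tempfile.TemporaryDirectory() as workdir:
    data = Path(workdir) / "data.txt"
    result = Path(workdir) / "result.csv"
    data.write_text("raw data\n")

    # The target has not been built yet, so it needs building.
    missing = is_out_of_date(result, [data])

    # Build the target, then pretend it is newer than its prerequisite.
    result.write_text("derived\n")
    os.utime(data, (1000, 1000))
    os.utime(result, (2000, 2000))
    fresh = is_out_of_date(result, [data])

    # Pretend the prerequisite was edited after the target was built.
    os.utime(data, (3000, 3000))
    stale = is_out_of_date(result, [data])

print(missing, fresh, stale)
```

A build manager would run the rule's recipe in the first and third cases and skip it in the second.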
C.10 Configuring Programs
- Overlay configuration specifies settings for a program in layers, each of which overrides previous layers.
- Use a system-wide configuration file for general settings.
- Use a user-specific configuration file for personal preferences.
- Use a job-specific configuration file with settings for a particular run.
- Use command-line options to change things that commonly change.
- Use YAML or some other standard syntax to write configuration files.
- Save configuration information to make your research reproducible.
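Overlay configuration can be sketched with plain dictionaries, each layer overriding the ones before it. The setting names and values below are invented for illustration, and in practice each layer would be loaded from a file (for example in YAML) or from command-line options rather than written inline.

```python
def overlay(*layers):
    """Merge configuration layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical settings, ordered from most general to most specific.
system_config = {"fontsize": 10, "color": "black", "format": "png"}
user_config = {"color": "blue"}
job_config = {"format": "svg"}
command_line = {"fontsize": 12}

settings = overlay(system_config, user_config, job_config, command_line)
print(settings)
```

Saving the merged `settings` alongside the results records exactly which values a run used, which supports the reproducibility point above.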
C.11 Testing Software
- Test software to convince people (including yourself) that software is correct enough and to make tolerances on “enough” explicit.
- Add assertions to code so that it checks itself as it runs.
- Write unit tests to check individual pieces of code.
- Write integration tests to check that those pieces work together correctly.
- Write regression tests to check if things that used to work no longer do.
- A test framework finds and runs tests written in a prescribed fashion and reports their results.
- Test coverage is the fraction of lines of code that are executed by a set of tests.
- Continuous integration re-builds and/or re-tests software every time something changes.
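A minimal sketch of assertions and unit tests, assuming a hypothetical `mean` function as the code under test. The tests are plain functions whose names start with `test_`, a convention that frameworks such as pytest use to find tests; here they are called by hand so the example runs on its own.

```python
import math


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # The assertion makes the function check its own input as it runs.
    assert len(values) > 0, "mean() requires at least one value"
    return sum(values) / len(values)


def test_mean_of_integers():
    assert mean([1, 2, 3]) == 2


def test_mean_within_tolerance():
    # Floating-point results are rarely exact, so state the tolerance.
    assert math.isclose(mean([0.1, 0.2]), 0.15, rel_tol=1e-9)


def test_mean_rejects_empty_input():
    try:
        mean([])
    except AssertionError:
        return
    raise RuntimeError("mean([]) should have failed its assertion")


# Run the tests by hand; a test framework would discover and run them.
for test in (test_mean_of_integers,
             test_mean_within_tolerance,
             test_mean_rejects_empty_input):
    test()
print("all tests passed")
```

The second test shows one way to make the tolerance on "correct enough" explicit instead of relying on exact equality.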
C.12 Handling Errors
- Signal errors by raising exceptions.
- Use `try`/`except` blocks to catch and handle exceptions.
- Python organizes its standard exceptions in a hierarchy so that programs can catch and handle them selectively.
- “Throw low, catch high,” i.e., raise exceptions immediately but handle them at a higher level.
- Write error messages that help users figure out what to do to fix the problem.
- Store error messages in a lookup table to ensure consistency.
- Use a logging framework instead of `print` statements to report program activity.
- Separate logging messages into `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL` levels.
- Use `logging.basicConfig` to define basic logging parameters.
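The points above can be combined in one short sketch. The function names, the logger name, and the message wording are assumptions made for illustration: errors are raised where they are detected, caught one level up, worded from a shared lookup table, and reported through `logging` rather than `print`.

```python
import logging

# Lookup table so the same problem is always described the same way.
ERROR_MESSAGES = {
    "empty_input": "No values were provided; supply at least one number.",
    "not_a_number": "Could not convert {value!r} to a number; check for typos.",
}

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("example")


def parse_numbers(fields):
    """Convert strings to floats, raising ValueError at the first bad one."""
    if not fields:
        raise ValueError(ERROR_MESSAGES["empty_input"])
    numbers = []
    for field in fields:
        try:
            numbers.append(float(field))
        except ValueError:
            message = ERROR_MESSAGES["not_a_number"].format(value=field)
            raise ValueError(message) from None
    return numbers


def summarize(fields):
    """Catch errors at a higher level and report them through logging."""
    try:
        numbers = parse_numbers(fields)
    except ValueError as error:
        logger.error(error)
        return None
    logger.info("parsed %d values", len(numbers))
    return sum(numbers)


print(summarize(["1", "2.5"]))
print(summarize(["1", "x"]))
```

`parse_numbers` knows what went wrong but not what the program should do about it, while `summarize` decides how to respond, which is the split that "throw low, catch high" describes.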
C.13 Tracking Provenance
- Publish data and code as well as papers.
- Use DOIs to identify reports, datasets, and software releases.
- Use an ORCID to identify yourself as an author of a report, dataset, or software release.
- Data should be FAIR: findable, accessible, interoperable, and reusable.
- Put small datasets in version control repositories; store large ones on data sharing sites.
- Describe your software environment, analysis scripts, and data processing steps in reproducible ways.
- Make your analyses inspectable as well as reproducible.
C.14 Creating Packages with Python
- Use `setuptools` to build and distribute Python packages.
- Create a directory named `mypackage` containing a `setup.py` script and a subdirectory, also called `mypackage`, that contains the package’s source files.
- Use semantic versioning for software releases.
- Use a virtual environment to test how your package installs without disrupting your main Python installation.
- Use `pip` to install Python packages.
- The default repository for Python packages is PyPI.
- Use TestPyPI to test the distribution of your package.
- Use a README file for package-level documentation.
- Use Sphinx to generate documentation for a package.
- Use Read the Docs to host package documentation online.
- Create a DOI for your package using GitHub’s Zenodo integration.
- Publish details of your package in a software journal so others can cite it.
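Semantic versioning, mentioned above, numbers releases as MAJOR.MINOR.PATCH: the patch number goes up for backward-compatible bug fixes, the minor number for backward-compatible additions, and the major number for incompatible changes. The helper below is a hypothetical sketch of those rules for plain three-part versions only; it ignores pre-release and build suffixes.

```python
def bump_version(version, part):
    """Return a new MAJOR.MINOR.PATCH string with the given part incremented."""
    major, minor, patch = (int(field) for field in version.split("."))
    if part == "major":
        # Incompatible change: reset the lower-order numbers.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backward-compatible addition: reset the patch number.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backward-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump_version("1.4.2", "patch"))
print(bump_version("1.4.2", "minor"))
print(bump_version("1.4.2", "major"))
```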