Organised, transparent and reproducible science using R, git, and drake
1 How to organise your project directories
The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.
Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.
— Vince Buffalo (@vsbuffalo) April 15, 2013
1.1 Directory layout
1.1.1 The general disaster
Many people tend to organise their projects like this:
There are many reasons why we should ALWAYS avoid this:
- It is really hard to tell what version of your data is the original and what is the modified;
- It gets really messy because it mixes files with various extensions together;
- It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate it;
- You may face professional embarrassment when sharing your project directory with a colleague or your supervisor.
1.1.2 A niceR solution
A good project layout helps ensure the
- Integrity of data
- Portability of the project
- Easier to pick the project back up after a break
There is no one way to lay a project out. We have different approaches for different projects, reflecting the history of the project, who else is collaborating on that project.
Here are a couple of different ideas for laying a project out. This is the basic structure that I tend to use:
proj/
|-- R/
|-- data/
|-- output/
|-- |-- data/
|-- |-- figures/
|-- doc/
The
R
directory contains various files with function definitions (but only function definitions - no code that actually runs).The
data
directory contains data used in the analysis. This is treated as read only; in particular the R files are never allowed to write to the files in here. Depending on the project, these might be csv files, a database, and the directory itself may have subdirectories.The
output/data
directory contains simulation output, processed datasets, logs, or other processed things. Theoutput/figures
directory contains the output figures generated by your code. Altogether theoutput
directory only contains generated files; that is, I should always be able to delete the contents and regenerate them.The
doc
directory contains the paper. I work in Markdown which is nice because it can pick up figures directly made by R. Markdown is starting to get traction among biologists. With Word you’ll have to paste them in yourself as the figures update.
In this set up, I usually have the R script files that do things in the project root:
proj/
|-- R/
|-- data/
|-- output/
|-- |-- data/
|-- |-- figures/
|-- doc/
|-- analysis.R
For very simple projects, you might drop the R directory, perhaps replacing it with a single file analysis-functions.R
which you source
within the .R files that depend on the outputs.
The top of the analysis file usually looks something like
library(some_package)
library(some_other_package)
source("R/functions.R")
source("R/utilities.R")
…followed by the code that loads the data, cleans it up, runs the analysis and generates the figures.
Other people have other ideas
Carl Boettiger is an open science advocate who has described his layout in detail. This layout uses R packages for most of the code organisation, and would be a nice approach for large projects.
This article in PLOS Computational Biology describes a general framework.
1.1.3 Treat data as read only
This is probably the most important goal of setting up a project. Data are typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how they have been modified. We suggest to put your data into the data
directory and treat it as read only. Within your scripts you might generate derived data sets either temporarily (in an R session only) or semi-permanently (as a file in output/data
), but the original data is always left in an untouched state.
1.1.4 Treat generated output as disposable
In this approach, files in directory output/
are all generated by the scripts. A nice thing about this approach is that if the file names of generated files change (e.g, changing from phylogeny.pdf
to mammal-phylogeny.pdf
) files with the old names may still stick around, but because they’re in this directory you know you can always delete them. Before submitting a paper, I will go through and delete all the generated files and rerun the analysis to make sure that I can create all the analyses and figures from the data.
1.1.5 Separate function definition and application
When your project is new and shiny, the script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. The actual analysis scripts then become relatively short, and use the functions defined in scripts in R
. Those scripts do nothing but define functions so that they can always be source()
’d by the analysis scripts.
1.2 Setting up a project in RStudio
This gets rid of the #1 problem with most people’s projects face; where do you find the data. Two solutions people generally come up with are:
- Hard code the full file name for each file you load (e.g.,
/Users/barneche/phd/ctotpaper/data/some_data.csv
) - Set the working directory at the beginning of your script file
/Users/barneche/phd/ctotpaper/
then doingread.csv("data/some_data.csv")
The second of these is probably preferable to the first, because the “special case” part is restricted to just one line in your file. However, the project is still now quite fragile, because moving it from one place to another, you must change this file. Some examples of when you might do this:
- Archiving a project (moving it from a “current projects” directory to a new projects directory)
- Giving the code to somebody else (your lab mate, collaborator, supervisor)
- Uploading the code with your manuscript submission for review, or to Dryad after acceptance.
- New computer and new directory layout (especially changing platforms, or if your previous mess got too bad and you wanted to clean up).
- Any number of new reasons
The second case hints at a solution too; if we can start R in a particular directory then we can just use paths relative to the project root and have everything work nicely.
To create a project in R studio:
- “File”: “New Project…”
- choose “New Directory (start a project in a brand new working directory)”.
- Under “Project Type”, choose “New Project”.
- In the “Directory name” type the name for the project. This might be
chapter_n
for a thesis, or something more descriptive likefish_behaviour
. - In the “Create project as a subdirectory of” field select (type or browse) for the parent directory of the project. By default this is probably your home directory, but you might prefer your Documents folder.