The issue of reproducibility has been getting a lot of attention in the R community, and I have learned a lot from the ongoing discussions. So these days, when I seriously analyze data in R, I bundle my materials into a reproducible workflow, both to smooth development and to help others replicate the results. These are my thoughts so far, explained in terms of a natural progression of tools.

knitr

Yihui Xie’s knitr package is the state of the art of reproducibility in R. Code is dynamically woven into reports, and code chunks can be cached to speed development. I have relied on knitr frequently for tutorials, minimal working examples, graduate school homework assignments, presentations about code, and other pedagogical tasks that require as much code as possible to be on display. However, for serious projects with thousands of lines of code and computations that take several hours or days, I believe knitr is not the best tool for managing the entire workflow. For me, even with caching and externalized code, a medium-to-large project quickly becomes cumbersome when implemented as a knitr report from start to finish. To stay clean, organized, and modular, to open up more high-performance parallel computing options, and to ensure that document compilation issues do not interfere with the rest of the project, I turn to other reproducible build systems.
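
To give a flavor of the approach, here is a minimal sketch of a dynamic report with a cached chunk (a hypothetical report.Rmd; the cached chunk reruns only when its code or options change).

---
title: "Minimal knitr example"
output: html_document
---

```{r mpg_model, cache = TRUE}
# Cached chunk: knitr reruns this code only when the chunk itself changes.
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients
```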

GNU Make

Programmers have been using GNU Make for decades to compile low-level code, and the utility extends to programmable workflows in general. The main selling point is that the intermediate stages of a project are (re)built if and only if they do not exist or their dependencies change. Here’s an example that makes a file called plot.pdf using R scripts. I write the steps to make plot.pdf in a file called Makefile below.

all: plot.pdf

plot.pdf: plot.R random.csv mtcars.csv
	Rscript plot.R

mtcars.csv: mtcars.R
	Rscript mtcars.R

random.csv: random.R
	Rscript random.R

clean:
	rm -f mtcars.csv random.csv plot.pdf

Above, plot.pdf depends on plot.R, random.csv, and mtcars.csv. With those dependencies available, plot.pdf is constructed by executing Rscript plot.R in the command line. Similarly, mtcars.csv depends on mtcars.R and is constructed by calling Rscript mtcars.R in the command line. The case of random.csv is analogous.

Below are the contents of plot.R,

x <- read.csv("mtcars.csv")[["mpg"]]
y <- read.csv("random.csv")[["y"]]
pdf("plot.pdf")
plot(x, y)
dev.off()

mtcars.R,

data(mtcars)
write.csv(mtcars, "mtcars.csv")

and random.R.

d = data.frame(y = rnorm(32))
write.csv(d, "random.csv")

With all these files, I open a command line program such as Terminal, type make, and press enter. The files are generated in sequence, and the console shows the following.

Rscript random.R
Rscript mtcars.R
Rscript plot.R
null device
          1

If I wanted to distribute this workflow over two parallel processes, I could have typed make -j 2. In that case, mtcars.csv and random.csv would have been computed simultaneously, and then plot.pdf would have been created afterwards.

If I run make again, I get

make: Nothing to be done for 'all'.

"Targets" such as plot.pdf, mtcars.csv, and random.csv are (re)built if and only if they do not exist or their dependencies change. For example, suppose I change mtcars.R and rerun make. Then, mtcars.csv and plot.pdf are rebuilt, but random.csv is left alone.

Rscript mtcars.R
Rscript plot.R
null device
          1

Make saves a lot of time, but the solution is incomplete. To get the most out of Make's conditional rebuild policy, you have to split your R code into as many separate files as possible and collate them yourself, which can get messy. Also, because Make only compares file timestamps, it cannot tell trivial edits from substantive ones: adding comments or whitespace to a script still triggers unnecessary rebuilds. A better solution for R is Rich FitzJohn's remake.

remake

Rich FitzJohn's remake package is like GNU Make for a single R session, and it removes much of the awkwardness of combining Make and R. To run the previous example, I first create the YAML file remake.yml below, which is analogous to a Makefile.

sources: code.R

targets:

  all:
    depends: plot.pdf

  plot.pdf:
    command: my_plot(mtcars, random)
    plot: TRUE

  mtcars:
    command: my_mtcars()

  random:
    command: my_random()

I also maintain a source file called code.R.

my_mtcars = function(){
  data(mtcars)
  mtcars
}

my_random = function(){
  data.frame(y = rnorm(32))
}

my_plot = function(mtcars, random){
  plot(mtcars$mpg, random$y)
}

Then, to build the project, I open an R session and run the following.

library(remake)
make()

Only outdated or missing targets are (re)built in future calls to make(). For example, changing the contents of my_random() does not trigger a rebuild of mtcars. In addition, you can add comments and whitespace throughout code.R without triggering unnecessary rebuilds. I also like that I don't have to micromanage intermediate files. Rather than saving CSV files, I just let remake compute objects mtcars and random and automatically maintain them in a hidden folder called .remake/objects, managed internally by storr.
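
For example, given the remake.yml above, I can also build a single target by name, and remake skips work that is already up to date. A minimal sketch:

library(remake)
make("random", remake_file = "remake.yml")    # (re)builds random only if my_random() changed
make("plot.pdf", remake_file = "remake.yml")  # then brings plot.pdf (and any stale dependencies) up to date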

Remake has some idiosyncrasies, and commands must be specified carefully. For example, use I() to mark string literals, and, with the exception of I(), avoid nesting function calls inside commands. For more information, refer to the TROUBLESHOOTING.md file of remakeGenerator.

Remake is amazing, but it still has room for improvement. For one, it currently does not support parallel computing, so there is not a built-in way to distribute targets over parallel processes as with make -j. In addition, the user needs to maintain a YAML file that could get large and complicated for serious projects. With these issues in mind, I wrote the parallelRemake and remakeGenerator packages.

parallelRemake

The parallelRemake package is basically remake plus parallel computing. You begin with a YAML file such as the remake.yml from the example above, and then parallelRemake writes a Makefile that can run the remake targets in parallel. Using the previous example, I first open an R session and call write_makefile().

library(parallelRemake)
write_makefile()

That produces the Makefile shown below.

# Generated by parallelRemake::write_makefile() on 2016-06-14 08:45:47

.PHONY: all plot.pdf mtcars random clean

all: plot.pdf

plot.pdf: mtcars random
	Rscript -e 'remake::make("plot.pdf", remake_file = "remake.yml")'

mtcars:
	Rscript -e 'remake::make("mtcars", remake_file = "remake.yml")'

random:
	Rscript -e 'remake::make("random", remake_file = "remake.yml")'

clean:
	Rscript -e 'remake::make("clean", remake_file = "remake.yml")'
	rm -rf .remake

I can now run the whole project in two parallel processes with make -j 2, and the targets mtcars and random are built simultaneously.

You can set up and run this example from start to finish with

library(parallelRemake)
run_example_parallelRemake()

If you want to run make -j to distribute tasks over multiple nodes of a Slurm cluster, refer to the Makefile in this post and write

write_makefile(...,
  begin = c(
    "SHELL=srun",
    ".SHELLFLAGS= [ARGS] bash -c"))

in an R session, where [ARGS] stands for additional arguments to srun. Then, once the Makefile is generated, you can run the workflow with nohup nice -19 make -j [N] & in the command line, where [N] is the number of simultaneous tasks. For other task managers such as PBS, this approach may not be possible. Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake storr cache.

If your job scheduler is not Slurm, you may need to write a custom stand-in for a shell. With the Univa Grid Engine, for instance, you would call

write_makefile(..., begin = "SHELL=./shell.sh")

where shell.sh contains

#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y

Be sure to allow execution with

chmod +x shell.sh

and then distribute over [N] jobs with

nohup nice -19 make -j [N] &

As before, intermediate R objects are stored in .remake/objects, a cache managed internally by storr. To load these objects into an R session for debugging and planning purposes, use the recall() function. To see the objects available, use recallable().
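
For example, after building the workflow above, I might inspect the cache interactively. A short sketch (the object names come from the earlier example):

library(parallelRemake)
recallable()      # list the objects available in the cache, e.g. mtcars and random
recall("mtcars")  # load the cached mtcars object into the current R session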

remakeGenerator

The remakeGenerator package is for generating remake-style workflows with minimal code. It is especially useful for reproducibly analyzing multiple datasets in multiple ways in computationally demanding workflows under heavy development. Let's dive into an example, which you can run from start to finish with

library(remakeGenerator)
example_remakeGenerator(1)

The online file code.R has all the building blocks (i.e., user-defined functions), and workflow.R, shown below, is the master plan.

library(remakeGenerator)

datasets = commands(
  normal16 = normal_dataset(n = 16),
  poisson32 = poisson_dataset(n = 32),
  poisson64 = poisson_dataset(n = 64))

analyses = analyses(
  commands = commands(
    linear = linear_analysis(..dataset..),
    quadratic = quadratic_analysis(..dataset..)),
  datasets = datasets)

summaries = summaries(
  commands = commands(
    mse = mse_summary(..dataset.., ..analysis..),
    coef = coefficients_summary(..analysis..)),
  analyses = analyses, datasets = datasets, gather = strings(c, rbind))

output = commands(coef.csv = write.csv(coef, target_name))

plots = commands(mse.pdf = hist(mse, col = I("black")))
plots$plot = TRUE

reports = data.frame(target = strings(markdown.md, latex.tex),
  depends = c("poisson32, coef, coef.csv", ""))
reports$knitr = TRUE

targets = targets(datasets = datasets, analyses = analyses, summaries = summaries,
  output = output, plots = plots, reports = reports)

workflow(targets, sources = "code.R", packages = "MASS",
  begin = c("# Prepend this", "# to the Makefile."))

###############################################
### Now, run remake::make() or the Makefile ###
###############################################

To distribute the workflow over multiple parallel processes, use the Makefile and run the project with make -j [n], where [n] is the number of processes. To deploy parallel processes across multiple nodes of a Slurm cluster, refer to the parallelRemake instructions.

As before, changes to the functions in code.R trigger rebuilds only for the targets that need updating, and the recall() and recallable() functions from parallelRemake can be used to access the storr cache.

There is a second example that shows how to extend these workflows when they do not quite fit the mold. See the README for more information.

downsize

With the downsize package, you can toggle the test and production versions of your workflow with the flip of a TRUE/FALSE global option. This is helpful when your workflow takes a long time to run, you want to test it quickly, and unit testing is too reductionist to cover everything. Just intersperse your code with calls to the downsize() function. For example, say you want to analyze a large dataset.

big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4))

But for the sake of time, you want to test and debug your code on a smaller dataset. In your code, select your dataset with a call to downsize().

library(downsize)
my_data <- downsize(big_data)

The downsize() function provides a choice. Above, my_data becomes big_data if getOption("downsize") is FALSE or NULL (default) and head(big_data) if getOption("downsize") is TRUE. You can toggle the global option downsize with a call to production_mode() or test_mode(), and you can override the option with downsize(..., downsize = L), where L is TRUE or FALSE. Check if you are in test mode or production mode with the my_mode() function.

For instance, take the following script.

library(downsize)
test_mode() # scales the workflow appropriately
my_mode() # shows if the workflow is test or production mode
big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4)) # always large
my_data <- downsize(big_data) # either large or small
nrow(my_data) # responds to test_mode() and production_mode()
# ...more code, time-consuming if my_data is large...

The workflow is in test mode because test_mode() is called. To scale up to production mode, replace test_mode() with production_mode() and keep everything else exactly the same. You can verify the change by seeing that nrow(my_data) returns 10000 instead of 6.

An ideal workflow has multiple calls to downsize() that are configured all at once with a single call to test_mode() or production_mode() at the very beginning. Thus, tedium and human error are avoided, and the test is a close approximation to the actual task at hand.

You also have the option to provide a replacement for big_data.

small_data <- data.frame(x = runif(16), y = runif(16))
my_data <- downsize(big_data, small_data)

Similarly, you can conditionally subset your data or toggle entire code blocks based on the mode (test vs. production) of the workflow. See the package vignette for examples.
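
For instance, here is one way I could gate an expensive block of code myself using the same getOption("downsize") flag that downsize() reads. This is just a sketch, not a feature of the package:

library(downsize)
test_mode()  # sets the downsize option to TRUE
# Skip an expensive step entirely in test mode.
if (!isTRUE(getOption("downsize"))) {
  # ...time-consuming bootstrapping, cross-validation, etc....
}
# Or scale a tuning parameter with the mode.
n_bootstrap <- if (isTRUE(getOption("downsize"))) 10 else 10000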

The downsize package is compatible with remakeGenerator workflows, and the remakeGenerator vignette explains how to use the two together.

packrat

In mid-August of 2016, Eric Nantz of the R-Podcast converted me to packrat (by Kevin Ushey and others at RStudio), a package that lengthens the shelf life of R projects. Packrat maintains local snapshots of dependencies so that your project won't break when external packages are updated. Packrat is fully compatible with all the tools I explained in this post. Just be sure your current working directory is the root directory of your project when you run remake::make() or the Makefile. Also, if you use a shell.sh with your Makefile, be sure to modify module load R so that it points to the version of R corresponding to your packrat library. You can learn more about packrat with the hands-on walkthrough.
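
Getting started is straightforward. A minimal sketch, run from the root directory of the project:

packrat::init()      # set up a project-private library and a packrat/ directory
# ...install the packages the workflow needs into that private library...
packrat::snapshot()  # record the exact package versions in packrat/packrat.lock
packrat::restore()   # later, or on another machine, reinstall those recorded versions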

The R package system

Until recently, I implemented reproducible workflows as formal R packages for the sake of portability, quality control, standardized documentation, and automated testing. Unfortunately, as I show here, remake does not currently play nicely with custom packages. Still, for now, I happily choose remake and my companion toolkit.

ProjectTemplate

The ProjectTemplate package is structurally similar to the R package system, but with an explicit focus on user-defined workflows. In addition, whereas remake focuses on the reproducibility of declarative workflows, ProjectTemplate improves the file management and ergonomics of interactive ones. In the future, ProjectTemplate and remake might work together, and you can follow updates here and here.
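
For comparison, a minimal ProjectTemplate session looks something like the sketch below, where "my-analysis" is just a hypothetical project name.

library(ProjectTemplate)
create.project("my-analysis")  # scaffold the standard directory layout (data/, munge/, src/, etc.)
setwd("my-analysis")
load.project()                 # read the config, load packages and data, and run the munging scripts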