Reproducible Workflows in RStudio

Instructor: Ian Carroll

As your research project moves from conception, through data collection and analysis, to reporting and other forms of dissemination, the many components can fracture, lose their development history, and – worst of all – become conflicted.

This lesson gives a high-level overview of workflows to organize your project and introduces an accompanying software solution, git.



Visualizing a Workflow


Credit: Philip Guo

Real Research Workflows

A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science. We can navigate to the results of his careful workflows at www.carlboettiger.info.

The work leading up to his “Pretty Darn Good Control” publication is on GitHub, a website integrated with the git version control system. Together they form a system for project management, collaboration, and sharing.



Distributed Workflows

A single collaboration model, the centralized workflow, dominates collaborative research. There is one central hub, and the researchers are nodes (consumers of that hub) who each synchronize their work to that one place.



Objectives for this lesson

Specific achievements



A Plug for Reproducible Research

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

Does the same principle apply to data analysis? You bet!

Hallmarks of reproducible research:

Reviewable: All details of the method used are easily accessible for peer review and community review.
Auditable: Records exist to document how the methods and conclusions evolved, but may be private.
Replicable: Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open: The originator grants permissions for reuse and extension of the research products.

Use your workflow to achieve these same goals:

Reviewable: Write-ups and thoroughly commented scripts shared among collaborators
Auditable: Versioned project history, used to revert mistakes when necessary
Replicable: “One-click” file & data sharing, as well as streamlined recreation of analyses
Open: GitHub (or similar) based centralized workflow



What’s a GitHub? What’s a “repo”?

Open up the repository that provides the “handouts” for this workshop.

Centralized Workflow


Image by Atlassian / CC BY

The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.

Clone the handouts repo in RStudio


Image by Atlassian / CC BY

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. Right after a worker creates a clone, the local copy sits at the same commit as the origin.


Image by Atlassian / CC BY

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.


Image by Atlassian / CC BY

A push copies local commits to the origin and applies them remotely.


Image by Atlassian / CC BY



RStudio Projects

RStudio is an example of an integrated development environment (IDE) and focuses on

  1. Creating projects that use the R programming language, and
  2. Running R language commands or programs in the R interpreter.

R is both a language and an interpreter.

The Console

The interpreter accepts R commands interactively through the console. Basic arithmetic expressions are valid commands in the R language:

1 + 2
[1] 3
4^2/sqrt(4)
[1] 8
Question
Why is the output prefixed by [1]?
Answer
That’s the index, or position in a vector, of the first value printed on that line.

A command that returns a vector of results shows this clearly:

seq(1, 100)
[1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27
[28]  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
[55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
[82]  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
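
To connect the bracketed prefixes to positions in the vector, here is a minimal sketch. The variable name x is chosen only for illustration. Square brackets extract the element at a given position, so position 28 holds the value that starts the second printed row above (where the rows break depends on the width of your console).

```r
# Store the sequence in a variable (the name `x` is illustrative only)
x <- seq(1, 100)

# Extract single elements by position; R counts positions from 1
first_value <- x[1]
value_at_28 <- x[28]

# Count how many elements the vector holds
n_values <- length(x)

print(c(first_value, value_at_28, n_values))
```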

The Editor

The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around. The editor is where you compose scripts that will process data, perform analyses, code up visualizations, and even compose reports.

So that we’re not starting from scratch in the editor, let’s use RStudio to clone the handouts repository.

Open up ‘lesson-1.R’ in the editor, and follow along by replacing the ... placeholders with the code here.

vals <- c(5, 6, 12)

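The elements of this statement, from right to left, are:

c(5, 6, 12): a call to the function c with three arguments, which combines them into a vector
<-: the assignment operator
vals: the variable that receives the result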

Question
Why call vals a “variable” and c a “function”?
Answer
The distinguishing feature is that a function is callable, which is indicated in documentation by writing the function name with empty parens, as in c().

The variable vals holds a vector; if we made it into the column of a table, we’d have our first proper dataset … of sorts. The most common way of holding data in R is within a data.frame, created by a function of the same name.

data <- data.frame(counts = vals)

Print the data simply by entering its name on the console:

data
  counts
1      5
2      6
3     12

Or examine its structure with the str() function:

str(data)
'data.frame':	3 obs. of  1 variable:
 $ counts: num  5 6 12
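
Beyond printing and str(), a few base R accessors are handy for poking at a data.frame. The sketch below rebuilds the same data object so it runs on its own; the names counts_column, n_rows, n_cols and first_cell are illustrative, not part of the handouts.

```r
# Rebuild the objects from the lesson so this block is self-contained
vals <- c(5, 6, 12)
data <- data.frame(counts = vals)

# Pull one column back out as a plain vector
counts_column <- data$counts

# Ask for the table's dimensions
n_rows <- nrow(data)
n_cols <- ncol(data)

# Extract a single value by row and column position
first_cell <- data[[1, 1]]

print(counts_column)
```

The two-index form data[[row, column]] only makes sense for objects that have rows and columns, which matters when you pass other kinds of objects to code written for a data.frame.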

Anatomy of a function

The best way to understand the terminology and workings of R is to compose your own function. Like all programming languages, R has keywords that are reserved for important activities, like creating functions. Keywords are usually very intuitive; the one we need is function.

function(...) {
    ...
    return(...)
}

We’ll make a function to extract the value in the first row and first column of its argument, for which we can choose an arbitrary name:

function(df) {
    result <- df[[1, 1]]
    return(result)
}

Note that df doesn’t exist until we call the function, which gives the recipe for how df will be handled.

Finally, we need to give the function a name so we can use it like we used c() and seq() above.

first <- function(df) {
    result <- df[[1, 1]]
    return(result)
}
first(data)
[1] 5
Question
Can you explain the result of entering first(vals) into the console?
Answer
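The function caused an error because vals is a plain vector with no rows or columns, so the two-index lookup df[[1, 1]] is not valid for it. The error prompted the interpreter to print a helpful error message. Never ignore an error message.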
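
If you would like to inspect such an error without halting a script, base R's tryCatch() can capture it as an object. This is a minimal sketch, not part of the handouts; the names outcome and failed are illustrative, and the exact wording of the message may vary between R versions.

```r
# Rebuild the objects from the lesson so this block is self-contained
vals <- c(5, 6, 12)
data <- data.frame(counts = vals)

first <- function(df) {
    result <- df[[1, 1]]
    return(result)
}

# Capture the error as an object instead of stopping at it
outcome <- tryCatch(first(vals), error = function(e) e)

# TRUE when the call raised an error
failed <- inherits(outcome, "error")

if (failed) {
    # Show the message the interpreter would have printed
    message("first(vals) failed: ", conditionMessage(outcome))
}

# The same function works on an object that has rows and columns
print(first(data))
```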

Save your work

Do save the ‘lesson-1.R’ file.

But I mean really save your work, by committing it to your project and syncing up to a GitHub repository.

  1. Go to the git tab in RStudio
  2. Select commit to open the “Review Changes” window
  3. Select the file(s) you want to commit.
  4. Enter a descriptive message about the commit.
  5. Commit!

Create a GitHub repository

Create a new repository on your GitHub page. Name it whatever you like, but leave it empty (no README!).

Once it’s created, find the “Clone or download” URL.

Change the URL for the origin repo

The system function in lesson-1.R sends the string directly to the operating system, which uses the git program itself to do something we can’t do through RStudio.
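
The call in the handouts is not reproduced here, so the following is only a sketch of what handing a git command to the operating system might look like, assuming the goal is to point origin at your new GitHub repository. The URL is a placeholder, and the names repo_url, cmd and run_it are illustrative. git remote set-url is a standard git subcommand; whether lesson-1.R uses exactly this string is an assumption.

```r
# Placeholder URL (hypothetical): substitute the URL copied from your own GitHub repository
repo_url <- "https://github.com/USERNAME/REPOSITORY.git"

# Build the command string that would be handed to the operating system
cmd <- paste("git remote set-url origin", repo_url)
print(cmd)

# Switch to TRUE only when working inside a project that is already a git repository
run_it <- FALSE
if (run_it) {
    system(cmd)
}
```

Building the string first and printing it lets you check the command before anything touches your repository.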

Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.

Install missing packages

The last thing we’ll do before taking a break is let R install any packages you’ll need today. But we’ll learn something new along the way.

requirements <- c('tidyr',
                  'ggplot2',
                  'RSQLite',
                  'rmarkdown')
missing <- setdiff(requirements,
                   rownames(installed.packages()))

Check, from the console, your number of missing packages:

length(missing) == 0

Your result will be TRUE or FALSE, depending on whether you installed all the packages already. We can let the script decide what to do with this information.

The keyword if is part of the R language’s syntax for flow control. The statement in the body (between { and }) is evaluated only if the condition (between ( and )) evaluates to TRUE.

if (length(missing) != 0) {
  install.packages(missing)
}
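
To see setdiff() and if working together without installing anything, here is a small sketch using made-up package names. Everything in it (wanted, installed, still_missing, status, and the pkg names) is hypothetical; it also adds an else branch, which runs when the condition is FALSE.

```r
# Toy stand-ins (hypothetical package names) so nothing is actually installed
wanted <- c('pkgA', 'pkgB', 'pkgC')
installed <- c('pkgB', 'pkgZ')

# setdiff keeps the elements of the first vector that are absent from the second
still_missing <- setdiff(wanted, installed)

# Choose a message depending on whether anything is missing
if (length(still_missing) != 0) {
    status <- paste("need to install:", paste(still_missing, collapse = ", "))
} else {
    status <- "nothing to install"
}
print(status)
```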



Summary

The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only to your collaborators, but also available for review and extension by your research community.

Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.

Sharing project files, including managing multiple “copies” during development or at public release, in a hub-and-spokes workflow is a streamlined cloning process. The origin always has the “most recent” version of any documents: conflicts must be resolved in the local clone before new commits can be shared.
