Reproducible Workflows in RStudio

Instructor: Ian Carroll

As your research project moves from conception, through data collection and analysis, to reporting and other forms of dissemination, the many components can fracture, lose their development history, and – worst of all – become conflicted.

This lesson gives a high-level overview of workflows to organize your project and introduces an accompanying software solution, git.


Visualizing a Workflow

Credit: Philip Guo

Real Research Workflows

A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science. We can navigate to the results of his careful workflows at www.carlboettiger.info.

The work leading up to his “Pretty Darn Good Control” publication is on GitHub, a website integrated with the git version control system. Together they form a system for project management, collaboration and sharing.


Distributed Workflows

A single collaboration model – the centralized workflow – dominates collaborative research. There is a central hub, and everyone synchronizes their work to it. A number of researchers are nodes – consumers of that hub – and synchronize to that one place.


Objectives for this lesson

Learn about a framework for distributed workflows
Identify attributes of reproducible research
Use RStudio to begin managing a workflow

Specific achievements

Access a public repository through RStudio
Create a repository on GitHub
Publish changes to a project file
Execute a few basic R commands


A Plug for Reproducible Research

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

Does the same principle apply to data analysis? You bet!

Hallmarks of reproducible research:

Reviewable	All details of the method used are easily accessible for peer review and community review.
Auditable	Records exist to document how the methods and conclusions evolved, but may be private.
Replicable	Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open	The originator grants permissions for reuse and extension of the research products.

Use your workflow to achieve these same goals:

Reviewable	Write-ups and thoroughly-commented scripts shared among collaborators
Auditable	Versioned project history, used to revert mistakes when necessary
Replicable	“One-click” file & data sharing, as well as streamlined recreation of analyses
Open	GitHub (or similar) based centralized workflow


What’s a GitHub? What’s a “repo”?

Open up the repository that provides the “handouts” for this workshop.

README.md is a Markdown file giving basic information about the repository.
There is a list of files, including a folder for data.
You are looking at a branch called master.
The commit history is available from the top bar.
The “Clone or download” button provides a URL.

Centralized Workflow

Image by Atlassian / CC BY

The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.

Clone the `handouts` repo in RStudio

Create a new project from version control.
Enter the handouts repository URL.
Choose a location where you want a new folder (containing the cloned repository) created.

Image by Atlassian / CC BY

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is at the same commit as the origin.

Image by Atlassian / CC BY

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.

Image by Atlassian / CC BY

A push copies local commits to the origin and applies them remotely.

Image by Atlassian / CC BY


RStudio Projects

This software is an example of an integrated development environment and focuses on

Creating projects that use the R programming language, and
Running R language commands or programs in the R interpreter.

R is both a language and an interpreter.

The Console

The interpreter accepts R commands interactively through the console. Basic math expressions are valid commands in the R language:

1 + 2

[1] 3

4^2/sqrt(4)

[1] 8

Question: Why is the output prefixed by [1]?
Answer: That’s the index, or position in a vector, of the first result.

A command giving a vector of results shows this clearly:

seq(1, 100)

[1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27
[28]  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
[55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
[82]  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
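
The bracketed prefix can be checked by indexing into the vector directly. This is a minimal sketch; the variable name x is only for illustration, and where each printed row breaks depends on the width of your console.

```r
# Store the sequence so it can be indexed (the name x is illustrative).
x <- seq(1, 100)

# In the output above, the second row starts with [28],
# so the element at position 28 is the first value on that row.
x[28]

# Several positions can be requested at once with a vector of indices.
x[c(1, 28, 55, 82)]
```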

The Editor

The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around. The editor is where you compose scripts that will process data, perform analyses, code up visualizations, and even compose reports.

So we’re not starting from scratch in the editor, let’s use RStudio to clone the handouts repository.

Open up ‘lesson-1.R’ in the editor, and follow along by replacing the ... placeholders with the code here.

vals <- c(5, 6, 12)

The elements of this statement, from right to left are:

) is the closing paren of a function call
5, 6, 12 are three arguments or parameters to the function
( is the opening paren of a function call
c is the name of the function
<- is an operator that assigns what’s named on the left to equal the result of the expression on the right
vals is the name of a variable

Question: Why call vals a “variable” and c a “function”?
Answer: The distinguishing feature is that a function is callable, which is indicated in documentation by writing the function name with empty parens, as in c().
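
One way to see the distinction from the console is to ask R directly. The sketch below uses the base functions is.function() and is.numeric(), which are not part of the handout script.

```r
vals <- c(5, 6, 12)

# c is callable, so R reports it as a function.
is.function(c)

# vals holds data (a numeric vector) and is not callable.
is.function(vals)
is.numeric(vals)
```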

The variable vals holds a vector; if we made it into the column of a table, we’d have our first proper dataset … of sorts. The most common way of holding data in R is within a data.frame, created by a function of the same name.

data <- data.frame(counts = vals)

Print the data simply by entering its name on the console:

data

Or examine its structure with the str() function:

str(data)

'data.frame':	3 obs. of  1 variable:
 $ counts: num  5 6 12
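
A few other base R functions give a quick look at the same data frame. This is a small sketch for exploring from the console, not part of the handout script.

```r
vals <- c(5, 6, 12)
data <- data.frame(counts = vals)

# Dimensions: number of rows (observations) and columns (variables).
nrow(data)
ncol(data)

# Pull the counts column back out as a plain vector.
data$counts

# The same element two ways: by row/column position, or by column name.
data[[1, 1]]
data[1, "counts"]
```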

Anatomy of a function

The best way to understand the terminology and workings of R is to compose your own function. Like all programming languages, R has keywords that are reserved for important activities, like creating functions. Keywords are usually very intuitive; the one we need is function.

function(...) {
    ...
    return(...)
}

We’ll make a function to extract the element in the first row and first column of its argument, for which we can choose an arbitrary name:

function(df) {
    result <- df[[1, 1]]
    return(result)
}

Note that df doesn’t exist until we call the function, which gives the recipe for how df will be handled.

Finally, we need to give the function a name so we can use it like we used c() and seq() above.

first <- function(df) {
    result <- df[[1, 1]]
    return(result)
}

first(data)

[1] 5

Question: Can you explain the result of entering first(vals) into the console?
Answer: The function caused an error, which prompted the interpreter to print a helpful error message. Never ignore an error message.
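
To read an error message without stopping a script, the call can be wrapped in tryCatch(). The following is a sketch; the variable name msg is illustrative, and the exact wording of the message comes from R itself.

```r
vals <- c(5, 6, 12)
first <- function(df) {
    result <- df[[1, 1]]
    return(result)
}

# tryCatch() runs the call and hands any error to the handler,
# which here returns the text of the error message as a string.
msg <- tryCatch(
    first(vals),
    error = function(e) conditionMessage(e)
)
print(msg)
```

A plain vector has only one dimension, so asking for a row and a column with two subscripts is what triggers the error.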

Save your work

Do save the ‘lesson-1.R’ file.

But I mean really save your work, by committing it to your project and syncing up to a GitHub repository.

Go to the git tab in RStudio
Select commit to open the “Review Changes” window
Select the file(s) you want to commit.
Enter a descriptive message about the commit.
Commit!

Create a GitHub repository

Create a new repository on your GitHub page. Name it whatever you like, but leave it empty (no README!).

Once it’s created, find the “Clone or download” URL.

Change the URL for the origin repo

The system function in lesson-1.R sends the string directly to the operating system, which uses the git program itself to do something we can’t do through RStudio.
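
The contents of lesson-1.R are not reproduced here, so the following is only a sketch of what such a call might look like. It assumes the standard git subcommand `git remote set-url origin <URL>`; the helper name set_origin_cmd and the placeholder URL are hypothetical, not taken from the handout.

```r
# Hypothetical helper: build the git command that repoints "origin".
set_origin_cmd <- function(url) {
    paste("git remote set-url origin", shQuote(url))
}

# Placeholder URL; substitute the one copied from your own GitHub repository.
cmd <- set_origin_cmd("https://github.com/USERNAME/REPO.git")
print(cmd)

# Uncomment to hand the string to the operating system.
# system(cmd)
```

Building the string first and printing it makes it easy to check the command before it is actually run.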

Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.

Install missing packages

The last thing we’ll do before taking a break is let R install any packages you’ll need today. But we’ll learn something new along the way.

requirements <- c('tidyr',
                  'ggplot2',
                  'RSQLite',
                  'rmarkdown')
missing <- setdiff(requirements,
                   rownames(installed.packages()))

Check, from the console, your number of missing packages:

length(missing) == 0

Your result will be TRUE or FALSE, depending on whether you installed all the packages already. We can let the script decide what to do with this information.

The keyword if is part of the R language’s syntax for flow control. The statement in the body (between { and }) only evaluates if the condition (between ( and )) evaluates to TRUE.

if (length(missing) != 0) {
  install.packages(missing)
}
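
An if can also be paired with the keyword else, so that something happens in either case. The sketch below only reports what is missing rather than installing anything; the variable name status is illustrative.

```r
requirements <- c('tidyr', 'ggplot2', 'RSQLite', 'rmarkdown')
missing <- setdiff(requirements, rownames(installed.packages()))

# Build a status message; which branch runs depends on the condition.
if (length(missing) != 0) {
    status <- paste("Missing:", paste(missing, collapse = ", "))
} else {
    status <- "All required packages are installed."
}
print(status)
```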


Summary

The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only to your collaborators, but also available for review and extension by your research community.

Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.

In a hub-and-spokes workflow, sharing project files, including managing multiple “copies” during development or at public release, is a streamlined cloning process. The origin always has the “most recent” version of any documents: conflicts must be resolved in the local clone before new commits can be shared.
