Reproducible Workflows in RStudio
Instructor: Ian Carroll
As your research project moves from conception, through data collection and analysis, to reporting and other forms of dissemination, the many components can fracture, lose their development history, and – worst of all – become conflicted.
This lesson gives a high-level overview of workflows to organize your project and introduces an accompanying software solution, git.
Visualizing a Workflow
Real Research Workflows
A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science. We can navigate to the results of his careful workflows at www.carlboettiger.info.
The work leading up to his “Pretty Darn Good Control” publication is on GitHub, a website integrated with the git version control system. Together they form a system for project management, collaboration and sharing.
Distributed Workflows
A single collaboration model – the centralized workflow – dominates collaborative research. There is a central hub, and everyone synchronizes their work to it. A number of researchers are nodes – consumers of that hub – and synchronize to that one place.
Objectives for this lesson
- Learn about a framework for distributed workflows
- Identify attributes of reproducible research
- Use RStudio to begin managing a workflow
Specific achievements
- Access a public repository through RStudio
- Create a repository on GitHub
- Publish changes to a project file
- Execute a few basic R commands
A Plug for Reproducible Research
Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.
Does the same principle apply to data analysis? You bet!
Hallmarks of reproducible research:
Reviewable | All details of the method used are easily accessible for peer review and community review. |
Auditable | Records exist to document how the methods and conclusions evolved, but may be private. |
Replicable | Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork. |
Open | The originator grants permissions for reuse and extension of the research products. |
Use your workflow to achieve these same goals:
Reviewable | Write-ups and thoroughly-commented scripts shared among collaborators |
Auditable | Versioned project history, used to revert mistakes when necessary |
Replicable | “One-click” file & data sharing, as well as streamlined recreation of analyses |
Open | GitHub (or similar) based centralized workflow |
What’s a GitHub? What’s a “repo”?
Open up the repository that provides the “handouts” for this workshop.
- README.md is a Markdown file giving basic information about the repository.
- There is a list of files, including a folder for data.
- You are looking at a branch called master.
- The commit history is available from the top bar.
- The “Clone or download” button provides a URL.
Centralized Workflow
The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.
Clone the handouts repo in RStudio
- Create a new project from version control.
- Enter the handouts repository URL.
- Choose a location where you want a new folder (containing the cloned repository) created.
Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.
A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is in the same place as the origin.
A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.
A push copies local commits to the origin and applies them remotely.
RStudio Projects
This software is an example of an integrated development environment and focuses on
- Creating projects that use the R programming language, and
- Running R language commands or programs in the R interpreter.
R is both a language and an interpreter.
The Console
The interpreter accepts R commands interactively through the console. Basic math expressions are valid commands in the R language:
1 + 2
[1] 3
4^2/sqrt(4)
[1] 8
- Question
- Why is the output prefixed by [1]?
- Answer
- That’s the index, or position in a vector, of the first result.
A command giving a vector of results shows this clearly:
seq(1, 100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[28] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
[82] 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
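The bracketed prefix can be checked by indexing into the vector. The following is a minimal sketch; the variable name x is an arbitrary choice and not part of the handouts. Where the printed lines wrap, and so which indices appear in brackets, depends on the width of your console.

```r
# Store the sequence so individual positions can be inspected.
x <- seq(1, 100)

# Square brackets extract the element at a given position (R counts from 1).
x[1]
x[28]

# A single number is just a vector of length one, hence the [1] prefix.
length(1 + 2)
```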
The Editor
The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around. The editor is where you compose scripts that will process data, perform analyses, code up visualizations, and even compose reports.
So that we’re not starting from scratch in the editor, let’s use RStudio to clone the handouts repository.
Open up ‘lesson-1.R’ in the editor, and follow along by replacing the ... placeholders with the code here.
vals <- c(5, 6, 12)
The elements of this statement, from right to left, are:
- ) is the closing paren of a function call
- 5, 6, 12 are three arguments or parameters to the function
- ( is the opening paren of a function call
- c is the name of the function
- <- is an operator that assigns what’s named on the left to equal the result of the expression on the right
- vals is the name of a variable
- Question
- Why call vals a “variable” and c a “function”?
- Answer
- The distinguishing feature is that a function is callable, which is indicated in documentation by writing the function name with empty parens, as in c().
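The distinction can also be tested from the console. This is a small sketch rather than part of the handouts; the name combine is hypothetical and chosen only for illustration.

```r
vals <- c(5, 6, 12)

# is.function() reports whether an object is callable.
is.function(c)     # c can be called, as in c(...)
is.function(vals)  # vals only holds data

# Functions are objects too, so a function can be given a second name.
combine <- c       # 'combine' is a hypothetical name for illustration
combine(1, 2)
```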
The variable vals holds a vector; if we made it into the column of a table, we’d have our first proper dataset … of sorts. The most common way of holding data in R is within a data.frame, created by a function of the same name.
data <- data.frame(counts = vals)
Print the data simply by entering its name on the console:
data
counts
1 5
2 6
3 12
Or examine its structure with the str() function:
str(data)
'data.frame': 3 obs. of 1 variable:
$ counts: num 5 6 12
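Beyond printing and str(), a few base R calls let you pull pieces back out of the data frame. The sketch below rebuilds the same one-column table so that it runs on its own.

```r
vals <- c(5, 6, 12)
data <- data.frame(counts = vals)

# Dimensions of the table: rows are observations, columns are variables.
nrow(data)
ncol(data)

# Pull a whole column back out as a vector, by name.
data$counts

# Extract a single value by row and column position.
data[[2, 1]]
```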
Anatomy of a function
The best way to understand the terminology and workings of R is to compose your own function. Like all programming languages, R has keywords that are reserved for important activities, like creating functions. Keywords are usually very intuitive; the one we need is function.
function(...) {
  ...
  return(...)
}
We’ll make a function to extract the first row and column of its argument, for which we can choose an arbitrary name:
function(df) {
  result <- df[[1, 1]]
  return(result)
}
Note that df doesn’t exist until we call the function; the function definition only gives the recipe for how df will be handled.
Finally, we need to give the function a name so we can use it like we used c() and seq() above.
first <- function(df) {
  result <- df[[1, 1]]
  return(result)
}
first(data)
[1] 5
- Question
- Can you explain the result of entering first(vals) into the console?
- Answer
- The function caused an error, which prompted the interpreter to print a helpful error message. Never ignore an error message.
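One way to read an error message without stopping a script is to catch it. The sketch below uses base R's tryCatch() and conditionMessage(); the variable msg is a name introduced here for illustration. The call fails because vals is a plain vector with a single dimension, so asking for a row and a column at once is not meaningful.

```r
first <- function(df) {
  result <- df[[1, 1]]
  return(result)
}

vals <- c(5, 6, 12)

# tryCatch() runs the call and, if it fails, hands the error to a handler
# instead of stopping the script. Here the handler returns the message text.
msg <- tryCatch(
  first(vals),
  error = function(e) conditionMessage(e)
)
msg
```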
Save your work
Do save the ‘lesson-1.R’ file.
But I mean really save your work, by committing it to your project and syncing up to a GitHub repository.
- Go to the git tab in RStudio
- Select commit to open the “Review Changes” window
- Select the file(s) you want to commit.
- Enter a descriptive message about the commit.
- Commit!
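For orientation, the steps above correspond roughly to the git commands add and commit. The sketch below is an assumption about how that could look from R, not something taken from the handouts: run_git is a hypothetical helper, and the commit message is made up. By default it only returns the command text, so nothing in your repository changes.

```r
# Hypothetical helper: builds a git command and, unless dry_run is FALSE,
# only returns it as text so nothing in the repository changes.
run_git <- function(args, dry_run = TRUE) {
  if (dry_run) {
    return(paste("git", paste(args, collapse = " ")))
  }
  system2("git", args)
}

# Stage the file, then record the staged changes with a message.
run_git(c("add", "lesson-1.R"))
run_git(c("commit", "-m", shQuote("Fill in lesson-1.R placeholders")))
```

Calling run_git(..., dry_run = FALSE) from inside the project folder would hand the same arguments to the git program.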
Create a GitHub repository
Create a new repository on your GitHub page and name it whatever you like, but leave it empty (no README!).
Once it’s created, find the “Clone or download” URL.
Change the URL for the origin repo
The system function in lesson-1.R sends the string directly to the operating system, which uses the git program itself to do something we can’t do through RStudio.
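The exact string in lesson-1.R is not reproduced here. As an assumption about what such a call might look like, the sketch below builds a git remote set-url command for a placeholder URL and leaves the system() calls commented out. The names new_origin and cmd, and the URL itself, are hypothetical.

```r
# Placeholder URL: replace with the one copied from your own GitHub repository.
new_origin <- "https://github.com/USER/REPO.git"

# Build the command text; shQuote() protects the URL from the shell.
cmd <- paste("git remote set-url origin", shQuote(new_origin))
cmd

# Run these from within the cloned project once the URL is real:
# system(cmd)
# system("git remote -v")  # lists the URL now recorded for origin
```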
Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
Install missing packages
The last thing we’ll do before taking a break is let R install any packages you’ll need today. But we’ll learn something new along the way.
requirements <- c('tidyr',
                  'ggplot2',
                  'RSQLite',
                  'rmarkdown')
missing <- setdiff(requirements,
                   rownames(installed.packages()))
Check, from the console, whether any packages are missing:
length(missing) == 0
Your result will be TRUE or FALSE, depending on whether you installed all the packages already. We can let the script decide what to do with this information.
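To see what setdiff() contributes on its own, here is a small sketch with made-up inputs; the vectors wanted and have are hypothetical stand-ins for the real package lists.

```r
# Hypothetical inputs standing in for the real package lists.
wanted <- c("tidyr", "ggplot2", "RSQLite")
have <- c("ggplot2", "stats")

# setdiff(x, y) keeps the elements of x that do not appear in y.
still_needed <- setdiff(wanted, have)
still_needed

# An empty result would mean nothing is missing.
length(still_needed) == 0
```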
The keyword if is part of the R language’s syntax for flow control. The statement in the body (between { and }) only evaluates if the argument (between ( and )) evaluates to TRUE.
if (length(missing) != 0) {
  install.packages(missing)
}
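An if statement can be paired with else to handle the opposite case. The sketch below is illustrative only: it sets missing to an empty vector by hand, and the variable status is a name introduced here.

```r
missing <- character(0)  # hypothetical: pretend nothing is missing

if (length(missing) != 0) {
  status <- paste("Need to install:", paste(missing, collapse = ", "))
} else {
  # The else branch runs only when the condition is FALSE.
  status <- "All required packages are installed."
}
status
```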
Summary
The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it not only accessible to your collaborators, but also available for review and extension by your research community.
Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.
Sharing project files, including managing multiple “copies” during development or at public release, in a hub-and-spokes workflow is a streamlined cloning process. The origin always has the “most recent” version of any documents: conflicts must be resolved in the local clone before new commits can be shared.