Collaborative & Reproducible Workflows
Lesson 1 with Ian Carroll
This lesson gives a high-level overview of workflows to organize your project and introduces an accompanying software solution, git.
Workflow diagram
A real research workflow
A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science who publishes clear & reproducible workflows. The work leading up to “Pretty Darn Good Control” is on GitHub, a website integrated with the git version control system.
Integrating git with cloud services like GitHub creates a complete system for project management, collaboration and sharing.
A single collaboration model – the centralized workflow – dominates collaborative research. There is a central hub, and everyone synchronizes their work to it. A number of researchers are nodes – consumers of that hub – and synchronize to that one place.
Objectives for this lesson
- Learn about a framework for distributed workflows
- Identify attributes of reproducible research
- Use RStudio to begin managing a workflow
Specific achievements
- Access a public repository through RStudio
- Create a repository on GitHub
- Publish changes to a project file
- Execute a few basic R commands
A Plug for Reproducible Research
Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.
Does the same principle apply to data analysis? You bet!
Hallmarks of reproducible research:
Reviewable | All details of the method used are easily accessible for peer review and community review. |
Auditable | Records exist to document how the methods and conclusions evolved, but may be private. |
Replicable | Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork. |
Open | The originator grants permissions for reuse and extension of the research products. |
Use your workflow to achieve these same goals:
Reviewable | Write-ups and thoroughly-commented scripts shared among collaborators |
Auditable | Versioned project history, used to revert mistakes when necessary |
Replicable | “One-click” file & data sharing, as well as streamlined recreation of analyses |
Open | GitHub (or similar) based centralized workflow |
Also, there’s this …
What’s a GitHub? What’s a “repo”?
Open up the repository that provides the “handouts” for this workshop.
- README.md is a Markdown file giving basic information about the repository.
- There is a list of files, including a folder for data.
- You are looking at a branch called master.
- The commit history is available from the top bar.
- The “Clone or download” button provides a URL.
Centralized Workflow
The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.
Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.
A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is at the same commit as the origin.
A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.
A push copies local commits to the origin and applies them remotely.
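The clone, commit, push, and pull cycle described above can be tried without touching GitHub. The following is only a sketch under stated assumptions: a bare repository in a temporary directory stands in for the origin, and the names origin.git, alice, bob, and notes.txt, along with the demo identity, are hypothetical.

```shell
# Sketch of the centralized workflow with a throwaway local "origin".
# All paths and names (origin.git, alice, bob, notes.txt) are hypothetical.
work=$(mktemp -d)

# Identity used only for these demo commits.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

# The origin: a bare repository standing in for the hub on GitHub.
git init --quiet --bare "$work/origin.git"

# First worker clones the (still empty) origin, commits locally, and pushes.
git clone --quiet "$work/origin.git" "$work/alice" 2>/dev/null
echo "first draft" > "$work/alice/notes.txt"
git -C "$work/alice" add notes.txt
git -C "$work/alice" commit --quiet -m "Add notes"
git -C "$work/alice" push --quiet origin HEAD

# Second worker clones: the initial pull of the whole project history.
git clone --quiet "$work/origin.git" "$work/bob"

# First worker contributes again; second worker pulls to sync up.
echo "second draft" >> "$work/alice/notes.txt"
git -C "$work/alice" commit --quiet -am "Revise notes"
git -C "$work/alice" push --quiet origin HEAD
git -C "$work/bob" pull --quiet --ff-only

# Both clones should now point at the same commit.
alice_head=$(git -C "$work/alice" rev-parse HEAD)
bob_head=$(git -C "$work/bob" rev-parse HEAD)
```

After the final pull, both local copies and the origin hold the same two commits, which is the "synced up" state the centralized workflow aims for.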
RStudio Projects
RStudio is an example of an integrated development environment (IDE) and focuses on
- Creating projects that use the R programming language, and
- Running R language commands or programs in the R interpreter.
R is both a language and an interpreter.
Integration with git
RStudio provides convenient access to the core tools provided by git, so any project can also be a repository.
Under the File menu, create a new project from a remote version control repository.
Files under version control
Software is written in plain text, and version control was designed for software development. A scripted workflow relies heavily on plain text files, but may include different file types for figures or data.
For this reason, a plain text editor is a core element of the IDE. The editor in RStudio is good for any kind of text documents: you could edit R scripts, C++ code, LaTeX documents, or even CSV files.
README.md
To begin making this project your own, modify the README. Tell us something about why you’re here!
# SESYNC Computational Synthesis Institute
Goals
- ...
The “.md” extension stands for “markdown”, which is a syntax for simple plain text “formatting”.
Save your work
Do save the README.md file.
But I mean really save your work, by committing it to your project with a version control system (that’s git!).
- Go to the git tab in RStudio.
- Select commit to open the “Review Changes” window.
- Select the file(s) you want to commit.
- Enter a descriptive message about the commit.
- Commit!
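For readers curious what those steps correspond to outside RStudio, here is a rough command-line equivalent. It is a sketch only: it works in a scratch repository created with mktemp, and the identity, README text, and commit message are placeholders.

```shell
# Scratch repository so the sketch does not touch a real project.
repo=$(mktemp -d)
git init --quiet "$repo"
git -C "$repo" config user.name "Demo User"          # placeholder identity
git -C "$repo" config user.email "demo@example.org"  # placeholder identity

# Edit and save a file, as you did with README.md.
printf '# My project\n' > "$repo/README.md"

# Review changes: untracked files are listed with "??".
git -C "$repo" status --short

# Select the file to commit, then commit with a descriptive message.
git -C "$repo" add README.md
git -C "$repo" commit --quiet -m "Describe the project in the README"

# Inspect the result: one commit, carrying the message entered above.
commit_count=$(git -C "$repo" rev-list --count HEAD)
last_message=$(git -C "$repo" log -1 --format=%s)
```

Selecting files in the “Review Changes” window plays the role of git add here, and the Commit button plays the role of git commit.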
Create a GitHub repository
Create a new repository on your GitHub page. Name it whatever you like, but leave it empty (no README!).
Once it’s created, find the “Clone or download” URL beginning with “https://”.
Configure git
The system function in lesson-1.R sends the string directly to the operating system, which uses the git program itself to do something we can’t do through RStudio.
# Configure git
# Single quotes delimit the R string so the name can keep its double quotes.
system('git config --global user.name "Ian Carroll"')
system("git config --global user.email icarroll@sesync.org")
- Question
- Why did I put quotes around my name but not my e-mail?
- Answer
- A space usually marks the end of a string; quotes are an alternative way to mark where a string begins and ends.
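The answer can be checked directly in a shell, which is where the string passed to system ends up. In this small sketch, count_args is a hypothetical helper that simply reports how many separate arguments it received.

```shell
# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

# Unquoted, the space splits the name into two arguments.
unquoted=$(count_args Ian Carroll)

# Quoted, the name stays together as a single argument.
quoted=$(count_args "Ian Carroll")

# The e-mail address contains no spaces, so it needs no quotes.
email=$(count_args icarroll@sesync.org)

echo "$unquoted $quoted $email"   # prints "2 1 1"
```

Without the quotes, git config would receive the first and last name as two separate arguments instead of one value for user.name.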
Change origin repo
Open “Tools” > “Project Options” > “Git/SVN” and notice that the origin is the SESYNC-CI organization’s URL.
# Set a new origin URL
system("git remote set-url origin https://github.com/%username%/%repo%")
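To see what changing the origin does, the same command can be tried on a scratch repository. This is a sketch under assumptions: both URLs are placeholders rather than real projects, and nothing is contacted over the network because only the local configuration changes.

```shell
# Scratch repository; the URLs below are placeholders, not real projects.
repo=$(mktemp -d)
git init --quiet "$repo"
git -C "$repo" remote add origin https://github.com/example-org/handouts.git

# Point origin at your own (hypothetical) repository instead.
git -C "$repo" remote set-url origin https://github.com/your-username/your-repo.git

# List the remotes to confirm that the fetch and push URLs changed.
git -C "$repo" remote -v

# The stored URL can also be read straight from the repository config.
new_url=$(git -C "$repo" config --get remote.origin.url)
```

Only the address of the origin changes; the commits already in the local clone are untouched and can be pushed to the new location.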
Push
Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
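The "ahead of origin" message counts local commits that the origin has not yet received. A minimal sketch of that bookkeeping follows, assuming a throwaway bare repository as the origin; the paths, file name, and demo identity are hypothetical.

```shell
# Throwaway origin and clone; all names here are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

git init --quiet --bare "$work/origin.git"
git clone --quiet "$work/origin.git" "$work/clone" 2>/dev/null

# First commit, pushed so the local branch starts tracking the origin.
echo "one" > "$work/clone/file.txt"
git -C "$work/clone" add file.txt
git -C "$work/clone" commit --quiet -m "First commit"
git -C "$work/clone" push --quiet --set-upstream origin HEAD

# Second commit stays local for now, so the branch is ahead of the origin.
echo "two" >> "$work/clone/file.txt"
git -C "$work/clone" commit --quiet -am "Second commit"
git -C "$work/clone" status --short --branch
ahead_before=$(git -C "$work/clone" rev-list --count '@{upstream}..HEAD')

# Pushing copies the commit to the origin and closes the gap.
git -C "$work/clone" push --quiet
ahead_after=$(git -C "$work/clone" rev-list --count '@{upstream}..HEAD')

echo "ahead before push: $ahead_before, after push: $ahead_after"
```

The count drops from 1 to 0 once the push succeeds, which is the same change RStudio reports in the “Review Changes” window.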
Where to from here?
The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only to your collaborators, but also available for review and extension by your research community.
Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.
In a hub-and-spokes workflow, sharing project files, including managing multiple “copies” during development or at public release, becomes a streamlined cloning process. The origin always has the “most recent” version of any document, because conflicts must be resolved in the local clone before new commits can be shared.