Collaborative Workflows & Reproducible Pipelines
Lesson 2 with Ian Carroll
Contents
- Objectives for this lesson
- Centralized Workflow
- A Plug for Reproducible Research
- Reproducible research: the end result
- What’s a GitHub? What’s a “repo”?
- Get Started with GitHub
- RStudio Projects
- Create a new file
- Working with Collaborators
- Where to from here?
Objectives for this lesson
- Learn about centralized workflows
- Identify attributes of reproducible research
- See how RStudio + git facilitates collaboration
Specific achievements
- Create a git repository on GitHub
- Make commits to a project file
- Synchronize to a repository with RStudio
- Collaborate with another GitHub user
Overview
As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.
This lesson gives a high-level strategy to organize your collaborative workflow and introduces accompanying software and cloud solutions.
Centralized Workflow
One strategy for distributed work among a team of scientists—the centralized workflow—dominates collaborative research.
A central hub stores project files and their history. Researchers are spokes on the wheel, working on private copies of the project. Project integrity is maintained by rules, enforced at the hub, for synchronizing between hub and spokes.
A Plug for Reproducible Research
Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.
Does the same principle apply to modeling and data analysis? You bet!
Hallmarks of reproducible research:
Reviewable | All details of the method used are easily accessible for peer review and community review. |
Auditable | Records exist to document how the methods and conclusions evolved, but may be private. |
Replicable | Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork. |
Open | The originator grants permissions for reuse and extension of the research products. |
Let your workflow help achieve these same goals:
Thoroughly comment scripts and share continuously with collaborators | Reviewable |
Maintain project history to correct mistakes when necessary | Auditable |
Provide “one-click” sharing of files & data for a streamlined analysis “pipeline” | Replicable |
Publicly release on GitHub (or similar) with (implied) open licensing | Open |
Reproducible research: the end result
A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science who publishes clear & reproducible pipelines. The work leading up to the paper “Pretty Darn Good Control” is on GitHub, a website integrated with the git version control system.
Integrating git with cloud services like GitHub creates a complete system for project management, collaboration and sharing.
Using collaborative workflows …
… to construct a single pipeline
What’s a GitHub? What’s a “repo”?
Open up the repository that provides the “handouts” for this workshop.
- README.md is a Markdown file giving basic information about the repository.
- There is a list of files, including a folder for data.
- You are looking at a branch called master.
- The commit history is available from the top bar.
- The “Clone or download” button provides a URL.
Centralized Workflow
The origin is the central repository; in this case, it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.
Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.
A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is at the same commit as the origin.
A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.
A push copies local commits to the origin and applies them remotely.
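The clone, push and pull cycle described above can be tried without a GitHub account. The following is a minimal sketch, assuming a local bare repository (`hub.git`) stands in for the origin and two hypothetical workers, Alice and Bob, each have their own copy; all names and file contents are illustrative.

```shell
set -e                                   # stop at the first failing command
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"

# A bare repository plays the role of the origin (the hub on GitHub).
git init -q --bare hub.git
git -C hub.git symbolic-ref HEAD refs/heads/master

# Alice starts the project and pushes the first commit to the origin.
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice Example"     # hypothetical identity
git config user.email "alice@example.org"
echo "# handouts" > README.md
git add README.md
git commit -q -m "Add README"
git remote add origin "$work/hub.git"
git push -q -u origin master

# Bob clones: the initial pull of the whole project and its history.
cd "$work"
git clone -q hub.git bob
cd bob
git config user.name "Bob Example"       # hypothetical identity
git config user.email "bob@example.org"
echo "Data lives in the data folder." >> README.md
git commit -q -a -m "Describe the data folder"
git push -q origin master                # copy Bob's commit to the origin

# Alice pulls Bob's commit from the origin into her local repo.
cd "$work/alice"
git pull -q --ff-only origin master
git log --oneline                        # two commits, newest first
```

After the last step, Alice's clone, Bob's clone and the hub all point at the same commit, which is what "syncing up" means here.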
Get Started with GitHub
Sign in or create a GitHub account.
Create a GitHub repository
Create a new repository on your GitHub page.
- Name the repository “handouts”.
- Add a short “tag line” for your Summer Institute experience.
- Leave all the boxes (including the “README”) un-checked.
Empty repository
You have created an empty repository. The quick start information is emphasizing that you’ll need the URL beginning with “git@github…”. We’ll get to that in a minute.
RStudio Projects
RStudio is an example of an integrated development environment (IDE) and focuses on
- Editing scripts written in the R language.
- Running R language commands or programs in the R interpreter.
- Helping to manage many components of a collaborative project using version control.
Integration with git
RStudio provides convenient access to the core tools provided by git, so any project can also be a repository. Under the File menu, create a new project from a remote version control repository.
Configure git
Every commit has an author. For GitHub to attribute commits to your account, configure git with your GitHub username and associated e-mail address.
## Configure git
git config --global user.name ...
git config --global user.email ...
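To see how the configured identity ends up on a commit, here is a small sketch that sets the same two values for one throwaway repository only (no `--global`), so it does not touch your real settings. The name and address are placeholders, not values from the lesson.

```shell
set -e
demo=$(mktemp -d)                              # throwaway sandbox directory
cd "$demo"
git init -q

# Same keys as above, but stored in this repository's own config.
git config user.name "Jane Researcher"         # hypothetical name
git config user.email "jane@example.org"       # hypothetical address

git config --get user.name                     # read the values back
git config --get user.email

# Every commit records the configured author.
echo "notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git log -1 --format='%an <%ae>'                # author name and e-mail of the commit
```

GitHub matches the e-mail address recorded on a commit against the addresses on your account, which is why the address you configure should be one associated with your GitHub account.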
Configure your clone
The “handouts” repository is currently linked—via URL—to the “hub” you cloned from SESYNC’s repositories on GitHub. To transfer the repository to the newly created repository owned by you, set the URL to the one provided and push all the things.
## Change the "origin" remote URL and push
git remote set-url origin ...
git push --all
Save your worksheet-2.sh and select “Run Script” to execute these shell commands.
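The effect of re-pointing the origin can be checked in a sandbox. This sketch assumes two local bare repositories, `old-hub.git` and `new-hub.git` (hypothetical names), standing in for the repository you cloned from and the empty one you just created.

```shell
set -e
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"
git init -q --bare old-hub.git           # stands in for the hub you cloned from
git init -q --bare new-hub.git           # stands in for your new, empty repository

# A local project whose origin points at the old hub.
git init -q handouts
cd handouts
git symbolic-ref HEAD refs/heads/master
git config user.name "Jane Researcher"   # hypothetical identity
git config user.email "jane@example.org"
echo "# handouts" > README.md
git add README.md
git commit -q -m "Add README"
git remote add origin "$work/old-hub.git"

git remote -v                            # origin -> old-hub.git
git remote set-url origin "$work/new-hub.git"
git remote -v                            # origin -> new-hub.git

# Push every local branch to the new origin.
git push -q --all
git -C "$work/new-hub.git" log --oneline master
```

Only the URL stored under the name "origin" changes; the local commits are untouched and are simply copied to the new location by the push.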
Files under version control
A scripted pipeline relies heavily on plain text files (the scripts), but may include different file types for figures or data. Any file in this directory that is under version control is monitored for differences from the committed state of the project. Files must be added to at least one commit before they are tracked.
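The difference between tracked and untracked files shows up in `git status`. A minimal sketch, using two made-up files in a throwaway repository:

```shell
set -e
demo=$(mktemp -d)                        # throwaway sandbox directory
cd "$demo"
git init -q
git config user.name "Jane Researcher"   # hypothetical identity
git config user.email "jane@example.org"

echo 'x <- 1' > analysis.R               # illustrative script
echo 'a,b' > data.csv                    # illustrative data file
git status --short                       # both files are marked "??" (untracked)

# Adding a file to a commit puts it under version control.
git add analysis.R
git commit -q -m "Add analysis script"

echo 'y <- 2' >> analysis.R              # change the tracked file
git status --short                       # " M analysis.R" and "?? data.csv"
git ls-files                             # tracked files only: analysis.R
```

Only `analysis.R` is monitored for differences; `data.csv` stays invisible to the project history until it is added to a commit.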
Commit & push
The first changes you made to the handouts repository are your edits to worksheet-2.sh. You have saved them, but you haven’t committed them to the repository.
- On the git tab in RStudio, select commit
- Check the modifications to “Stage”
- Add a commit message
- Commit
- Push
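The RStudio steps above map onto a handful of git commands. The sketch below runs them in a sandbox, assuming a local bare repository (`origin.git`) in place of GitHub; the worksheet contents are shortened for illustration.

```shell
set -e
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"
git init -q --bare origin.git            # stands in for your GitHub repository
git init -q handouts
cd handouts
git symbolic-ref HEAD refs/heads/master
git config user.name "Jane Researcher"   # hypothetical identity
git config user.email "jane@example.org"
git remote add origin "$work/origin.git"
echo '## Configure git' > worksheet-2.sh
git add worksheet-2.sh
git commit -q -m "Add worksheet 2"
git push -q -u origin master

# An edit that is saved but not yet committed.
echo '## Change the "origin" remote URL and push' >> worksheet-2.sh
git diff --stat                          # review the modification
git add worksheet-2.sh                   # stage it (the "Stage" checkbox)
git commit -q -m "Add remote URL step"   # commit message + Commit
git push -q                              # Push
git log --oneline
```

RStudio runs equivalent operations behind its buttons, so either route leaves the repository in the same state.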
Create a new file
Create a new text file in the RStudio editor as below, save it as collaborators.md, and add yourself as the first collaborator.
## Project Collaborators
- ...
- My neighbor!
In the final part of the lesson, we’ll have a project collaborator replace “My neighbor!” with his or her name.
Track it with git
Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) to watch it.
- Go to the git tab in RStudio
- Select commit to open the “Review Changes” window
- Select “Staged” to add (hence “A”) the new file.
- Enter a descriptive message about the commit.
- Commit!
Push
Look at the git tab again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
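Both the “A” status of a newly staged file and the "ahead of origin/master" notice can be reproduced on the command line. This is a sketch in a sandbox, assuming a local bare repository (`origin.git`) in place of GitHub and a placeholder collaborator name.

```shell
set -e
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"
git init -q --bare origin.git            # stands in for your GitHub repository
git init -q handouts
cd handouts
git symbolic-ref HEAD refs/heads/master
git config user.name "Jane Researcher"   # hypothetical identity
git config user.email "jane@example.org"
git remote add origin "$work/origin.git"
echo "# handouts" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin master

# A new file starts out untracked, then is staged as an addition.
printf '## Project Collaborators\n- Jane Researcher\n- My neighbor!\n' > collaborators.md
git status --short                       # "?? collaborators.md"
git add collaborators.md
git status --short                       # "A  collaborators.md"
git commit -q -m "Add list of collaborators"

# The local branch is now one commit ahead of the origin.
git status -sb                           # "## master...origin/master [ahead 1]"
ahead_before=$(git rev-list --count origin/master..HEAD)
git push -q
ahead_after=$(git rev-list --count origin/master..HEAD)
echo "ahead before push: $ahead_before, after push: $ahead_after"
```

The count drops to zero after the push because the origin has received the commit and the local record of origin/master is updated with it.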
Exercise 1
Modify the collaborators.md file again to add a third “TBD” collaborator, and push the modification to the origin as one commit.
Working with Collaborators
True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.
- Be it data, a script, or a write-up, who has the most up-to-date version?
- Will a teammate’s work overwrite any of your own?
- How do I recover the working version of code the PI broke?
Centralized workflows, managed by git, help answer these questions.
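As one possible answer to the last question, the sketch below recovers a working script after a later commit broke it, using `git revert`. The script contents and commit messages are invented for illustration.

```shell
set -e
demo=$(mktemp -d)                        # throwaway sandbox directory
cd "$demo"
git init -q
git config user.name "Jane Researcher"   # hypothetical identity
git config user.email "jane@example.org"

# A working version is committed...
echo 'total <- sum(c(1, 2, 3))' > analysis.R
git add analysis.R
git commit -q -m "Working analysis"
good=$(git rev-parse HEAD)               # remember the good commit

# ...and a later commit breaks it.
echo 'total <- sum(c(1, 2' > analysis.R
git commit -q -a -m "Refactor sum (breaks the script)"

# Undo the breaking commit with a new commit that reverses it.
git revert --no-edit HEAD
git diff --quiet "$good" HEAD            # exits 0: files match the good commit
cat analysis.R
git log --oneline                        # history keeps all three commits
```

Reverting adds a commit rather than erasing one, so the mistake and its correction both stay in the auditable history. To restore a single file from an older commit instead, `git checkout <commit> -- <file>` is an alternative.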
Project Integrity
- The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
- Diverging files are easily reintegrated with a merge algorithm.
- The complete project history is available to check out.
Note, version control works really well with text. Non-textual components of your project (e.g. large or binary data) need advanced treatment.
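To illustrate how diverging text files are reintegrated, here is a sketch in which two hypothetical workers edit different lines of the same file and git merges the edits automatically. A local bare repository (`hub.git`) stands in for the origin.

```shell
set -e
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"
git init -q --bare hub.git               # stands in for the origin on GitHub
git -C hub.git symbolic-ref HEAD refs/heads/master

git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice Example"     # hypothetical identity
git config user.email "alice@example.org"
printf 'line one\nline two\nline three\nline four\nline five\n' > notes.md
git add notes.md
git commit -q -m "Start notes"
git remote add origin "$work/hub.git"
git push -q -u origin master

# Bob clones, edits the last line, and pushes first.
cd "$work"
git clone -q hub.git bob
cd bob
git config user.name "Bob Example"       # hypothetical identity
git config user.email "bob@example.org"
sed 's/line five/line five (Bob)/' notes.md > notes.tmp && mv notes.tmp notes.md
git commit -q -a -m "Bob edits the last line"
git push -q origin master

# Alice edits the first line; her history has now diverged from the origin.
cd "$work/alice"
sed 's/line one/line one (Alice)/' notes.md > notes.tmp && mv notes.tmp notes.md
git commit -q -a -m "Alice edits the first line"

# Pulling merges the two lines of work into one file.
git pull -q --no-rebase --no-edit origin master
cat notes.md                             # contains both edits
git log --oneline --graph                # the history branches and rejoins
```

The merge succeeds without manual intervention because the two edits touch separate parts of the file; edits to the same lines would instead produce a conflict to resolve by hand.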
The first step to collaborative workflows is granting access to the origin of your project. Introduce yourself to your neighbor, and ask for his or her GitHub username.
Add your neighbor as a collaborator, and accept your neighbor’s invitation to collaborate!
Editing on GitHub
Go to your neighbor’s repository on GitHub, and open collaborators.md. The text below shows “My neighbor!” where you should see your neighbor’s name. Edit the file in your neighbor’s repo by replacing the remaining “My neighbor!” with your own name.
## Project Collaborators
- My neighbor!
- ...
- TBD
Always write a meaningful commit message when you save!
Exercise 2
Create a new RStudio project from your neighbor’s repository. Note that the name you choose during project creation in RStudio does not have to be “handouts”, i.e. it does not have to match the name of the repository on GitHub. Make further changes to the collaborators.md file, then commit & push.
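The point about project names can be checked with a plain `git clone`, which accepts a target directory as a second argument. In this sketch a local bare repository named `handouts.git` stands in for the neighbor's GitHub repository, and `neighbor-project` is an arbitrary local name.

```shell
set -e
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"

# Build a stand-in for the neighbor's repository with one commit.
git init -q seed
cd seed
git config user.name "Neighbor Example"  # hypothetical identity
git config user.email "neighbor@example.org"
printf '## Project Collaborators\n- Neighbor Example\n- My neighbor!\n' > collaborators.md
git add collaborators.md
git commit -q -m "Add list of collaborators"
cd "$work"
git clone -q --bare seed handouts.git

# Clone into a directory whose name differs from the repository's.
git clone -q "$work/handouts.git" neighbor-project
cd neighbor-project
git remote get-url origin                # still points at handouts.git
git log --oneline                        # full history came along
```

The local directory name is only a label on your machine; the link back to the repository is the origin URL.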
Where to from here?
- The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only by your collaborators, but also available for review and extension by your research community.
- Sharing project files, including managing multiple “copies” during development or at public release, in a hub-and-spokes workflow is a streamlined cloning process. The origin always has the “gold standard” version of any documents: conflicts must be resolved in the local clone before new commits can be shared.
- Using advanced git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.
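The rule that differences must be reconciled in the local clone before sharing is enforced by the origin itself: a push is refused when the origin holds commits the local clone lacks. The following sketch, with a local bare repository (`hub.git`) in place of GitHub and two hypothetical workers, shows the refusal and the pull that clears it.

```shell
set -e
work=$(mktemp -d)                        # throwaway sandbox directory
cd "$work"
git init -q --bare hub.git               # stands in for the origin on GitHub
git -C hub.git symbolic-ref HEAD refs/heads/master

git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice Example"     # hypothetical identity
git config user.email "alice@example.org"
echo "# handouts" > README.md
git add README.md
git commit -q -m "Add README"
git remote add origin "$work/hub.git"
git push -q -u origin master

# Bob clones, commits, and pushes first.
cd "$work"
git clone -q hub.git bob
cd bob
git config user.name "Bob Example"       # hypothetical identity
git config user.email "bob@example.org"
echo "Bob's notes" > bob.txt
git add bob.txt
git commit -q -m "Add Bob's notes"
git push -q origin master

# Alice commits without pulling, so her push is refused by the origin.
cd "$work/alice"
echo "Alice's notes" > alice.txt
git add alice.txt
git commit -q -m "Add Alice's notes"
if git push -q origin master 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi
echo "first push: $first_push"

# Reconcile locally, then share.
git pull -q --no-rebase --no-edit origin master
git push -q origin master
git log --oneline --graph
```

Because the reconciliation happens in Alice's clone, the origin only ever receives history that already includes everything it had.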