Collaborative Workflows & Reproducible Pipelines

Lesson 3 with Ian Carroll

Centralized Workflow

As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost. This lesson introduces a high-level strategy for organizing your collaborative workflow, along with the necessary software and cloud solutions. Called “the centralized workflow”, this strategy targets distributed work by equal contributors on a shared codebase and is widespread in collaborative research.

A central hub stores project files and their history. Researchers are spokes on the wheel, each working on a local clone of the project. Project integrity is maintained by rules for synchronizing commits between the hub and spokes when users execute a push or pull on their local clone.

Reproducible Pipeline

The result of reproducible research is more than a published paper; it includes the whole data-to-document pipeline. In a typical socio-environmental synthesis project, a finished pipeline includes the following steps:

Acquire raw data from an online repository.
Extract, transform, and load data into storage for analysis.
Perform data analysis (e.g. model inference) and visualization.
Update documentation and reports.
Publish results, including reports, data and code/software.
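
The steps above can be sketched end to end as a toy shell script. Everything in it is invented for illustration: the file names, the made-up biomass table that stands in for a downloaded data set, and the choice of awk for the transform and analysis steps. A real pipeline would more likely call R or Python scripts at each stage.

```shell
set -e
work="$(mktemp -d)"
cd "$work"
mkdir -p raw clean results

# 1. Acquire raw data (a here-document stands in for a download; values are made up).
cat > raw/plots.csv <<'EOF'
site,year,biomass
A,2015,10
A,2016,14
B,2015,NA
B,2016,6
EOF

# 2. Extract, transform, load: drop the header and rows with missing values.
awk -F, 'NR > 1 && $3 != "NA"' raw/plots.csv > clean/plots.csv

# 3. Analyze: mean biomass across the remaining rows.
awk -F, '{ total += $3 } END { printf "%.1f\n", total / NR }' clean/plots.csv > results/mean.txt

# 4. Update the report from the analysis output.
printf 'Mean biomass: %s\n' "$(cat results/mean.txt)" > results/report.md

# 5. Publishing would share the report, data and this script; here we only print.
cat results/report.md
```

Because every step reads the previous step's output from disk, rerunning the script from the top regenerates the report from the raw data.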

A UC Berkeley professor who is a strong advocate for open science, Carl Boettiger, has released several reproducible pipelines on GitHub. For example, check out his work leading up to the paper “Pretty Darn Good Control” in the project pdg_control.

Workflow vs. Pipeline (a weak analogy)

Workflow describes how your team collaboratively creates the code, software environment and integrations that comprise the pipeline. By analogy to a physical pipeline that moves raw material to finished product, your workflow involves everything from drafting plans to testing the product.

Collaborative workflows require communication—developing a pipeline under version control facilitates it.

Lesson Objectives

Learn about centralized workflows
Identify attributes of reproducible pipelines
See how RStudio + git facilitate collaboration

Specific Achievements

Make “commits” to a repository with git
“Push” and “pull” project work to GitHub
“Merge” your work with a collaborator’s via GitHub

RStudio + git

RStudio provides a GUI to the core tools provided by git. Log in to your RStudio Server account and upload handouts.zip. Click on “handouts.Rproj” to open the directory as a “project”.

Create a Spoke

Any RStudio “project” can also be a local git repository. Under the File menu, create a new project from the existing directory of worksheets and other files.

If the project already existed on GitHub or another git server, you would instead want to clone the project by choosing “Version Control”.

Initialize git

Convert your RStudio project to a git repository by enabling version control, available from the menu bar under Tools > Version Control > Project Setup.

Adding a git repository creates a hidden folder in your project called “.git”, storing all the data about your project’s current and past state.
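
For readers who prefer the command line, here is a minimal sketch of roughly what enabling version control does, assuming git is installed. The scratch directory and the `handouts` folder name are stand-ins for your own project folder.

```shell
set -e
# Scratch directory standing in for your RStudio project folder (hypothetical).
project="$(mktemp -d)/handouts"
mkdir -p "$project"

# Turn the folder into a git repository; this is what creates ".git".
git init -q "$project"

# The hidden folder only shows up when listing all files.
ls -a "$project"
```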

Commit

Once RStudio refreshes your project, there will be a “Git” tab in the same window as the Environment tab. The tab shows files that have content not already committed in the current state of your project. Choose “Commit” to open a new window for easy staging and committing.

check README.md and handouts.Rproj
write a commit message
commit (but heed the warning!)

Saving, staging, and committing are each separate steps, none of which implies any of the others. This may seem like a hassle, but it is a good thing! As your project grows larger, you will frequently save changes you don’t want to commit: staging lets you choose what changes get packaged into a commit.
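
The three steps can be told apart on the command line with `git status --porcelain`, which prints one short status line per changed file. This is a sketch in a throwaway repository; the identity values are placeholders.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.org"

# 1. Save: the file exists on disk, but git is not tracking it yet.
echo "# Handouts" > README.md
after_save="$(git status --porcelain)"      # "?? README.md"

# 2. Stage: the change is selected for the next commit.
git add README.md
after_stage="$(git status --porcelain)"     # "A  README.md"

# 3. Commit: the staged change becomes part of the history.
git commit -q -m "Add README"
after_commit="$(git status --porcelain)"    # empty: nothing left to commit

printf '%s\n' "$after_save" "$after_stage" "after commit: [$after_commit]"
```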

History

The history of your project shows a single commit; every new commit will be chained on top of a preceding commit. Note that the “Author” data will probably not be recognized by GitHub and linked to your account.

For GitHub to associate commits with your account, configure git with your GitHub username and email address.

git config --global user.name itcarroll
git config --global user.email icarroll@sesync.org
git commit --no-edit --amend --reset-author

Revisit the commit history to confirm that the author information has been amended for the first commit. In the future, configure your user.name and user.email before starting a project, so you do not have to amend any commits.
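
The effect of the amend can be checked from the command line as well. The sketch below works in a throwaway repository and sets the identity per repository rather than with `--global`, so it leaves your real settings alone. The initial “rstudio” identity is a made-up placeholder for whatever your server filled in.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# First commit made under a placeholder identity that GitHub would not recognize.
echo "# Handouts" > README.md
git add README.md
git -c user.name="rstudio" -c user.email="rstudio@localhost" commit -q -m "Initial commit"
before="$(git log -1 --format='%an <%ae>')"

# Configure the identity (repo-local here; the lesson uses --global).
git config user.name "itcarroll"
git config user.email "icarroll@sesync.org"

# Rewrite the latest commit so that it carries the configured author.
git commit -q --amend --no-edit --reset-author
after="$(git log -1 --format='%an <%ae>')"

echo "before: $before"
echo "after:  $after"
```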

Create the Hub

Create a new repository in your GitHub account to serve as the hub.

Give your repository the same name as your RStudio project.
Add a short “tag line” about your workshop experience.
Do not check either box.

You have created a repository that has no history—it will accept the commits made in RStudio without conflict. The quick start information provided by GitHub explains how to finish configuration of your local git repo.

git remote add origin https://github.com/itcarroll/handouts.git
git push -u origin master
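
To rehearse these two commands without a network connection, a bare repository on disk can stand in for the empty GitHub repo. This is only a sketch: the local path replaces the GitHub URL, and the identity values are placeholders.

```shell
set -e
work="$(mktemp -d)"

# A bare repository stands in for the empty GitHub repo (assumption for offline use).
git init -q --bare "$work/hub.git"
git -C "$work/hub.git" symbolic-ref HEAD refs/heads/master

# A local project with one commit, as created in RStudio.
git init -q "$work/handouts"
cd "$work/handouts"
git symbolic-ref HEAD refs/heads/master     # make sure the branch is called master
git config user.name "Example User"
git config user.email "user@example.org"
echo "# Handouts" > README.md
git add README.md
git commit -q -m "Initial commit"

# The same two commands as above, with a local path in place of the URL.
git remote add origin "$work/hub.git"
git push -q -u origin master

# The hub now holds the same commit as the local repository.
local_head="$(git rev-parse HEAD)"
hub_head="$(git -C "$work/hub.git" rev-parse master)"
echo "local: $local_head"
echo "hub:   $hub_head"
```

The `-u` flag records origin/master as the upstream of the local branch, which is what later lets RStudio report whether you are ahead of or behind the hub.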

Go back to your GitHub account and check out your “hub”.

README.md is a Markdown file giving basic information about the repository.
There is a list of files, including a folder for data.
You are looking at a branch called master.
The commit history is available from the top bar.
The “Clone or download” button provides a URL.

In addition to being the center point for sharing commits with collaborators, GitHub is a rich platform for managing projects and inspecting the history.

Syncing Repos

Image by Atlassian / CC BY

The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.

A git repository is a network of commits, although the current network is a tree with just one branch. After a worker creates a clone, the local repo is in the same state as the origin.

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.

A push copies local commits to the origin and applies them remotely.
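
The clone, push and pull cycle can be tried offline with a bare repository as the origin and two local clones as spokes. This is a sketch under stated assumptions: the directory names `hub.git`, `alice` and `bob` and the two identities are hypothetical.

```shell
set -e
work="$(mktemp -d)"
cd "$work"

# Hub: a bare repository standing in for GitHub (assumption for offline use).
git init -q --bare hub.git
git -C hub.git symbolic-ref HEAD refs/heads/master

# Spoke 1 seeds the hub with a first commit.
git init -q alice
git -C alice symbolic-ref HEAD refs/heads/master
git -C alice config user.name "Alice"
git -C alice config user.email "alice@example.org"
echo "# Handouts" > alice/README.md
git -C alice add README.md
git -C alice commit -q -m "Initial commit"
git -C alice remote add origin "$work/hub.git"
git -C alice push -q -u origin master

# Spoke 2: cloning is the initial pull of the whole history.
git clone -q "$work/hub.git" bob

# Alice pushes a new commit to the origin...
echo "analysis notes" > alice/notes.txt
git -C alice add notes.txt
git -C alice commit -q -m "Add notes"
git -C alice push -q origin master

# ...and Bob pulls it, bringing his clone back in sync.
git -C bob pull -q --ff-only origin master

git -C bob log --oneline
```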

An essential component of the centralized workflow is the ability to merge commit histories that have diverged. Any fork in the history has to be re-integrated, and git does this automatically through merging.

The origin will not accept a push from a clone that lacks commits already on the origin; those commits must first be merged locally. In order to preserve integrity, the contributor is always responsible for overseeing the merge on a local clone.
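
A rejected push followed by a local merge can be reproduced offline. In this sketch a bare repository plays the origin, and the clone names, file names and the `commit_file` helper are all hypothetical. The two contributors touch different files, so git can merge without manual intervention.

```shell
set -e
work="$(mktemp -d)"
cd "$work"

# Hub plus a first spoke; a bare repository stands in for GitHub.
git init -q --bare hub.git
git -C hub.git symbolic-ref HEAD refs/heads/master
git init -q alice
git -C alice symbolic-ref HEAD refs/heads/master

# Helper (hypothetical): commit one new file in a given clone.
commit_file() {  # usage: commit_file <clone> <file> <message>
  echo "$3" > "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" -c user.name="$1" -c user.email="$1@example.org" commit -q -m "$3"
}

commit_file alice README.md "Initial commit"
git -C alice remote add origin "$work/hub.git"
git -C alice push -q -u origin master
git clone -q "$work/hub.git" bob

# The histories diverge: Alice pushes first, Bob commits without pulling.
commit_file alice a.txt "Alice's work"
git -C alice push -q origin master
commit_file bob b.txt "Bob's work"

# Bob's push is rejected because the origin has a commit he lacks.
if git -C bob push -q origin master 2>/dev/null; then
  first_push="accepted"
else
  first_push="rejected"
fi

# Bob merges on his local clone, then pushes the merged history.
git -C bob -c user.name=bob -c user.email=bob@example.org \
  pull -q --no-rebase --no-edit origin master
git -C bob push -q origin master

echo "first push: $first_push"
```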

Collaborators

Collaboration that goes beyond commenting on a final report—integrated work on a project from start to finish—raises workflow challenges.

Data, script, or report; who has the most up-to-date version?
Will a collaborator’s work overwrite your own?
How to recover a working version of a broken pipeline?

Centralized workflows, managed by git, help to answer these questions.

Project Integrity

The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
Diverging files are usually automatically merged by git.
Manual re-integration is aided by the ability to “checkout” the project at any commit.
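
As a sketch of the last point, the commands below check out an earlier commit to look at a working version of a file, then return to the tip of the branch. The file name and its contents are invented for illustration.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.org"

# Two commits: a working version, then a broken one.
echo "version 1" > analysis.txt
git add analysis.txt
git commit -q -m "Working pipeline"
echo "version 2 (broken)" > analysis.txt
git commit -q -a -m "Break the pipeline"

# Check out the project as it was one commit ago.
git checkout -q HEAD~1
old="$(cat analysis.txt)"

# Return to the latest commit on the branch.
git checkout -q master
new="$(cat analysis.txt)"

echo "previous commit: $old"
echo "latest commit:   $new"
```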

Version control software works well with text files. Large, non-text components of your project (e.g. very big or binary data files) can slow down any cloning, merging or branch switching. For that reason, data rarely live in a repository with code. Keeping data and code separate also facilitates data reuse, because the data are not tied to one pipeline.

Add a section where you can list collaborators to the README.md file. Our aim is to let your collaborators update this list with their own name, so only include yourself. You can use any text editor, and RStudio’s is quite handy.

## Collaborators

- Ian Carroll

Stage

Before you can commit changes involving a new file, you have to tell git which modifications you want to commit by staging.

Go to the “Git” tab in RStudio.
Select “Commit” to open the “Review Changes” window.
Check the “Staged” box next to “README.md” to add its modifications (hence the “M”).

Commit

Enter a brief (<50 chars) descriptive message about the commit.
Commit!
Close the “Review Changes” window.

Push

Look at the “Git” tab again and notice that your branch is “ahead of origin/master”. Push the commit to your GitHub repo.
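
The same “ahead” information is available on the command line. This sketch uses a bare repository as a stand-in for the GitHub repo and counts the commits that exist locally but not on origin/master, before and after a push. Paths and identity values are placeholders.

```shell
set -e
work="$(mktemp -d)"
git init -q --bare "$work/hub.git"
git -C "$work/hub.git" symbolic-ref HEAD refs/heads/master

git init -q "$work/handouts"
cd "$work/handouts"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.org"
echo "# Handouts" > README.md
git add README.md
git commit -q -m "Initial commit"
git remote add origin "$work/hub.git"
git push -q -u origin master

# A new local commit puts the branch one commit ahead of origin/master.
printf '\n## Collaborators\n\n- Ian Carroll\n' >> README.md
git commit -q -a -m "Add collaborators section"
ahead_before="$(git rev-list --count origin/master..HEAD)"
git status -sb | head -n 1     # first line reports the branch and "[ahead 1]"

# Pushing brings the origin level with the local branch again.
git push -q origin master
ahead_after="$(git rev-list --count origin/master..HEAD)"
echo "ahead before push: $ahead_before, after push: $ahead_after"
```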

GitHub Collaborators

Even on public GitHub repos, only the owner has “push access” by default. The owner can allow any other GitHub user to push by inviting collaborators under the settings tab.

Introduce yourself to your neighbor and assign the two roles below. Each of you should watch the other perform their assigned steps.

Owner: add your neighbor as a collaborator.
Collaborator: accept your neighbor’s emailed invitation.

Clone

Collaborator: Create a new RStudio project from “Version Control”, using any available project directory name.
Owner: Verify two commits by you in your repo history on GitHub.

Push & Pull

Collaborator: Add your name to the list in the “README.md”.
Collaborator: Stage, commit, and push your modifications.
Owner: Pull (the down arrow) to apply your neighbor’s commit.

Merge

You both realize it would have been good to include your affiliation along with your name. Do you need to circulate “README.md” to each collaborator in sequence for an update? No!

Owner AND Collaborator: edit your entry in the “README.md”
Owner AND Collaborator: stage, commit, & push.
Owner OR Collaborator: if you receive an error message, it tells you exactly what to do.

What about Data?

The scripts that execute your pipeline are plain text files, but the project may include other file types for figures, and possibly even some data sets.

Non-text files get little benefit from git, and have large costs.
Large data files should be un-tracked, or live elsewhere (as an “integration”).

External Data

The most common pipeline integration is shared data storage.

Local area network file share (e.g. Z:\\…)
Cloud storage (e.g. Dropbox, Google Drive)
Database (e.g. a PostgreSQL server)

Link to the Data

One good practice is to create “symbolic links” (a.k.a. shortcuts) to data files that live outside the project repo, so that your code still finds them when it looks inside the repo for data.

file.symlink(
  from = ...,
  to = 'data'
)
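
On Linux or macOS the same kind of link can be made from a terminal with `ln -s`. The sketch below is only an illustration: the `shared/survey` folder and the `plots.csv` file are hypothetical stand-ins for storage that lives outside your project.

```shell
set -e
work="$(mktemp -d)"

# Hypothetical shared storage location outside the repository.
mkdir -p "$work/shared/survey"
echo "id,value" > "$work/shared/survey/plots.csv"

# The project folder.
mkdir -p "$work/handouts"
cd "$work/handouts"

# Create a link named "data" inside the project that points at the shared folder.
ln -s "$work/shared/survey" data

# Code in the project can now use a relative path, as if the data were local.
head -n 1 data/plots.csv
```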

The shortcut works like a normal path to your data, so you could easily add all your data to a commit by accident when you run git add . in the project. To avoid this, tell git to “ignore” all files and folders below data/ by adding the following line to the project's .gitignore file.

/data/**

The leading / refers to the root of the git repository, not to the root of your filesystem.
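
Whether a pattern does what you expect can be checked with `git check-ignore`. This sketch uses an ordinary data/ directory rather than a symbolic link, and the file names are invented for illustration.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# A script to track and a data file that should stay out of the history.
mkdir -p data/raw
echo "id,value" > data/raw/plots.csv
echo "print('hello')" > analysis.R

# Ignore everything below data/ at the root of the repository.
echo '/data/**' > .gitignore

# Stage everything; ignored files are silently skipped.
git add .
git ls-files                                   # lists what was staged

# Ask git directly whether a path is ignored (exit status 0 means yes).
git check-ignore -q data/raw/plots.csv && echo "data/raw/plots.csv is ignored"
```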

A Plug for Reproducibility

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

The principle equally applies to modeling and data analysis.

Hallmarks of reproducible research

Reviewable	All details of the method used are easily accessible for peer review and community review.
Auditable	Records exist to document how the methods and conclusions evolved, but may be private.
Replicable	Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open	The originator grants permissions for reuse and extension of the research products.

Let your workflow help

Reviewable	Thoroughly comment scripts and share continuously with collaborators
Auditable	Maintain project history to correct mistakes when necessary
Replicable	Provide “one-click” file & data sharing of a streamlined analysis “pipeline”
Open	Publicly release on GitHub (or similar) with (implied) open licensing

Future Directions

Share your work for reuse and extension.
Make trying new analysis as easy as branching.
Contribute beyond your own projects.

The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only by your collaborators, but also available for review and extension by your research community.

Using advanced git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, on separate branches if necessary. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.
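
As a small taste of branching, the sketch below tries a new analysis on its own branch and merges it back once it looks good. The branch name `try-gam` and the one-line `model.R` file are invented for illustration.

```shell
set -e
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.org"

echo "model <- 'linear'" > model.R
git add model.R
git commit -q -m "Baseline model"

# Try a new analysis on its own branch; master is left untouched.
git checkout -q -b try-gam
echo "model <- 'gam'" > model.R
git commit -q -a -m "Try a GAM"

# Back on master, the baseline is still in place.
git checkout -q master
on_master="$(cat model.R)"

# Happy with the experiment? Merge the branch into master.
git merge -q try-gam
merged="$(cat model.R)"

echo "before merge: $on_master"
echo "after merge:  $merged"
```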

The latest software for modeling and analysis in your research field may already be on GitHub. Build better pipelines by contributing bug reports or even pull requests to projects integral to your own work.
