Collaborative Workflows & Reproducible Pipelines

Lesson 2 with Ian Carroll

Contents


Objectives for this lesson

Specific achievements

Overview

As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.

This lesson gives a high-level strategy for organizing your collaborative workflow and introduces accompanying software and cloud solutions.



Centralized Workflow

One strategy for distributed work among a team of scientists—the centralized workflow—dominates collaborative research.

A central hub stores project files and their history. Researchers are spokes on the wheel, working on private copies of the project. Project integrity is maintained by rules, enforced at the hub, for synchronizing between hub and spokes.



A Plug for Reproducible Research

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

Does the same principle apply to modeling and data analysis? You bet!

Hallmarks of reproducible research:

Reviewable: All details of the method used are easily accessible for peer review and community review.
Auditable: Records exist to document how the methods and conclusions evolved, but may be private.
Replicable: Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open: The originator grants permissions for reuse and extension of the research products.

Let your workflow help achieve these same goals:

Thoroughly comment scripts and share continuously with collaborators (Reviewable)
Maintain project history to correct mistakes when necessary (Auditable)
Provide “one-click” sharing of the files and data in a streamlined analysis “pipeline” (Replicable)
Publicly release on GitHub (or similar) with (implied) open licensing (Open)



Reproducible research: the end result

A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science who publishes clear & reproducible pipelines. The work leading up to the paper “Pretty Darn Good Control” is on GitHub, a website integrated with the git version control system.

Integrating git with cloud services like GitHub creates a complete system for project management, collaboration and sharing.

Using collaborative workflows …


Credit: Philip Guo

… to construct a single pipeline


Credit: Philip Guo



What’s a GitHub? What’s a “repo”?

Open up the repository that provides the “handouts” for this workshop.

Centralized Workflow


Image by Atlassian / CC BY

The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.


Image by Atlassian / CC BY

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is in the same place as the origin.


Image by Atlassian / CC BY

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.


Image by Atlassian / CC BY



Image by Atlassian / CC BY

A push copies local commits to the origin and applies them remotely.


Image by Atlassian / CC BY



Image by Atlassian / CC BY
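
The clone, push, and pull cycle described above can be tried out on a single machine. The following is a minimal sketch rather than part of the lesson's worksheet: a bare repository in a temporary directory stands in for the origin on GitHub, and the names `alice`, `bob`, and `notes.txt` are hypothetical.

```shell
# A bare repository in a temporary directory stands in for the origin on GitHub.
# All names here (alice, bob, notes.txt) are hypothetical.
tmp=$(mktemp -d)
# One placeholder identity is used for every commit, to keep the sketch short.
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.org"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.org"

# Seed a project with one commit, then publish it as the central "origin".
git init -q "$tmp/seed"
echo "first draft" > "$tmp/seed/notes.txt"
git -C "$tmp/seed" add notes.txt
git -C "$tmp/seed" commit -q -m "Start project notes"
git clone -q --bare "$tmp/seed" "$tmp/origin.git"

# Each teammate clones: the initial pull of the project and all its history.
git clone -q "$tmp/origin.git" "$tmp/alice"
git clone -q "$tmp/origin.git" "$tmp/bob"

# Alice commits locally, then pushes the commit to the origin.
echo "second draft" >> "$tmp/alice/notes.txt"
git -C "$tmp/alice" commit -q -a -m "Revise project notes"
git -C "$tmp/alice" push -q origin HEAD

# Bob pulls, which applies Alice's commit to his local repository.
git -C "$tmp/bob" pull -q --ff-only

# Both clones now point at the same commit.
alice_head=$(git -C "$tmp/alice" rev-parse HEAD)
bob_head=$(git -C "$tmp/bob" rev-parse HEAD)
echo "alice: $alice_head"
echo "bob:   $bob_head"
```

The two printed commit identifiers should match, which is what "syncing up" means in practice: after the push and the pull, the origin and both clones hold the same history.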



Get Started with GitHub

Sign in or create a GitHub account.

Create a GitHub repository

Create a new repository on your GitHub page.

  1. Name the repository “handouts”.
  2. Add a short “tag line” for your Summer Institute experience.
  3. Leave all the boxes (including the “README”) unchecked.

Empty repository

You have created an empty repository. The quick start information emphasizes that you’ll need the URL beginning with “git@github…”. We’ll get to that in a minute.



RStudio Projects

RStudio is an example of an integrated development environment (IDE) and focuses on:

  1. Editing scripts written in the R language.
  2. Running R language commands or programs in the R interpreter.
  3. Helping to manage many components of a collaborative project using version control.

Integration with git

RStudio provides convenient access to the core tools provided by git, so any project can also be a repository. Under the File menu, create a new project from a remote version control repository.

Configure git

Every commit has an author. For GitHub to attribute commits to your account, configure git with your GitHub username and associated e-mail address.

## Configure git

git config --global user.name ...
git config --global user.email ...
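
To check what git has recorded, the values can be read back with `git config --get`. The sketch below is illustrative only: it writes repository-level settings (no `--global`) into a scratch repository so that your real configuration is untouched, and the name and e-mail address are hypothetical placeholders.

```shell
# Hypothetical name and e-mail; substitute your own GitHub details.
# Settings are written to a scratch repository (no --global), so your
# real configuration is left untouched.
scratch=$(mktemp -d)
git init -q "$scratch"
git -C "$scratch" config user.name "Jane Researcher"
git -C "$scratch" config user.email "jane@example.org"

# Read the values back to confirm what git will record as the commit author.
name=$(git -C "$scratch" config --get user.name)
email=$(git -C "$scratch" config --get user.email)
echo "$name <$email>"
```

Repository-level settings take precedence over global ones, so the same two `--get` commands are a quick way to see which identity a given project will use.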

Configure your clone

The “handouts” repository is currently linked—via URL—to the “hub” you cloned from SESYNC’s repositories on GitHub. To transfer the repository to the newly created repository owned by you, set the URL to the one provided and push all the things.

## Change the "origin" remote URL and push

git remote set-url origin ...
git push --all
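
The effect of changing the remote URL and pushing can be rehearsed with local directories before touching GitHub. In this sketch, which is an assumption-laden illustration and not the worksheet itself, `hub.git` plays the repository you cloned from and `mine.git` plays your new, empty repository; every path and name is hypothetical.

```shell
# Local stand-ins: "hub.git" plays the repository you cloned from and
# "mine.git" plays your new, empty GitHub repository. Paths are hypothetical.
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME="Jane Researcher" GIT_AUTHOR_EMAIL="jane@example.org"
export GIT_COMMITTER_NAME="Jane Researcher" GIT_COMMITTER_EMAIL="jane@example.org"

# Build a hub with one commit, plus an empty repository to receive the project.
git init -q "$tmp/seed"
echo "# handouts" > "$tmp/seed/README.md"
git -C "$tmp/seed" add README.md
git -C "$tmp/seed" commit -q -m "Add README"
git clone -q --bare "$tmp/seed" "$tmp/hub.git"
git init -q --bare "$tmp/mine.git"

# Clone from the hub; "origin" initially points back at it.
git clone -q "$tmp/hub.git" "$tmp/handouts"
cd "$tmp/handouts"
git remote get-url origin

# Re-point "origin" at the empty repository and push every local branch.
git remote set-url origin "$tmp/mine.git"
git push -q --all
new_url=$(git remote get-url origin)
echo "$new_url"
```

After the push, the previously empty repository holds the full history, and later pushes and pulls in this clone go to the new origin rather than the hub.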

Save your worksheet-2.sh and select “Run Script” to execute these shell commands.

Files under version control

A scripted pipeline relies heavily on plain text files (the scripts), but may include different file types for figures or data. Any file in this directory that is under version control is monitored for differences from the committed state of the project. Files must be added to at least one commit before they are tracked.
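
The difference between tracked and untracked files shows up in `git status`. The following sketch uses a scratch repository with hypothetical file names (`pipeline.sh`, `data.csv`) to illustrate the idea; it is not one of the lesson's worksheets.

```shell
# A scratch repository with hypothetical file names.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo 'echo "step 1"' > pipeline.sh
echo "x,y" > data.csv

# Neither file is tracked yet: status marks both with "??".
git status --porcelain

# Add and commit only the script; data.csv remains untracked.
git add pipeline.sh
git -c user.name="Jane Researcher" -c user.email="jane@example.org" \
    commit -q -m "Add pipeline script"
tracked=$(git ls-files)

# Edit the tracked file: git now reports it as modified (" M"),
# while data.csv is still listed as untracked ("??").
echo 'echo "step 2"' >> pipeline.sh
status=$(git status --porcelain)
echo "$status"
```

Only `pipeline.sh` is monitored for differences here, because it is the only file that has been part of a commit.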

Commit & push

The first changes you made to the handouts repository are your edits to worksheet-2.sh. You have saved them, but you haven’t committed them to the repository.

  1. On the git tab in RStudio, select commit
  2. Check the modifications to “Stage”
  3. Add a commit message
  4. Commit
  5. Push
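
For readers who prefer the command line, the steps above correspond roughly to `git add`, `git commit`, and `git push`. This is a sketch under stated assumptions: a local bare repository stands in for your GitHub origin, and the contents written to worksheet-2.sh are placeholders.

```shell
# Local stand-in for the GitHub origin; file contents are placeholders.
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME="Jane Researcher" GIT_AUTHOR_EMAIL="jane@example.org"
export GIT_COMMITTER_NAME="Jane Researcher" GIT_COMMITTER_EMAIL="jane@example.org"
git init -q "$tmp/seed"
echo "## Configure git" > "$tmp/seed/worksheet-2.sh"
git -C "$tmp/seed" add worksheet-2.sh
git -C "$tmp/seed" commit -q -m "Add worksheet"
git clone -q --bare "$tmp/seed" "$tmp/origin.git"
git clone -q "$tmp/origin.git" "$tmp/handouts"
cd "$tmp/handouts"

# Edit and save: the change exists on disk but is not yet committed.
echo "# my edits" >> worksheet-2.sh

# Steps 1-2: stage the modification and check what is staged.
git add worksheet-2.sh
staged=$(git diff --cached --name-only)

# Steps 3-4: commit with a message.
git commit -q -m "Fill in worksheet 2"

# Step 5: push the new commit to the origin.
git push -q origin HEAD
echo "staged: $staged"
git log --oneline
```

Staging is what lets a commit include some saved changes and leave others for later; only files listed by `git diff --cached` go into the next commit.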



Create a new file

Create a new text file in the RStudio editor as below, adding yourself as the first collaborator.

## Project Collaborators

- ...
- My neighbor!

In the final part of the lesson, we’ll have a project collaborator replace “My neighbor!” with his or her name.

Track it with git

Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) to watch it.

  1. Go to the git tab in RStudio
  2. Select commit to open the “Review Changes” window
  3. Select “Staged” to add (hence “A”) the new file.
  4. Enter a descriptive message about the commit.
  5. Commit!

Push

Look at the git tab again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
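
The "A" status and the "ahead of origin" notice can both be seen from the command line. The sketch below assumes a local bare repository as a stand-in for your GitHub repo and uses a hypothetical collaborator name; it illustrates the idea rather than reproducing RStudio's exact behavior.

```shell
# Local stand-in for your GitHub repo; the collaborator name is hypothetical.
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME="Jane Researcher" GIT_AUTHOR_EMAIL="jane@example.org"
export GIT_COMMITTER_NAME="Jane Researcher" GIT_COMMITTER_EMAIL="jane@example.org"
git init -q "$tmp/seed"
echo "# handouts" > "$tmp/seed/README.md"
git -C "$tmp/seed" add README.md
git -C "$tmp/seed" commit -q -m "Add README"
git clone -q --bare "$tmp/seed" "$tmp/origin.git"
git clone -q "$tmp/origin.git" "$tmp/handouts"
cd "$tmp/handouts"

# Create the new file and tell git to watch it; status then shows "A".
printf '## Project Collaborators\n\n- Jane Researcher\n- My neighbor!\n' > collaborators.md
git add collaborators.md
added=$(git status --porcelain)
echo "$added"

# Commit locally: the branch is now one commit ahead of its upstream.
git commit -q -m "Add list of project collaborators"
ahead_before=$(git rev-list --count '@{upstream}..HEAD')
git status -sb | head -n 1

# Push: the origin catches up and the branch is no longer ahead.
git push -q
ahead_after=$(git rev-list --count '@{upstream}..HEAD')
echo "ahead before push: $ahead_before, after push: $ahead_after"
```

The count of commits between the upstream branch and `HEAD` is the number that the "ahead of" notice reports; it drops back to zero once the push succeeds.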

Exercise 1

Modify the collaborators.md file again to add a third “TBD” collaborator, and push the modification to the origin as one commit.



Working with Collaborators

True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.

Centralized workflows, managed by git, help answer these questions.

Project Integrity

Note that version control works really well with text. Non-textual components of your project (e.g. large or binary data) need advanced treatment.

The first step to collaborative workflows is granting access to the origin of your project. Introduce yourself to your neighbor, and ask for their GitHub username.

Add your neighbor as a collaborator, and accept your neighbor’s invitation to collaborate!

Editing on GitHub

Go to your neighbor’s repository on GitHub, and open collaborators.md. The text below shows “My neighbor!” where you should see your neighbor’s name. Edit the file in your neighbor’s repo by replacing the remaining “My neighbor!” with your own name.

## Project Collaborators

- My neighbor!
- ...
- TBD

Always write a meaningful commit message when you save!

Exercise 2

Create a new RStudio project from your neighbor’s repository. Note that the name you choose during project creation in RStudio does not have to be “handouts”; that is, it does not have to match the name of the repository on GitHub. Make further changes to the collaborators.md file, then commit & push.



Where to from here?
