Collaborative & Reproducible Workflows

Instructor: Ian Carroll

As your research project moves from conception, through data collection and analysis, to reporting and other forms of dissemination, the many components can fracture, lose their development history, and – worst of all – become conflicted.

This lesson gives a high-level overview of workflows to organize your project and introduces an accompanying software solution, git.



Workflows


Credit: Philip Guo



Distributed Workflows

A single collaboration model, the centralized workflow, dominates collaborative research. There is one central hub, and each researcher is a node that synchronizes their work with it.



Objectives for this lesson

Specific achievements



Key Challenges



What is reproducible research?

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

Does the same principle apply to data analysis? You bet!

Hallmarks of reproducible research

Reviewable: All details of the method used are easily accessible for peer review and community review.
Auditable: Records exist to document how the methods and conclusions evolved, but may be private.
Replicable: Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open: The originator grants permissions for reuse and extension of the research products.

Striving towards these goals has practical benefits, touching on many of the challenges just identified.



Create a GitHub Repository

Create a new “test” repository at https://github.com/%username%, initializing the repo with a “README.md”.

What’s in a “repo”?

Centralized Workflow


Image by Atlassian / CC BY

The origin is the central repository; in this case, it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.


Image by Atlassian / CC BY

A worker pulls contributions from other teammates from the origin when ready, and she pushes updates to the origin when ready to share her own work.

A commit is a unit of work: any collection of changes to files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. Right after a worker creates a clone, the local copy points at the same commit as the origin.
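
To see that a fresh clone starts out at the same commit as its origin, here is a minimal sketch that builds a throwaway origin on the local filesystem instead of on GitHub. The directory names (seed, origin.git, clone) and the demo identity are assumptions made for illustration only.

```shell
# Throwaway sandbox; every name below is a placeholder for illustration.
work="$(mktemp -d)"

# Build a one-commit project, then copy it into a bare "origin".
git init -q "$work/seed"
git -C "$work/seed" config user.name "Demo User"
git -C "$work/seed" config user.email "demo@example.org"
echo "# Welcome to My Project" > "$work/seed/README.md"
git -C "$work/seed" add README.md
git -C "$work/seed" commit -q -m "Add README.md"
git clone -q --bare "$work/seed" "$work/origin.git"

# Clone the origin, as a teammate would.
git clone -q "$work/origin.git" "$work/clone"

# Ask each repository which commit its HEAD points at.
origin_head="$(git -C "$work/origin.git" rev-parse HEAD)"
clone_head="$(git -C "$work/clone" rev-parse HEAD)"
echo "origin: $origin_head"
echo "clone:  $clone_head"
```

The two hashes printed at the end should be identical, which is what "in the same place" means for a new clone.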


Image by Atlassian / CC BY

A pull applies commits copied from the origin to your local repo, syncing them up.


Image by Atlassian / CC BY

A push copies local commits to the origin and applies them remotely.
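
The push and pull steps can be tried without GitHub. The sketch below assumes a bare repository on the local filesystem standing in for the origin, and two hypothetical teammates, "alice" and "bob", each with their own clone.

```shell
# Throwaway sandbox: a local bare repository plays the role of the origin.
work="$(mktemp -d)"
git init -q "$work/seed"
git -C "$work/seed" config user.name "Demo User"
git -C "$work/seed" config user.email "demo@example.org"
echo "# Welcome to My Project" > "$work/seed/README.md"
git -C "$work/seed" add README.md
git -C "$work/seed" commit -q -m "Add README.md"
git clone -q --bare "$work/seed" "$work/origin.git"

# Two hypothetical teammates each clone the origin.
for who in alice bob; do
  git clone -q "$work/origin.git" "$work/$who"
  git -C "$work/$who" config user.name "$who"
  git -C "$work/$who" config user.email "$who@example.org"
done

# Alice commits locally, then pushes the commit to the origin.
echo "+ nothing" >> "$work/alice/README.md"
git -C "$work/alice" commit -q -am "Add first bullet to README.md"
git -C "$work/alice" push -q

# Bob pulls: the commit is copied from the origin and applied to his clone.
git -C "$work/bob" pull -q

alice_head="$(git -C "$work/alice" rev-parse HEAD)"
bob_head="$(git -C "$work/bob" rev-parse HEAD)"
echo "alice: $alice_head"
echo "bob:   $bob_head"
```

After the pull, both clones should report the same commit, and Bob's README.md contains Alice's bullet.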


Image by Atlassian / CC BY



Let’s git Going!

The namesake of GitHub is the command-line utility git. It performs the clone, push and pull procedures just described, and many more. Let’s begin by doing some basic configuration.

git config --global user.name "%name%"
git config --global user.email %email%
git config --global push.default simple
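
To confirm what was recorded, the settings can be read back. This is a sketch only: it points HOME at a throwaway directory so it does not touch a real ~/.gitconfig, and "Example Name" and "name@example.org" are placeholder values standing in for %name% and %email%.

```shell
# A throwaway HOME keeps this sketch away from your real ~/.gitconfig.
scratch_home="$(mktemp -d)"
touch "$scratch_home/.gitconfig"   # ensure writes land in this file

# Placeholder identity, standing in for %name% and %email%.
HOME="$scratch_home" git config --global user.name "Example Name"
HOME="$scratch_home" git config --global user.email "name@example.org"
HOME="$scratch_home" git config --global push.default simple

# Read one value back, then list everything stored at the global level.
configured_name="$(HOME="$scratch_home" git config --global --get user.name)"
echo "user.name = $configured_name"
HOME="$scratch_home" git config --global --list
```

On your own machine, drop the HOME prefix and run git config --global --list directly to see the values you just set.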

Clone your repository

cd %sandbox%
git clone https://github.com/%username%/test.git
Cloning into 'test'...
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
Checking connectivity... done.


Hint: copy your repo URL from the repository’s page on GitHub.

cd test
git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

Edit README.md

# Welcome to My Project

This project includes the following:
+ nothing
+ zero
+ nada

Now, check the result of making changes to your repo.

git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")
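
The status names the modified file but not the modification. To see the changed lines before staging anything, git diff compares the working directory with what git has recorded. A minimal sketch, assuming a throwaway repository whose path, identity, and file contents are placeholders:

```shell
# Throwaway repository with one committed file.
work="$(mktemp -d)"
git init -q "$work/test"
git -C "$work/test" config user.name "Demo User"
git -C "$work/test" config user.email "demo@example.org"
echo "# test" > "$work/test/README.md"
git -C "$work/test" add README.md
git -C "$work/test" commit -q -m "Add README.md"

# Edit the tracked file without staging the change.
echo "# Welcome to My Project" > "$work/test/README.md"

# Lines starting with "-" were removed; lines starting with "+" were added.
diff_text="$(git -C "$work/test" --no-pager diff)"
printf '%s\n' "$diff_text"
```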

Commit your changes with a descriptive but short commit message.

git add .
git commit -m "embellish README.md"
[master %hash%] embellish README.md
 1 file changed, 8 insertions(+), 1 deletion(-)

Push your commit back to origin.

git push
Username for 'https://github.com': %username%
Password for 'https://%username%@github.com': 
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/%username%/test.git
   %hash%..%hash%  master -> master

Now go check out your README.md on GitHub!



Working with Collaborators

True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges. Be it data, a script, or a write-up, who has the most up-to-date version? Will a teammate’s work overwrite any of your own? How do I recover the working version of code the PI broke?

A centralized workflow, managed by git, helps answer these questions.

What’s the catch? It works best with text: git can store any file, but it can only merge and show line-by-line changes for plain text. We’ll address non-textual components of your project later on.

The first step to collaborative workflows is granting access to the origin of your project.

Introduce yourself to your neighbour, and ask for his/her GitHub username.

Add your neighbour as a collaborator, and clone his/her test repo.

cd %sandbox%
git clone https://github.com/%not-my-username%/test.git not-my-test
cd not-my-test

Edit the README.md from your neighbour’s repo, by adding a fourth bullet point.

# Welcome to My Project

This project includes the following:
+ nothing
+ zero
+ nada
+ %your bullet point%

Now do a commit & push. Note that we have tucked the “add” step into the commit with the “-a” option.

git commit -am 'Add important bullet to README.md'
[master %hash%] Add important bullet to README.md
 1 file changed, 1 insertion(+), 1 deletion(-)
git push
Username for 'https://github.com': %username%
Password for 'https://%username%@github.com': 
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/%not-my-username%/test.git
   %hash%..%hash%  master -> master
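
One caveat with the “-a” shortcut: it stages changes to files git already tracks, but it does not pick up brand-new files. The sketch below illustrates this in a throwaway repository; the path, identity, and the file notes.txt are hypothetical.

```shell
# Throwaway repository with one tracked file.
work="$(mktemp -d)"
git init -q "$work/test"
git -C "$work/test" config user.name "Demo User"
git -C "$work/test" config user.email "demo@example.org"
echo "# Welcome to My Project" > "$work/test/README.md"
git -C "$work/test" add README.md
git -C "$work/test" commit -q -m "Add README.md"

# Modify the tracked file and create a brand-new, untracked one.
echo "+ one more bullet" >> "$work/test/README.md"
echo "scratch notes" > "$work/test/notes.txt"   # hypothetical new file

# -a stages the tracked README.md only; notes.txt is left out.
git -C "$work/test" commit -q -am "Add bullet to README.md"
untracked="$(git -C "$work/test" status --short)"
echo "$untracked"   # prints: ?? notes.txt

# A new file needs an explicit "git add" before it can be committed.
git -C "$work/test" add notes.txt
git -C "$work/test" commit -q -m "Add notes.txt"
```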

The origin for your local “test” repo now has a commit you don’t have locally – the one your collaborator pushed. Let’s compound the problem by adding a local commit that your origin doesn’t have. Change the title of your project in README.md in your local test repo.

# Welcome to My *Amazing* Project

This project includes the following:
+ nothing
+ zero
+ nada

Try the usual routine: commit & push.

cd %sandbox%
cd test
git commit -am 'amazing'
[master %hash%] amazing
 1 file changed, 1 insertion(+), 1 deletion(-)
git push
To https://github.com/%username%/test.git
 ! [rejected]        master -> master (fetch first)
error: failed to push some refs to 'https://github.com/%username%/test.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Take a moment to read the message – it gives a good explanation of what just happened.

Take the Hint!

git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/%username%/test
   15bc488..26c2dcd  master     -> origin/master
Auto-merging README.md
Merge made by the 'recursive' strategy.
 README.md | 1 +
 1 file changed, 1 insertion(+)

The message tells you about any changes made by this merge commit, which seamlessly integrates changes to the same file by multiple authors.
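
The whole reject, pull, and push cycle can be rehearsed offline. This is a sketch under stated assumptions: a local bare repository stands in for the GitHub origin, and "alice" and "bob" are hypothetical collaborators. It passes --no-rebase to git pull to ask for a merge explicitly, because recent versions of git may refuse a plain git pull on diverged branches until a reconciliation strategy is chosen.

```shell
# Throwaway sandbox: a local bare repository stands in for the GitHub origin.
work="$(mktemp -d)"
git init -q "$work/seed"
git -C "$work/seed" config user.name "Demo User"
git -C "$work/seed" config user.email "demo@example.org"
printf '%s\n' "# Welcome to My Project" "" \
  "This project includes the following:" "+ nothing" "+ zero" "+ nada" \
  > "$work/seed/README.md"
git -C "$work/seed" add README.md
git -C "$work/seed" commit -q -m "Add README.md"
git clone -q --bare "$work/seed" "$work/origin.git"
for who in alice bob; do
  git clone -q "$work/origin.git" "$work/$who"
  git -C "$work/$who" config user.name "$who"
  git -C "$work/$who" config user.email "$who@example.org"
done

# Bob adds a bullet at the bottom and pushes first.
echo "+ a fourth bullet" >> "$work/bob/README.md"
git -C "$work/bob" commit -q -am "Add fourth bullet to README.md"
git -C "$work/bob" push -q

# Alice, unaware of Bob's push, edits the title at the top and commits.
printf '%s\n' "# Welcome to My Amazing Project" "" \
  "This project includes the following:" "+ nothing" "+ zero" "+ nada" \
  > "$work/alice/README.md"
git -C "$work/alice" commit -q -am "Make README.md title more exciting"

# Her push is rejected because the origin holds a commit she lacks.
# (The error text is silenced here; only the outcome is recorded.)
if git -C "$work/alice" push -q 2>/dev/null; then
  push_result="accepted"
else
  push_result="rejected"
fi
echo "first push: $push_result"

# Pull with an explicit merge, then push the merged history.
git -C "$work/alice" pull -q --no-rebase --no-edit
git -C "$work/alice" push -q
cat "$work/alice/README.md"
```

Because the two edits touch different parts of README.md, the merge needs no manual intervention, and the final file carries both the new title and the fourth bullet.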

Merge Commits


Image by Atlassian / CC BY

Check Out the Log

Version control gives you access to the state of the repository at any commit. To view the history, look at the log.

cd %sandbox%
cd test
git log
commit 0517b3b2258e6cce76770646f175dc8abfe9e148
Merge: 8612809 26c2dcd
Author: Ian Carroll <icarroll@sesync.org>
Date:   Tue Jul 26 14:53:22 2016 -0400

    Merge branch 'master' of https://github.com/itcarroll/test
	
commit 8612809b6eeea263a853783cf4c37a6862a31d22
Author: Ian Carroll <icarroll@sesync.org>
Date:   Tue Jul 26 13:48:57 2016 -0400

    amazing

See how helpful concise & descriptive commit messages would be?

Let’s investigate a commit we are not so sure about.

git show 8612
commit 8612809b6eeea263a853783cf4c37a6862a31d22
Author: Ian Carroll <icarroll@sesync.org>
Date:   Tue Jul 26 13:48:57 2016 -0400

    amazing
	
diff --git a/README.md b/README.md
index 521cb5d..24a865d 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Welcome to My Project
+# Welcome to My *Amazing* Project
 
 This project includes the following:
 + nothing
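
If that change turns out to be unwanted, revert it. The revert is recorded as a new commit, so the history stays intact.

git revert --no-edit 8612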
[master b0aaef0] Revert "amazing"
 1 file changed, 1 insertion(+), 1 deletion(-)
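
A revert does not erase the unwanted commit; it adds a new commit that applies the inverse change. The sketch below shows this in a throwaway repository, where the path, identity, and file contents are placeholders chosen for illustration.

```shell
# Throwaway repository with a good commit followed by a questionable one.
work="$(mktemp -d)"
git init -q "$work/test"
git -C "$work/test" config user.name "Demo User"
git -C "$work/test" config user.email "demo@example.org"

echo "# Welcome to My Project" > "$work/test/README.md"
git -C "$work/test" add README.md
git -C "$work/test" commit -q -m "Add README.md"

echo "# Welcome to My Amazing Project" > "$work/test/README.md"
git -C "$work/test" commit -q -am "amazing"
suspect="$(git -C "$work/test" rev-parse HEAD)"

# Revert adds a new commit that applies the inverse of the suspect commit.
git -C "$work/test" revert --no-edit "$suspect"

# All three commits remain in the log; the file is back to its first version.
git -C "$work/test" --no-pager log --oneline
commit_count="$(git -C "$work/test" rev-list --count HEAD)"
cat "$work/test/README.md"
```

Keeping the reverted commit in the log is what makes the trail auditable: anyone can see what was tried and when it was undone.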



Summary

The test repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) makes it not only accessible to your collaborators but also available for review and extension by your research community.

Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the log.

In a hub-and-spokes workflow, sharing project files reduces to a streamlined cloning process, and that includes managing multiple “copies” during development or at public release. The origin always has the “most recent” version of any document: conflicts must be resolved in the local clone before new commits can be shared.
