Collaborative & Reproducible Workflows
Instructor: Ian Carroll
As your research project moves from conception, through data collection and analysis, to reporting and other forms of dissemination, the many components can fracture, lose their development history, and – worst of all – become conflicted.
This lesson gives a high-level overview of workflows to organize your project and introduces an accompanying software solution, git.
Workflows
Credit: Philip Guo
Distributed Workflows
A single collaboration model – the centralized workflow – dominates collaborative research. There is a central hub, and everyone synchronizes their work to it. A number of researchers are nodes – consumers of that hub – and synchronize to that one place.
Objectives for this lesson
- Review key challenges for collaborative projects
- Identify attributes of reproducible research
- Learn about a framework for distributed workflows
Specific achievements
- Create a repository on GitHub
- Manage repositories using git
- Publish your changes to the README.md file
- Work with a collaborator on GitHub
Key Challenges
- Share files & data
- Work in parallel
- Keep all participants up-to-date
- Avoid duplicating effort
- Check or repeat collaborator’s work
- Save & recover previous versions
- Others?
What is reproducible research?
Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.
Does the same principle apply to data analysis? You bet!
Hallmarks of reproducible research
Reviewable | All details of the method used are easily accessible for peer review and community review. |
Auditable | Records exist to document how the methods and conclusions evolved, but may be private. |
Replicable | Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork. |
Open | The originator grants permissions for reuse and extension of the research products. |
Striving towards these goals has practical benefits, touching on many of the challenges just identified.
- Reviewable ⬄ write-ups and thoroughly-commented scripts shared with collaborators
- Auditable ⬄ versioned work, ability to revert mistakes
- Replicable ⬄ “one-click” file & data sharing
- Open ⬄ GitHub (or similar) based centralized workflow
Create a GitHub Repository
Create a new “test” repository at https://github.com/%username%, initializing the repo with a “README.md”.
What’s in a “repo”?
- Note that you are looking at a branch called master.
- There is one commit.
- README.md is a Markdown file.
Centralized Workflow
The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.
A worker pulls contributions from other teammates from the origin when ready, and she pushes updates to the origin when ready to share her own work.
A commit is a unit of work: any collection of changes to files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is at the same commit as the origin.
A pull applies commits copied from the origin to your local repo, syncing them up.
A push copies local commits to the origin and applies them remotely.
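The clone, push and pull cycle can be tried without GitHub. The following is a minimal sketch, assuming a bare repository on the local disk stands in for the origin; the directory names (seed, origin.git, alice, bob) and the author identities are hypothetical.

```shell
set -e
# Scratch directory so nothing touches a real project (hypothetical layout).
work=$(mktemp -d)
cd "$work"

# Build a stand-in for the GitHub origin: a bare repository with one commit.
git init -q seed
cd seed
git config user.name "Seed Author"
git config user.email "seed@example.org"
echo "# test" > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed origin.git

# Two workers each take a clone of the origin.
git clone -q origin.git alice
git clone -q origin.git bob

# Alice commits locally, then pushes her commit to the origin.
cd alice
git config user.name "Alice"
git config user.email "alice@example.org"
echo "+ a line from Alice" >> README.md
git commit -q -am "Add a line to README.md"
git push -q origin HEAD
cd ..

# Bob pulls: the commit is copied from the origin and applied to his clone.
cd bob
git pull -q
cat README.md
```

After the pull, Bob's README.md contains the line Alice pushed, even though the two clones never communicated directly.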
Let’s git Going!
The namesake of GitHub is the command-line utility git. It performs the clone, push and pull procedures just described, and many more. Let’s begin by doing some basic configuration.
git config --global user.name "%name%"
git config --global user.email %email%
git config --global push.default simple
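To confirm what was stored, the settings can be read back. The sketch below assumes you would rather experiment without touching your real configuration, so it points HOME at a throwaway directory first; that redirection and the example name and email are assumptions for the demo, not part of the lesson.

```shell
set -e
# Use a throwaway HOME so the real ~/.gitconfig is left untouched.
# (Assumption for this demo only; skip these three lines to check your own settings.)
HOME=$(mktemp -d)
export HOME
unset XDG_CONFIG_HOME

git config --global user.name "Example Researcher"
git config --global user.email "researcher@example.org"
git config --global push.default simple

# Read one value back, then list everything stored at the global level.
name=$(git config --global --get user.name)
echo "user.name is: $name"
git config --global --list
```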
Clone your repository
cd %sandbox%
git clone https://github.com/%username%/test.git
Cloning into 'test'...
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
Checking connectivity... done.
Hint: copy your repo URL from the repository’s page on GitHub.
cd test
git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
Edit README.md
- Open README.md in any text editor (e.g. notepad or emacs)
- Add any information the public should see when viewing your repository.
- Use Markdown syntax for very basic formatting.
# Welcome to My Project
This project includes the following:
+ nothing
+ zero
+ nada
Now, check the result of making changes to your repo.
git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
no changes added to commit (use "git add" and/or "git commit -a")
Commit your changes with a descriptive but short commit message.
git add .
git commit -m "embellish README.md"
[master %hash%] embellish README.md
1 file changed, 8 insertions(+), 1 deletion(-)
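The edit, add, commit sequence moves a change through three states. A minimal sketch, assuming a scratch repository created just for the demonstration (the author identity is a placeholder), uses the machine-readable form of git status to show each state:

```shell
set -e
repo=$(mktemp -d)   # scratch repository, unrelated to your "test" repo
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"
echo "# test" > README.md
git add README.md
git commit -q -m "Initial commit"

# Edit the file: the change exists only in the working directory.
echo "+ nothing" >> README.md
before=$(git status --porcelain)   # " M README.md": modified, not staged

# Stage the change, then record it as a commit.
git add README.md
staged=$(git status --porcelain)   # "M  README.md": modified, staged
git commit -q -m "Add first bullet to README.md"
after=$(git status --porcelain)    # empty: nothing left to commit

printf 'before add:   [%s]\nafter add:    [%s]\nafter commit: [%s]\n' \
  "$before" "$staged" "$after"
```

The position of the M shifts from the second column (working directory) to the first (staging area), and the listing is empty once the commit is made.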
Push your commit back to origin.
git push
Username for 'https://github.com': %username%
Password for 'https://%username%@github.com':
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/itcarroll/test.git
%hash%..%hash% master -> master
Now go check out your README.md on GitHub!
Working with Collaborators
True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges. Be it data, a script, or a write-up, who has the most up-to-date version? Will a teammate’s work overwrite any of your own? How do I recover the working version of code the PI broke?
A centralized workflow, managed by git, helps answer these questions.
- The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
- Diverging workflows are reintegrated with a merge.
- The complete project history is available to checkout.
What’s the catch? Merging and line-by-line history only work well with text. We’ll address non-textual components of your project later on.
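As one illustration of the last point in the list above, here is a minimal sketch of recovering a working version of a file after someone broke it. It assumes a scratch repository and a hypothetical file named analysis.txt; neither appears in the lesson's "test" repo.

```shell
set -e
repo=$(mktemp -d)   # scratch repository for the demonstration
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# One commit with a working file, then one that breaks it.
echo "working analysis" > analysis.txt   # hypothetical file name
git add analysis.txt
git commit -q -m "Add working analysis"
echo "broken analysis" > analysis.txt
git commit -q -am "Rework analysis"

# View the file as it was one commit ago, without changing anything.
old=$(git show HEAD~1:analysis.txt)
echo "previous version: $old"

# Restore that version into the working directory and record the recovery.
git checkout -q HEAD~1 -- analysis.txt
git commit -q -m "Restore working analysis"
cat analysis.txt
```

The broken version is not erased: it stays in the history as its own commit, so the recovery itself can be reviewed later.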
The first step to collaborative workflows is granting access to the origin of your project.
Introduce yourself to your neighbour, and ask for his/her GitHub username.
Add your neighbour as a collaborator, and clone his/her test repo.
cd %sandbox%
git clone https://github.com/%not-my-username%/test.git not-my-test
cd not-my-test
Edit the README.md from your neighbour’s repo, by adding a fourth bullet point.
# Welcome to My Project
This project includes the following:
+ nothing
+ zero
+ nada
+ %your bullet point%
Now do a commit & push. Note that we have tucked the “add” step into the commit with the “-a” option, which stages every modified file that git already tracks.
git commit -am 'Add important bullet to README.md'
[master %hash%] Add important bullet to README.md
1 file changed, 1 insertion(+), 1 deletion(-)
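One caveat about the “a” shortcut: it stages changes to files git already tracks, but not brand-new files. A minimal sketch in a scratch repository (notes.txt is a hypothetical file name) shows the difference:

```shell
set -e
repo=$(mktemp -d)   # scratch repository for the demonstration
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"
echo "# test" > README.md
git add README.md
git commit -q -m "Initial commit"

# Modify a tracked file and create a brand-new, untracked one.
echo "+ a fourth bullet" >> README.md
echo "scratch notes" > notes.txt   # hypothetical file name

# -a stages modifications to tracked files only, then commits.
git commit -q -am "Add important bullet to README.md"

# README.md was committed; notes.txt is still untracked (marked "??").
leftover=$(git status --porcelain)
echo "$leftover"
```

A new file still needs an explicit git add before its first commit.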
git push
Username for 'https://github.com': %username%
Password for 'https://%username%@github.com':
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 353 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/%not-my-username%/test.git
%hash%..%hash% master -> master
The origin for your local “test” repo now has a commit you don’t have locally – the one your collaborator pushed. Let’s compound the problem by adding a local commit that your origin doesn’t have. Change the title of your project in README.md in your local test repo.
# Welcome to My *Amazing* Project
This project includes the following:
+ nothing
+ zero
+ nada
Try the usual routine: commit & push.
cd %sandbox%
cd test
git commit -am 'amazing'
[master %hash%] amazing
1 file changed, 1 insertion(+), 1 deletion(-)
git push
To https://github.com/%username%/test.git
! [rejected] master -> master (fetch first)
error: failed to push some refs to 'https://github.com/%username%/test.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
Take a moment to read the message – it gives a good explanation of what just happened.
Take the Hint!
git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/%username%/test
15bc488..26c2dcd master -> origin/master
Auto-merging README.md
Merge made by the 'recursive' strategy.
README.md | 1 +
1 file changed, 1 insertion(+)
The message tells you about any changes made by this merge commit, which seamlessly integrates changes to the same file by multiple authors.
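The whole episode (rejected push, pull with merge, successful push) can be replayed locally. This is a sketch under stated assumptions: a bare repository on disk stands in for GitHub, the directory names (seed, origin.git, mine, theirs) and identities are hypothetical, and the pull passes --no-rebase and --no-edit because recent Git releases may otherwise stop and ask how to reconcile divergent branches or open an editor for the merge message.

```shell
set -e
work=$(mktemp -d)   # scratch area; every path below is hypothetical
cd "$work"

# Stand-in for the GitHub origin: a bare repository seeded with a README.
git init -q seed
cd seed
git config user.name "Seed Author"
git config user.email "seed@example.org"
printf '%s\n' '# Welcome to My Project' '' \
  'This project includes the following:' '' \
  '+ nothing' '+ zero' '+ nada' > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed origin.git
git clone -q origin.git mine
git clone -q origin.git theirs

# The collaborator adds a bullet and pushes first.
cd theirs
git config user.name "Collaborator"
git config user.email "collaborator@example.org"
echo '+ a fourth bullet' >> README.md
git commit -q -am "Add important bullet to README.md"
git push -q origin HEAD
cd ../mine

# Meanwhile, a local commit changes the title.
git config user.name "Owner"
git config user.email "owner@example.org"
sed 's/My Project/My Amazing Project/' README.md > README.tmp
mv README.tmp README.md
git commit -q -am "amazing"

# The push is rejected: the origin holds a commit this clone lacks.
if git push -q origin HEAD 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi

# Pull fetches the collaborator's commit and merges it with the local one.
git pull -q --no-rebase --no-edit
git push -q origin HEAD

merges=$(git rev-list --count --merges HEAD)
echo "first push: $first_push, merge commits: $merges"
cat README.md
```

The merge succeeds without manual intervention because the two commits touch different parts of README.md; edits to the same or adjacent lines would instead produce a conflict to resolve by hand.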
Merge Commits
Check out the Log
Version control gives you access to the state of the repository at any commit. To view the history, look at the log.
cd %sandbox%
cd test
git log
commit 0517b3b2258e6cce76770646f175dc8abfe9e148
Merge: 8612809 26c2dcd
Author: Ian Carroll <icarroll@sesync.org>
Date: Tue Jul 26 14:53:22 2016 -0400
Merge branch 'master' of https://github.com/itcarroll/test
commit 8612809b6eeea263a853783cf4c37a6862a31d22
Author: Ian Carroll <icarroll@sesync.org>
Date: Tue Jul 26 13:48:57 2016 -0400
amazing
See how helpful concise & descriptive commit messages would be?
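For a quick scan of the history, the log can be condensed to one line per commit, which is where short, descriptive messages pay off. A minimal sketch, assuming a scratch repository with three illustrative commits (the messages and identity are made up for the demo):

```shell
set -e
repo=$(mktemp -d)   # scratch repository for the demonstration
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Three small commits, each with a short, descriptive message.
echo "# test" > README.md
git add README.md
git commit -q -m "Add README with project title"
echo "+ nothing" >> README.md
git commit -q -am "List first project component"
echo "+ zero" >> README.md
git commit -q -am "List second project component"

# One line per commit: abbreviated hash followed by the message.
git log --oneline

# Only the messages, newest first, captured for use in a script.
subjects=$(git log --format=%s)
echo "$subjects"
```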
Let’s investigate a commit we are not so sure about.
git show 8612
commit 8612809b6eeea263a853783cf4c37a6862a31d22
Author: Ian Carroll <icarroll@sesync.org>
Date: Tue Jul 26 13:48:57 2016 -0400
amazing
diff --git a/README.md b/README.md
index 521cb5d..24a865d 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Welcome to My Project
+# Welcome to My Amazing Project
This project does the following:
+ nothing
git revert --no-edit 8612
[master b0aaef0] Revert "amazing"
1 file changed, 1 insertion(+), 1 deletion(-)
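A revert does not delete the suspect commit; it adds a new commit that applies the opposite change. The sketch below replays the show-then-revert steps in a scratch repository, with a placeholder identity and a one-line README assumed for brevity:

```shell
set -e
repo=$(mktemp -d)   # scratch repository for the demonstration
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

printf '# Welcome to My Project\n' > README.md
git add README.md
git commit -q -m "Initial commit"
printf '# Welcome to My Amazing Project\n' > README.md
git commit -q -am "amazing"

# Inspect the commit we are unsure about, then undo it with a new commit.
suspect=$(git rev-parse --short HEAD)
git show --stat "$suspect"
git revert --no-edit "$suspect"

title=$(head -n 1 README.md)
echo "title is now: $title"
git log --oneline
```

The log ends up with three commits: the original, the "amazing" change, and its revert, so the audit trail stays complete.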
Summary
The test repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only to your collaborators, but also available for review and extension by your research community.
Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the log.
Sharing project files, including managing multiple “copies” during development or at public release, becomes a streamlined cloning process in a hub-and-spokes workflow. The origin always has the “most recent” version of every document, because conflicts must be resolved in the local clone before new commits can be shared.