Collaborative & Reproducible Workflows
Lesson 3a with Ian Carroll
Contents
- Objectives for this lesson
- Centralized Workflow
- A Plug for Reproducible Research
- Reproducible research: the end result
- What’s a GitHub? What’s a “repo”?
- Get Started with GitHub
- RStudio Projects
- Create a new file
- Working with Collaborators
- Where to from here?
Objectives for this lesson
- Learn about centralized workflows
- Identify attributes of reproducible research
- See how RStudio + git facilitates collaboration
Specific achievements
- Create a git repository on GitHub
- Make commits to a project file
- Synchronize to a repository with RStudio
- Collaborate with another GitHub user
Overview
As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.
This lesson gives a high-level strategy to organize your workflow and introduces an accompanying software solution, git.
Centralized Workflow
One strategy for distributed work among a team of scientists, the centralized workflow, dominates collaborative research. There is a central hub that stores all the project files. A number of researchers are spokes around that hub and work independently on private copies of the project. The integrity of the project is maintained by rules, enforced by the hub, for synchronizing between hub and spokes.
A Plug for Reproducible Research
Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.
Does the same principle apply to modeling and data analysis? You bet!
Hallmarks of reproducible research:
Reviewable | All details of the method used are easily accessible for peer review and community review. |
Auditable | Records exist to document how the methods and conclusions evolved, but may be private. |
Replicable | Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork. |
Open | The originator grants permissions for reuse and extension of the research products. |
Use your workflow to achieve these same goals:
Reviewable | Write-ups and thoroughly-commented scripts shared among collaborators |
Auditable | Versioned project history, used to revert mistakes when necessary |
Replicable | “One-click” file & data sharing, as well as streamlined recreation of analyses |
Open | GitHub (or similar) based centralized workflow |
Reproducible research: the end result
A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science who publishes clear & reproducible workflows. The work leading up to “Pretty Darn Good Control” is on GitHub, a website integrated with the git version control system.
Integrating git with cloud services like GitHub creates a complete system for project management, collaboration and sharing.
The “pipeline” analogy
Credit: Philip Guo
What’s a GitHub? What’s a “repo”?
Open up the repository that provides the “handouts” for this workshop.
- README.md is a Markdown file giving basic information about the repository.
- There is a list of files, including a folder for data.
- You are looking at a branch called master.
- The commit history is available from the top bar.
- The “Clone or download” button provides a URL.
Centralized Workflow
The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.
Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.
A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is at the same commit as the origin.
A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.
A push copies local commits to the origin and applies them remotely.
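The clone, push and pull cycle can be tried without a GitHub account. The sketch below is a toy version that assumes a local bare repository can stand in for the origin; the directory names and the two collaborators (alice, bob) are hypothetical.

```shell
#!/bin/sh
# Toy hub-and-spokes cycle: a local bare repository plays the role of the
# origin on GitHub (an assumption made purely for illustration).
set -e
work=$(mktemp -d)
cd "$work"

# The hub: a bare repository stores history but has no working files.
git init -q --bare origin.git
git -C origin.git symbolic-ref HEAD refs/heads/master

# First spoke: clone the (still empty) origin, commit a README, push it.
git clone -q origin.git alice 2>/dev/null
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice Example"
git config user.email "alice@example.org"
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin master
cd ..

# Second spoke: a fresh clone arrives with the full history so far.
git clone -q origin.git bob
bob_commits_at_clone=$(git -C bob rev-list --count HEAD)
cd bob
git config user.name "Bob Example"
git config user.email "bob@example.org"
echo "- Work with a team." >> README.md
git commit -q -am "Add a goal to the README"
git push -q origin master
cd ..

# The first spoke pulls the teammate's commit from the origin.
git -C alice pull -q --ff-only origin master
alice_commits=$(git -C alice rev-list --count HEAD)
echo "bob cloned $bob_commits_at_clone commit(s); alice now has $alice_commits"
```

Each collaborator only ever exchanges commits with the origin, never directly with another spoke, which is what makes the workflow centralized.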
Get Started with GitHub
Sign in or create a GitHub account.
Create a GitHub repository
Create a new repository on your GitHub page. Name it whatever you like, but do check the box to create a “README”.
Once it’s created, find the “Clone or download” URL beginning with “https://”. We’ll need that later.
README.md
To begin making this project your own, modify the README. Tell us something about why you’re here!
# SESYNC Spatial ABM Course
Goals
- Learn to build Spatial ABMs!
The “.md” extension stands for “markdown”, which is a syntax for simple plain text “formatting”. Add a commit message when you save.
RStudio Projects
RStudio is an example of an integrated development environment (IDE) and focuses on:
- Editing scripts written in the R language.
- Running R language commands or programs in the R interpreter.
- Helping to manage many components of a collaborative project using version control.
Integration with git
RStudio provides convenient access to the core tools provided by git, so any project can also be a repository.
Under the File menu, create a new project from a remote version control repository.
Files under version control
Software is written in plain text, and version control is designed for software development. A scripted workflow relies heavily on plain text files, but may include different file types for figures or data.
For this reason, a plain text editor is a core element of the IDE. The editor in RStudio is good for any kind of text documents: you could edit R scripts, NetLogo models, LaTeX documents, or even CSV files.
Configure git
The system function sends a given string directly to the operating system, which uses the git program itself to do something we can’t do through RStudio.
system("git config user.name '<Full Name>'")
system("git config user.email '<email>'")
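The strings passed to system are ordinary git commands, so the same configuration can be entered in a terminal. A minimal sketch, assuming a throwaway repository and placeholder identity values (both hypothetical), that also reads the settings back:

```shell
#!/bin/sh
# Set and verify the identity git records in each commit.
# The name and email below are placeholders, not real values.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Without --global, these settings apply to this repository only.
git config user.name "Full Name"
git config user.email "name@example.org"

# Read the values back to confirm they were stored.
name=$(git config --get user.name)
email=$(git config --get user.email)
echo "Commits will be recorded as: $name <$email>"
```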
Create a new file
Copy the .nlogo file saved from the NetLogo Programming lesson into this directory:
turtles-own [energy]
...
@#$#@#$#@
GRAPHICS-WINDOW
Track it with git
Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) to watch it.
- Go to the git tab in RStudio
- Select commit to open the “Review Changes” window
- Select “Staged” to add (hence “A”) the new file.
- Enter a descriptive message about the commit.
- Commit!
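The steps above correspond to a few git commands behind the RStudio buttons. A sketch under stated assumptions: a throwaway repository, and a hypothetical file named model.nlogo standing in for your NetLogo model.

```shell
#!/bin/sh
# Track a new file: untracked ("??") -> staged/added ("A") -> committed.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Full Name"
git config user.email "name@example.org"

# model.nlogo is a hypothetical stand-in for the file copied above.
echo "turtles-own [energy]" > model.nlogo

before=$(git status --short)   # "?? model.nlogo": git is not watching it yet
git add model.nlogo            # same effect as ticking "Staged" in RStudio
after=$(git status --short)    # "A  model.nlogo": added, ready to commit

git commit -q -m "Add NetLogo model from the programming lesson"
commits=$(git rev-list --count HEAD)
echo "before: $before | after: $after | commits: $commits"
```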
Push
Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
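Being “ahead of origin/master” means the local branch holds commits the origin has not yet received. The sketch below counts them before and after a push; it assumes a local bare repository as a stand-in for GitHub, and all file and directory names are hypothetical.

```shell
#!/bin/sh
# Count how many local commits the origin is missing, then push them.
set -e
work=$(mktemp -d)
cd "$work"

# Build a stand-in origin that already holds one README commit.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Full Name"
git config user.email "name@example.org"
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
cd ..
git clone -q --bare seed origin.git

# Clone it, then make one local commit.
git clone -q origin.git project
cd project
git config user.name "Full Name"
git config user.email "name@example.org"
echo "turtles-own [energy]" > model.nlogo
git add model.nlogo
git commit -q -m "Add NetLogo model"

# Commits reachable from HEAD but not from origin/master are "ahead".
ahead_before=$(git rev-list --count origin/master..HEAD)
git push -q origin master
ahead_after=$(git rev-list --count origin/master..HEAD)
echo "ahead before push: $ahead_before, after push: $ahead_after"
```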
Challenge
- Enter a description into the “WHAT IS IT?” section of the NetLogo model.
- Commit and push your work.
Working with Collaborators
True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.
- Be it data, a script, or a write-up, who has the most up-to-date version?
- Will a teammate’s work overwrite any of your own?
- How do I recover the working version of code the PI broke?
A centralized workflow, managed by git, helps answer these questions.
Centralized Workflow
- The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
- Diverging projects are easily reintegrated with a merge algorithm.
- The complete project history is available to checkout.
Note, version control works really well with text. Non-textual components of your project (e.g. spatial data) need advanced treatment.
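To see how the origin keeps diverging work from overwriting a teammate's commits, here is a hedged sketch with two hypothetical collaborators (ana, ben) and a local bare repository standing in for GitHub. The second push is refused until the missing commit has been pulled and merged locally.

```shell
#!/bin/sh
# Two clones diverge; the origin rejects the stale push until it is merged.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in origin holding one README commit.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Seed"
git config user.email "seed@example.org"
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"
cd ..
git clone -q --bare seed origin.git

# Two collaborators each get their own clone.
for who in ana ben; do
  git clone -q origin.git "$who"
  git -C "$who" config user.name "$who"
  git -C "$who" config user.email "$who@example.org"
done

# Each commits independently, to different files.
(cd ana && echo "analysis" > analysis.R && git add analysis.R && git commit -q -m "Add analysis script")
(cd ben && echo "notes" > notes.md && git add notes.md && git commit -q -m "Add notes")

# Ana pushes first; the origin now has a commit Ben lacks.
git -C ana push -q origin master

# Ben's push would discard Ana's commit from the branch, so it is refused.
if git -C ben push -q origin master 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi

# Ben merges the origin's work into his clone, then pushes successfully.
git -C ben pull -q --no-rebase --no-edit origin master
git -C ben push -q origin master
merged_commits=$(git -C ben rev-list --count HEAD)
echo "first push: $first_push; commits after merge: $merged_commits"
```

Because the two commits touched different files, the merge needs no manual intervention; edits to the same lines of one file would instead stop with a conflict to resolve in the local clone.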
The first step to collaborative workflows is granting access to the origin of your project.
Introduce yourself to your neighbour, and ask for his/her GitHub username.
Add your neighbour as a collaborator, and accept your neighbour’s invitation to collaborate!
Editing on GitHub
Edit the README.md from your neighbour’s repo, by adding more goals to their README.
# SESYNC Spatial ABMs Course
Goals
- Learn to build Spatial ABMs!
- Work with a team.
Always write a meaningful commit message when you save!
Challenge
- Create a new RStudio project from your neighbour’s repository.
- Add a comment to explain what part of the code does.
- Commit and push your work.
Where to from here?
The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it not only accessible to your collaborators, but also available for review and extension by your research community.
In a hub-and-spokes workflow, sharing project files (including managing multiple “copies” during development or at public release) is a streamlined cloning process. The origin always has the “most recent” version of any documents: conflicts must be resolved in the local clone before new commits can be shared.
Using advanced git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.