Collaborative & Reproducible Workflows

Lesson 3a with Ian Carroll


Objectives for this lesson

Specific achievements

Overview

As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.

This lesson gives a high-level strategy for organizing your workflow and introduces an accompanying software solution, git.



Centralized Workflow

One strategy for distributed work among a team of scientists, the centralized workflow, dominates collaborative research. There is a central hub that stores all the project files. A number of researchers are spokes around that hub and work independently on private copies of the project. The integrity of the project is maintained by rules, enforced by the hub, for synchronizing between hub and spokes.



A Plug for Reproducible Research

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

Does the same principle apply to modeling and data analysis? You bet!

Hallmarks of reproducible research:

Reviewable: All details of the method used are easily accessible for peer review and community review.
Auditable: Records exist to document how the methods and conclusions evolved, but may be private.
Replicable: Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open: The originator grants permissions for reuse and extension of the research products.

Use your workflow to achieve these same goals:

Reviewable: Write-ups and thoroughly-commented scripts shared among collaborators
Auditable: Versioned project history, used to revert mistakes when necessary
Replicable: “One-click” file & data sharing, as well as streamlined recreation of analyses
Open: GitHub (or similar) based centralized workflow



Reproducible research: the end result

A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science who publishes clear & reproducible workflows. The work leading up to “Pretty Darn Good Control” is on GitHub, a website integrated with the git version control system.

Integrating git with cloud services like GitHub creates a complete system for project management, collaboration and sharing.

The “pipeline” analogy


Credit: Philip Guo



What’s a GitHub? What’s a “repo”?

Open up the repository that provides the “handouts” for this workshop.

Centralized Workflow


Image by Atlassian / CC BY

The origin is the central repository; in this case, it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.


Image by Atlassian / CC BY

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls from the origin when ready to incorporate the work of teammates, and she pushes updates to the origin when ready to contribute work of her own.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy sits at the same commit as the origin.


Image by Atlassian / CC BY

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.


Image by Atlassian / CC BY


A push copies local commits to the origin and applies them remotely.


Image by Atlassian / CC BY
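
The clone, push and pull cycle described above can be tried from a terminal without touching GitHub. This is only a sketch under some assumptions: a local bare repository stands in for the origin, the teammates "alice" and "bob" and all file contents are made up, and git is assumed to be installed.

```shell
set -e
# Hypothetical scratch directory; a local bare repository plays the origin.
hub_demo=$(mktemp -d)

# Placeholder identity for the scratch commits, shared by every clone below.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

# Seed a project with a README and publish it as the bare "origin".
git init -q "$hub_demo/seed"
cd "$hub_demo/seed"
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
git clone -q --bare "$hub_demo/seed" "$hub_demo/origin.git"

# Two teammates clone the origin: each clone holds the full history.
git clone -q "$hub_demo/origin.git" "$hub_demo/alice"
git clone -q "$hub_demo/origin.git" "$hub_demo/bob"

# Alice commits locally, then pushes the commit to the origin.
cd "$hub_demo/alice"
echo "- Work with a team." >> README.md
git commit -q -am "Add a goal to the README"
git push -q origin HEAD

# Bob pulls, which applies Alice's commit to his clone.
cd "$hub_demo/bob"
git pull -q --ff-only
bob_commits=$(git rev-list --count HEAD)
echo "Bob's clone now has $bob_commits commits"
```

With a real GitHub repository, the only difference is that the clone URL begins with “https://” instead of pointing at a local directory.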




Get Started with GitHub

Sign in or create a GitHub account.

Create a GitHub repository

Create a new repository on your GitHub page. Name it whatever you like, but do check the box to create a “README”.

Once it’s created, find the “Clone or download” URL beginning with “https://”. We’ll need that later.

README.md

To begin making this project your own, modify the README. Tell us something about why you’re here!

# SESYNC Spatial ABM Course

Goals

- Learn to build Spatial ABMs!

The “.md” extension stands for “markdown”, which is a syntax for simple plain text “formatting”. Add a commit message when you save.



RStudio Projects

RStudio is an example of an integrated development environment (IDE) and focuses on

  1. Editing scripts written in the R language.
  2. Running R language commands or programs in the R interpreter.
  3. Helping to manage many components of a collaborative project using version control.

Integration with git

RStudio provides convenient access to the core tools provided by git, so any project can also be a repository.

Under the File menu, create a new project from a remote version control repository.

Files under version control

Software is written in plain text, and version control is designed for software development. A scripted workflow relies heavily on plain text files, but may include different file types for figures or data.

For this reason, a plain text editor is a core element of the IDE. The editor in RStudio is good for any kind of text documents: you could edit R scripts, NetLogo models, LaTeX documents, or even CSV files.

Configure git

R’s system function sends a given string directly to the operating system, which runs the git program itself to do something we can’t do through RStudio.

system("git config user.name '<Full Name>'")
system("git config user.email '<email>'")
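
The same settings can be made and checked directly in a terminal. The sketch below uses a hypothetical scratch repository and placeholder values. Without the `--global` flag, `git config` writes to the current repository only, so the setting applies to that one project.

```shell
set -e
config_demo=$(mktemp -d)   # hypothetical scratch repository
git init -q "$config_demo"
cd "$config_demo"

# Same settings as the system() calls above, with placeholder values.
git config user.name "Full Name"
git config user.email "name@example.org"

# Read the values back to confirm what git will stamp on each commit.
configured_name=$(git config user.name)
configured_email=$(git config user.email)
echo "$configured_name <$configured_email>"
```

Adding `--global` to the two `git config` lines would apply the identity to every repository for your user account instead.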



Create a new file

Copy the .nlogo file saved from the NetLogo Programming lesson into this directory:

turtles-own [energy]

...

@#$#@#$#@
GRAPHICS-WINDOW

Track it with git

Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) to watch it.

  1. Go to the “Git” tab in RStudio.
  2. Select “Commit” to open the “Review Changes” window.
  3. Check the “Staged” box to add (hence “A”) the new file.
  4. Enter a descriptive message about the commit.
  5. Commit!
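
For reference, the steps above correspond roughly to the following git commands. This is a sketch in a hypothetical scratch repository; the file name `model.nlogo` and the identity values are placeholders.

```shell
set -e
track_demo=$(mktemp -d)   # hypothetical scratch repository
git init -q "$track_demo"
cd "$track_demo"
git config user.name "Full Name"
git config user.email "name@example.org"

# Stand-in for the model file saved from the NetLogo lesson.
echo "turtles-own [energy]" > model.nlogo

status_before=$(git status --short)   # "??" marks an untracked file
git add model.nlogo                   # stage the new file
status_after=$(git status --short)    # "A" marks a file added to the index
git commit -q -m "Add NetLogo model from the programming lesson"
tracked_files=$(git ls-files)

echo "before: $status_before"
echo "after:  $status_after"
echo "tracked: $tracked_files"
```

The “A” shown by `git status --short` is the same marker RStudio displays once the “Staged” box is checked.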

Push

Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
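
What “ahead of origin” means can be checked by counting the local commits the origin has not yet received. The sketch below assumes a local bare repository as a stand-in for your GitHub repo; the paths and identity values are placeholders.

```shell
set -e
push_demo=$(mktemp -d)   # hypothetical scratch directory
git init -q --bare "$push_demo/origin.git"
git init -q "$push_demo/clone"
cd "$push_demo/clone"
git config user.name "Full Name"
git config user.email "name@example.org"
git remote add origin "$push_demo/origin.git"

# First commit, pushed with -u so the branch tracks its copy on the origin.
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin HEAD

# A new local commit puts the branch one commit ahead of the origin.
echo "- Learn to build Spatial ABMs!" >> README.md
git commit -q -am "Add a goal to the README"
ahead_before=$(git rev-list --count "@{u}..HEAD")

# Pushing copies that commit to the origin, so nothing is left to send.
git push -q origin HEAD
ahead_after=$(git rev-list --count "@{u}..HEAD")
echo "ahead before push: $ahead_before, after push: $ahead_after"
```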

Challenge

  1. Enter a description into the “WHAT IS IT?” section of the NetLogo model.
  2. Commit and push your work.



Working with Collaborators

True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.

A centralized workflow, managed by git, helps meet these challenges.

Centralized Workflow

Note that version control works best with plain text. Non-textual components of your project (e.g. spatial data) need advanced treatment.

The first step to collaborative workflows is granting access to the origin of your project.

Introduce yourself to your neighbour, and ask for their GitHub username.

Add your neighbour as a collaborator, and accept your neighbour’s invitation to collaborate!

Editing on GitHub

Edit the README.md from your neighbour’s repo, by adding more goals to their README.

# SESYNC Spatial ABM Course

Goals

- Learn to build Spatial ABMs!
- Work with a team.

Always write a meaningful commit message when you save!

Challenge

  1. Create a new RStudio project from your neighbour’s repository.
  2. Add a comment to explain what a part of the code does.
  3. Commit and push your work.



Where to from here?

The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it not only accessible to your collaborators, but also available for review and extension by your research community.

In a hub-and-spokes workflow, sharing project files, including managing multiple “copies” during development or at public release, is a streamlined cloning process. The origin always has the “most recent” version of any document: conflicts must be resolved in the local clone before new commits can be shared.
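
The rule that local work must be reconciled before it can be shared can be seen in a small experiment. This sketch makes several assumptions: a local bare repository stands in for the origin, the teammates "alice" and "bob" are made up, and their changes touch different files, so git can merge them without manual conflict resolution.

```shell
set -e
sync_demo=$(mktemp -d)   # hypothetical scratch directory

# Placeholder identity for the scratch commits, shared by every clone below.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

git init -q "$sync_demo/seed"
cd "$sync_demo/seed"
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
git clone -q --bare "$sync_demo/seed" "$sync_demo/origin.git"
git clone -q "$sync_demo/origin.git" "$sync_demo/alice"
git clone -q "$sync_demo/origin.git" "$sync_demo/bob"

# Alice shares a commit first.
cd "$sync_demo/alice"
echo "notes from Alice" > alice.txt
git add alice.txt
git commit -q -m "Add Alice's notes"
git push -q origin HEAD

# Bob commits without pulling, so the origin rejects his push.
cd "$sync_demo/bob"
echo "notes from Bob" > bob.txt
git add bob.txt
git commit -q -m "Add Bob's notes"
if git push -q origin HEAD 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi

# Bob merges the origin's work into his clone, then shares the result.
git pull -q --no-rebase --no-edit
git push -q origin HEAD
origin_commits=$(git --git-dir="$sync_demo/origin.git" rev-list --count HEAD)
echo "first push: $first_push, commits on origin: $origin_commits"
```

Had both teammates edited the same lines of the same file, the pull would stop and ask Bob to resolve the conflict by hand before committing the merge.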

Using advanced git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.
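
A minimal sketch of such a branching and merging “tree” follows, in a hypothetical scratch repository. The branch name `analysis`, the file names and the identity values are all placeholders.

```shell
set -e
branch_demo=$(mktemp -d)   # hypothetical scratch repository
git init -q "$branch_demo"
cd "$branch_demo"
git config user.name "Full Name"
git config user.email "name@example.org"

echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
main_branch=$(git rev-parse --abbrev-ref HEAD)   # "master" or "main"

# Parallel work happens on a separate branch.
git checkout -q -b analysis
echo "plot(results)" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Meanwhile, the main line of development also moves forward.
git checkout -q "$main_branch"
echo "- Work with a team." >> README.md
git commit -q -am "Add a goal to the README"

# Merging joins the two lines of work and records the join in the history.
git merge -q --no-edit analysis
git log --oneline --graph
merge_parents=$(git log -1 --pretty=%P | wc -w | tr -d ' ')
echo "the merge commit has $merge_parents parents"
```

The graph printed by `git log` is the auditable trail: each commit on either branch stays recoverable after the merge.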
