Collaborative & Reproducible Workflows

Lesson 1 with Ian Carroll

This lesson gives a high-level overview of workflows for organizing your project and introduces an accompanying software solution, git.



Workflow diagram


Credit: Philip Guo

A real research workflow

A professor at UC Berkeley, Carl Boettiger, is a strong advocate for open science who publishes clear & reproducible workflows. The work leading up to “Pretty Darn Good Control” is on GitHub, a website integrated with the git version control system.

Integrating git with cloud services like GitHub creates a complete system for project management, collaboration and sharing.



A single collaboration model – the centralized workflow – dominates collaborative research. There is a central hub, and every researcher is a node that synchronizes their work to that one place.



Objectives for this lesson

Specific achievements



A Plug for Reproducible Research

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

Does the same principle apply to data analysis? You bet!

Hallmarks of reproducible research:

Reviewable: All details of the method used are easily accessible for peer review and community review.
Auditable: Records exist to document how the methods and conclusions evolved, but may be private.
Replicable: Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
Open: The originator grants permissions for reuse and extension of the research products.

Use your workflow to achieve these same goals:

Reviewable: Write-ups and thoroughly-commented scripts shared among collaborators
Auditable: Versioned project history, used to revert mistakes when necessary
Replicable: “One-click” file & data sharing, as well as streamlined recreation of analyses
Open: GitHub (or similar) based centralized workflow

Also, there’s this …



What’s a GitHub? What’s a “repo”?

Open up the repository that provides the “handouts” for this workshop.

Centralized Workflow


Image by Atlassian / CC BY

The origin is the central repository; in this case it lives on GitHub. Every member of the team gets a local copy of the entire project, called a clone.


Image by Atlassian / CC BY

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is at the same commit as the origin.


Image by Atlassian / CC BY
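
To make "a commit is a unit of work" concrete, here is a minimal sketch in shell that builds a throwaway repository with two commits, the first touching two files at once. The file names, commit messages, and demo identity are all hypothetical; the commands can be run in a terminal, or passed one at a time to R's system function.

```shell
set -e
# Hypothetical identity for this demo only; no git configuration is changed
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q .

# First commit: one unit of work covering two files
echo "x <- 1" > analysis.R
echo "# Notes" > README.md
git add analysis.R README.md
git commit -q -m "Start analysis and notes"

# Second commit: a change to a single file
echo "y <- 2" >> analysis.R
git commit -q -am "Add second variable"

# The project history is a chain of commits on one branch
git log --oneline

files_in_first=$(git show --pretty=format: --name-only HEAD~1 | grep -c .)
commit_count=$(git rev-list --count HEAD)
echo "files in first commit: $files_in_first"
echo "commits on this branch: $commit_count"
```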

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.


Image by Atlassian / CC BY

A push copies local commits to the origin and applies them remotely.


Image by Atlassian / CC BY
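
The clone, push, and pull cycle described above can be sketched without GitHub by letting a bare repository on disk stand in for the origin. Everything named here (the seed project, the alice and bob clones, the demo identity) is an assumption made for illustration, not part of the workshop handouts.

```shell
set -e
# Hypothetical identity for this demo only
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

demo=$(mktemp -d)
cd "$demo"

# Build a one-commit project, then a bare copy that plays the role of the origin
git init -q seed
(cd seed && echo "# Demo" > README.md && git add README.md && git commit -q -m "Add README")
git clone -q --bare seed origin.git

# Two teammates each clone the origin, history included
git clone -q origin.git alice
git clone -q origin.git bob

# Alice commits locally, then pushes her commit to the origin
(cd alice && echo "Goals" >> README.md && git commit -q -am "Add goals heading" && git push -q origin HEAD)

# Bob pulls, which applies Alice's commit to his local repo
(cd bob && git pull -q --ff-only)

alice_head=$(git -C alice rev-parse HEAD)
bob_head=$(git -C bob rev-parse HEAD)
origin_head=$(git -C origin.git rev-parse HEAD)
echo "alice:  $alice_head"
echo "bob:    $bob_head"
echo "origin: $origin_head"
```

If the three lines print the same commit identifier, all three repositories are in sync.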



RStudio Projects

This software is an example of an integrated development environment and focuses on

  1. Creating projects that use the R programming language, and
  2. Running R language commands or programs in the R interpreter.

R is both a language and an interpreter.

Integration with git

RStudio provides convenient access to the core tools provided by git, so any project can also be a repository.

Under the File menu, create a new project from a remote version control repository.

Files under version control

Software is written in plain text, and version control is designed for software development. A scripted workflow relies heavily on plain text files, but may include different file types for figures or data.

For this reason, a plain text editor is a core element of the IDE. The editor in RStudio is good for any kind of text document: you could edit R scripts, C++ code, LaTeX documents, or even CSV files.

README.md

To begin making this project your own, modify the README. Tell us something about why you’re here!

# SESYNC Computational Synthesis Institute

Goals

- ...

The “.md” extension stands for “markdown”, which is a syntax for simple plain text “formatting”.



Save your work

Do save the README.md file.

But I mean really save your work, by committing it to your project with a version control system (that’s git!).

  1. Go to the git tab in RStudio
  2. Select commit to open the “Review Changes” window
  3. Select the file(s) you want to commit.
  4. Enter a descriptive message about the commit.
  5. Commit!
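
For comparison, the five steps above correspond roughly to the following git commands. This is a sketch in a throwaway repository, assuming a hypothetical demo identity and commit message; it is not a record of what RStudio runs behind the scenes.

```shell
set -e
# Hypothetical identity for this demo only
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q .
printf '# SESYNC Computational Synthesis Institute\n\nGoals\n\n- ...\n' > README.md

# Steps 1-2: review what has changed
git status --short

# Step 3: select the file(s) to commit
git add README.md

# Steps 4-5: commit with a descriptive message
git commit -q -m "Describe my goals in the README"

git log --oneline
commit_count=$(git rev-list --count HEAD)
pending=$(git status --short)
```

An empty `git status --short` after the commit means nothing is left unsaved.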

Create a GitHub repository

Create a new repository on your GitHub page. Name it whatever you like, but leave it empty (no README!).

Once it’s created, find the “Clone or download” URL beginning with “https://”.

Configure git

The system function in lesson-1.R sends the string directly to the operating system, which uses the git program itself to do something we can’t do through RStudio.

# Configure git

system('git config --global user.name "Ian Carroll"')
system("git config --global user.email icarroll@sesync.org")

Question
Why did I put quotes around my name but not my e-mail?
Answer
A space usually marks the end of a string; the quotes are an alternative way to mark its bounds, so the space inside my name is kept.
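
The effect of the quotes can be checked directly in a shell. The sketch below is an illustration under assumptions: it writes to a throwaway repository's local configuration (no --global flag) so your real settings are untouched, and count_args is a hypothetical helper that only reports how many arguments it received.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Same settings as in the lesson, but stored in this repository only
git config user.name "Ian Carroll"
git config user.email icarroll@sesync.org

name=$(git config --get user.name)
email=$(git config --get user.email)
echo "$name <$email>"

# Hypothetical helper: print the number of arguments received
count_args() { echo "$#"; }

unquoted=$(count_args Ian Carroll)    # the space splits the name in two
quoted=$(count_args "Ian Carroll")    # the quotes keep it as one string
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```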

Change origin repo

Open “Tools” > “Project Options” > “Git/SVN” and notice that the origin is the SESYNC-CI organization’s URL.

# Set a new origin URL

system("git remote set-url origin https://github.com/%username%/%repo%")
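
To confirm that the change took effect, you can ask git to list its remotes. The sketch below runs in a throwaway repository, and both URLs are placeholders on example.org, not real projects; substitute your own GitHub URL.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Placeholder for the URL the project was originally cloned from
git remote add origin https://example.org/original/project.git
old_url=$(git config --get remote.origin.url)

# Point origin at a new (placeholder) repository URL
git remote set-url origin https://example.org/username/repo.git
new_url=$(git config --get remote.origin.url)

# List remotes with their fetch and push URLs
git remote -v
echo "origin changed from $old_url to $new_url"
```

Changing the URL does not contact the server, so a typo only shows up at the next push or pull.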

Push

Open the “Review Changes” window again and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
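
The "ahead of origin" message can also be seen from the command line. In this sketch a bare repository on disk stands in for your GitHub repo; the directory names and demo identity are hypothetical.

```shell
set -e
# Hypothetical identity for this demo only
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

demo=$(mktemp -d)
cd "$demo"
git init -q seed
(cd seed && echo "# Demo" > README.md && git add README.md && git commit -q -m "Add README")
git clone -q --bare seed origin.git   # stand-in for the GitHub repo
git clone -q origin.git work
cd work

# One local commit that the origin has not seen yet
echo "Goals" >> README.md
git commit -q -am "Add goals heading"

# The first line of this summary should end with [ahead 1]
git status -sb

# Count commits that exist locally but not on the upstream branch
ahead_before=$(git rev-list --count '@{u}..HEAD')
git push -q origin HEAD
ahead_after=$(git rev-list --count '@{u}..HEAD')
echo "ahead before push: $ahead_before, after push: $ahead_after"
```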



Where to from here?

The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only to your collaborators, but also available for review and extension by your research community.

Using git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, even on the same documents. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.

Sharing project files, including managing multiple “copies” during development or at public release, is a streamlined cloning process in a hub-and-spokes workflow. The origin always has the “most recent” version of any document: conflicts must be resolved in the local clone before new commits can be shared.
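
The last point can be sketched with two clones of a local stand-in origin. Here the two workers edit different files, so there is no textual conflict, yet the second push is still refused until the newer commits have been pulled and merged locally. All names and the demo identity are hypothetical.

```shell
set -e
# Hypothetical identity for this demo only
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

demo=$(mktemp -d)
cd "$demo"
git init -q seed
(cd seed && echo "# Demo" > README.md && git add README.md && git commit -q -m "Add README")
git clone -q --bare seed origin.git   # stand-in for the GitHub repo
git clone -q origin.git alice
git clone -q origin.git bob

# Alice shares a commit first
(cd alice && echo "analysis" > analysis.txt && git add analysis.txt \
  && git commit -q -m "Add analysis notes" && git push -q origin HEAD)

# Bob commits without having pulled Alice's work
cd bob
echo "data" > data.txt
git add data.txt
git commit -q -m "Add data notes"

# The origin refuses a push that would discard Alice's commit
if git push -q origin HEAD 2>/dev/null; then
  first_push="accepted"
else
  first_push="rejected"
fi
echo "first push: $first_push"

# Bob merges Alice's work in his local clone, then pushes again
git pull -q --no-rebase --no-edit
git push -q origin HEAD

bob_head=$(git rev-parse HEAD)
origin_head=$(git -C ../origin.git rev-parse HEAD)
```

Had both workers edited the same lines, the pull would have stopped for Bob to resolve the conflict by hand before pushing.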
