git and More Tools in the Shell

Lesson 5 with Ian Carroll



Centralized Workflow

As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.

This lesson explains a high-level strategy for organizing your collaborative workflow and introduces accompanying software and cloud solutions.

The strategy for distributed coding among a team of scientists—the centralized workflow—is widespread in collaborative research.

A central hub stores project files and their history. Researchers are spokes on the wheel, working on private copies of the project. Project integrity is maintained through rules enforced by the hub for synchronizing between hub and spokes.
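
A minimal sketch of the hub-and-spokes idea, using folders on one machine in place of a cloud service. The names hub.git, alice, bob, and notes.txt, and the author identity, are placeholders invented for illustration. Most of the git commands used here are introduced step by step later in this lesson.

```shell
# Hub and spokes on a single machine. All names are hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"

# The hub: a bare repository that stores history but has no working files.
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# First spoke: start a project and publish it to the hub.
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo "first note" > notes.txt
git add notes.txt
git commit -q -m "add notes"
git remote add origin "$sandbox/hub.git"
git push -q -u origin master
cd ..

# Second spoke: a private copy that includes the full history.
git clone -q hub.git bob

# Both the hub and the new spoke now point at the same commit.
hub_commit=$(git --git-dir=hub.git rev-parse master)
bob_commit=$(git -C bob rev-parse HEAD)
echo "hub: $hub_commit"
echo "bob: $bob_commit"
```

The two printed hashes should match, which is what "synchronized" means in this workflow.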



Objectives for this lesson

Specific achievements



Git in the Shell

The namesake of GitHub is the command-line utility git. It performs the clone, push, pull, and merge procedures at the heart of the centralized workflow, and many more.

The software has no GUI of its own and works through commands, always beginning with “git ”, given in the shell. The command to turn the “current folder” into a git repo is:

git init
Initialized empty Git repository in ~/handouts/.git/
Commit your changes with a descriptive but short commit message.

Add files to git’s watchlist with the “add” command.

git add README.md
git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   README.md

Untracked files:

“Commit” updates the added files in a newly labeled version of your project’s history.

git commit -m "initial commit"
*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <(null)>) not allowed

Every commit needs an author. Follow git’s instructions, using a real email address so your commits can be associated with your GitHub account, and try again.

git commit -m "initial commit"
[master (root-commit) <sha>] initial commit
 1 file changed, 10 insertions(+)
 create mode 100755 README.md
git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	CONTRIBUTING.md
	data/
	handouts.Rproj
	worksheet-1.R
	worksheet-2.R
	worksheet-3.R
	worksheet-4.R

nothing added to commit but untracked files present (use "git add" to track)

Check out the Log

Version control gives you access to the state of the repository at any previous commit. View this history in the log.

git log
commit <sha>
Author: <author>
Date:   <datetime>

    initial commit
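
As the log grows, a condensed view can be easier to scan. The sketch below builds a throwaway repository (the folder name, file contents, and author identity are made up) and prints one line per commit with `git log --oneline`.

```shell
# Throwaway repository with two commits. Names and identity are placeholders.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example Author"
git config user.email "author@example.com"

echo "# My project" > README.md
git add README.md
git commit -q -m "initial commit"

echo "More detail." >> README.md
git commit -q -a -m "describe the project"

# One line per commit, newest first: abbreviated hash, then the message.
git log --oneline

# Message of the most recent commit, and the number of commits so far.
latest=$(git log -1 --format=%s)
count=$(git log --oneline | wc -l)
echo "latest message: $latest"
```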

Exercise 1

Introduce a second commit that messes up your README.md or another file. Make sure it shows up in the log.

Revert

Let’s investigate the most recent commit.

git show
commit <sha>
Author: <author>
Date:   <datetime>

    <message>

<diff>

The <sha>, even just its first few digits at this stage, is unique to each commit. Use "revert" to undo the changes introduced in a specified commit.

git revert --no-edit <sha>
[master <sha>] Revert <message>
 1 file changed, 1 insertion(+), 1 deletion(-)
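
The following sketch walks through the same steps in a throwaway repository, to show that a revert does not erase history but adds a new commit that undoes an earlier one. The folder name, file contents, and author identity are assumptions for illustration.

```shell
# Throwaway repository. Names and identity are placeholders.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example Author"
git config user.email "author@example.com"

echo "good line" > README.md
git add README.md
git commit -q -m "initial commit"

# A second commit that messes up the file.
echo "bad line" > README.md
git commit -q -a -m "mess up README"

# Look up the abbreviated hash of the most recent commit and undo its changes.
bad_sha=$(git rev-parse --short HEAD)
git revert --no-edit "$bad_sha" > /dev/null

# The file is restored, and the log has gained a commit rather than lost one.
restored=$(cat README.md)
commit_count=$(git rev-list --count HEAD)
echo "README.md now reads: $restored"
echo "commits in the log: $commit_count"
```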



What’s a GitHub?


Image by Atlassian / CC BY

The origin is the central copy of the project, a repository that lives on GitHub. Every member of the team uses a local copy of the entire project, called a clone.



Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to share her own work.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is viewing the same commit as the origin.



A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.





A push copies local commits to the origin and applies them remotely.






Create a GitHub Repository

  1. Sign in or create a GitHub account.

  2. Create a new repository on your GitHub page.

     1. Give the repo a name
     2. Add a short “tag line” to jog your memory
     3. Leave the boxes (including the “README”) un-checked

Empty repository

You have created an empty repository. The quick start information provides clues on how to create your first commit.

Configure your clone

To push and pull from your local repo to GitHub, you must configure your local repo with the URL of the remote repo. By convention, we call the central copy the “origin”.

git remote add origin <URL>

Push your commit up to the origin.

git push -u origin master
Username for 'https://github.com': <username>
Password for 'https://<username>@github.com': 
Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 2.10 KiB | 1.05 MiB/s, done.
Total 9 (delta 6), reused 0 (delta 0)
remote: Resolving deltas: 100% (6/6), completed with 6 local objects.
To <url>
   <sha>..<sha>  master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
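
If a push fails, a quick check is whether the remote was recorded as intended. A minimal sketch, using a made-up URL in a throwaway repository; substitute the URL that GitHub shows for your own repository.

```shell
# Throwaway repository with a hypothetical remote URL. No network access is needed.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git remote add origin "https://github.com/example-user/example-repo.git"

# List every configured remote with its fetch and push URLs.
git remote -v

# Read the stored URL for "origin" directly from the repository configuration.
origin_url=$(git config --get remote.origin.url)
echo "origin points to: $origin_url"
```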

Take a look at the repository on GitHub.

GitHub Editor

The online editor is good for quick-n-easy fixes, and for working on documentation. It’s a bad place to modify code, because the code is not tested before reaching the origin. Nevertheless … try it out on README.md.



Merging

An essential component of the centralized workflow is the ability to merge commit histories that have diverged. Each fork in the log has to be re-integrated, and git does this automatically through merging.

git add worksheet*
git commit -m 'feel the learn'
[master <sha>] feel the learn
 5 files changed, 955 insertions(+)

Merge commits most commonly arise when a commit shows up on GitHub that isn’t in your local clone, as in the current situation.



Even though these changes do not conflict, GitHub won’t allow you to push. Take a moment to read the message; it gives a good explanation of what has happened.

git push
To https://github.com/<username>/<repo>.git
 ! [rejected]        master -> master (fetch first)
 error: failed to push some refs to 'https://github.com/<username>/<repo>.git'
 hint: Updates were rejected because the remote contains work that you do
 hint: not have locally. This is usually caused by another repository pushing
 hint: to the same ref. You may want to first integrate the remote changes
 hint: (e.g., 'git pull ...') before pushing again.
 hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Take the Hint!

git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/<username>/<repo>
   15bc488..26c2dcd  master     -> origin/master
Auto-merging README.md
Merge made by the 'recursive' strategy.
 README.md | 1 +
 1 file changed, 1 insertion(+)

The message tells you about any changes made by this merge commit, which seamlessly integrates changes to the same file by multiple authors.
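
The rejected push and the merging pull can be reproduced without GitHub. The sketch below uses a local bare repository as a stand-in for the origin and two clones named alice and bob; these names, the commit_file helper, and the author identity are all invented for illustration. Recent versions of git may refuse a plain `git pull` on diverged histories until you choose how to reconcile them, so the sketch passes `--no-rebase` to request a merge explicitly.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: write a one-line file in a clone and commit it.
commit_file() {  # usage: commit_file <clone> <file> <text>
  echo "$3" > "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" -c user.name="Example Author" -c user.email="author@example.com" \
    commit -q -m "add $2"
}

# alice starts the project and publishes it; bob clones it.
git init -q alice
git -C alice symbolic-ref HEAD refs/heads/master
git -C alice remote add origin "$sandbox/hub.git"
commit_file alice README.md "hello"
git -C alice push -q -u origin master
git clone -q hub.git bob

# Histories diverge: bob pushes first, then alice commits without pulling.
commit_file bob bob.txt "from bob"
git -C bob push -q origin master
commit_file alice alice.txt "from alice"

# The hub rejects alice's push because it holds a commit she lacks.
if git -C alice push -q origin master 2>/dev/null; then
  push_result="accepted"
else
  push_result="rejected"
fi

# Pulling creates a merge commit; after that the push goes through.
git -C alice -c user.name="Example Author" -c user.email="author@example.com" \
  pull -q --no-rebase --no-edit origin master
git -C alice push -q origin master

# A merge commit is recognizable by having two parents.
parent_count=$(git -C alice rev-list --parents -n 1 HEAD | awk '{print NF - 1}')
echo "first push: $push_result"
echo "parents of newest commit: $parent_count"
```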



Files under version control

A scripted pipeline relies heavily on plain text files (the scripts), but may include different file types for figures or even some data.

Not under version control

The most common component left out of version control is shared data storage, which holds data that is either too large to include or not stored as flat files.

Path to shared data

The best practice is to create a shortcut to shared data files and reference the local shortcut in code. For example, use the ln command in the shell to create a symbolic link, which behaves like a folder at a relative path inside your repository.

mkdir -p ~/"Google Drive"
ln -s ~/"Google Drive" cloud
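
To see how such a link behaves without touching your real cloud folder, the sketch below builds a stand-in shared folder in a temporary location. The folder and file names are hypothetical.

```shell
# A temporary stand-in for a synced cloud folder. All names are hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p "shared drive/project-data"
echo "x,y" > "shared drive/project-data/table.csv"

# Inside the project folder, "cloud" becomes a link to the shared data.
mkdir repo
cd repo
ln -s "$sandbox/shared drive/project-data" cloud

# Code can now use the short relative path, wherever the data really lives.
link_target=$(readlink cloud)
first_line=$(head -n 1 cloud/table.csv)
echo "cloud -> $link_target"
echo "first line of cloud/table.csv: $first_line"
```

Each collaborator can point their own link at wherever the shared folder sits on their machine, while scripts keep using the same relative path.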

Ignore Exclusions

data/**
cloud/**
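
The two patterns above read like entries in a .gitignore file, the plain-text list of paths that git should leave untracked. The sketch below tries the first pattern in a throwaway repository; the file names are made up, and `git check-ignore` is used only to ask git whether a path matches an ignore rule.

```shell
# Throwaway repository with a data folder that should stay untracked.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
mkdir data
echo "1,2,3" > data/big.csv
echo "plot(1)" > analysis.R

# Record the ignore pattern in .gitignore at the top of the repository.
printf 'data/**\n' > .gitignore

# Untracked files that git still offers to add; nothing under data/ is listed.
untracked=$(git status --porcelain)
echo "$untracked"

# check-ignore exits with status 0 when a path matches an ignore rule.
if git check-ignore -q data/big.csv; then
  data_ignored="yes"
else
  data_ignored="no"
fi
echo "data/big.csv ignored: $data_ignored"
```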



Working with Collaborators

True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.

Centralized workflows, managed by git, help solve these challenges.

Project Integrity

Note, version control works really well with text. Non-textual components of your project (e.g. large or binary data) rarely live in a repository. Use cloud storage for more static files and a database for dynamic records.

Create a new file

Create a new text file as below, adding yourself as the first collaborator.

## Project Collaborators

- ...
- My neighbor!

Our aim is to let your project collaborator replace “My neighbor!” with his or her name.

Track it with git

Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) to watch it.

git add collaborators.md
git commit -m 'just me so far!'

Push

Look at the git status and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.

Collaborate!

The first step to collaborative workflows is granting access to the origin of your project. Introduce yourself to your neighbor, and ask for his/her GitHub username.

Add your neighbor as a collaborator, and accept your neighbor’s invitation to collaborate!

As a collaborator on your neighbor’s repository, you have permission to edit their collaborators.md.

The text below shows “My neighbor!” where you should see your neighbor’s name. Edit the file in your neighbor’s repo by replacing the remaining “My neighbor!” with your own name.

## Project Collaborators

- My neighbor!
- ...

Write a meaningful commit message to save your work.

Integrate your Collaborator’s work

If you have no uncommitted work in your tracked files, you can pull down the new commit from your neighbor and “fast-forward” your project.

git pull
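
A sketch of what “fast-forward” means, again with a local bare repository standing in for GitHub. The folder names (you, neighbor), the identities, and the file contents are placeholders, and the neighbor edits the file in a separate clone here rather than in the GitHub editor. The `--ff-only` option asks git to stop rather than create a merge commit if the histories have diverged.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# Your repository, published to the hub.
git init -q you
git -C you symbolic-ref HEAD refs/heads/master
git -C you config user.name "You Example"
git -C you config user.email "you@example.com"
printf '## Project Collaborators\n\n- You\n- My neighbor!\n' > you/collaborators.md
git -C you add collaborators.md
git -C you commit -q -m "just me so far!"
git -C you remote add origin "$sandbox/hub.git"
git -C you push -q -u origin master

# Your neighbor edits the file in a separate clone and pushes.
git clone -q hub.git neighbor
git -C neighbor config user.name "Neighbor Example"
git -C neighbor config user.email "neighbor@example.com"
printf '## Project Collaborators\n\n- You\n- Neighbor\n' > neighbor/collaborators.md
git -C neighbor commit -q -a -m "add my name"
git -C neighbor push -q origin master

# You have no new commits of your own, so the pull simply moves you forward.
git -C you pull -q --ff-only

# No merge commit was needed: the newest commit has a single parent.
parent_count=$(git -C you rev-list --parents -n 1 HEAD | awk '{print NF - 1}')
last_line=$(tail -n 1 you/collaborators.md)
echo "parents of newest commit: $parent_count"
echo "last line of collaborators.md: $last_line"
```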



Share and Share Back
