git and More Tools in the Shell
Lesson 5 with Ian Carroll
Contents
- Centralized Workflow
- Objectives for this lesson
- Git in the Shell
- What’s a GitHub?
- Create a GitHub Repository
- Merging
- Files under version control
- Working with Collaborators
- Share and Share Back
Centralized Workflow
As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.
This lesson explains a high level strategy for organizing your collaborative workflow and introduces accompanying software and cloud solutions.
The strategy for distributed coding among a team of scientists—the centralized workflow—is widespread in collaborative research.
A central hub stores project files and their history. Researchers are spokes on the wheel, working on private copies of the project. Project integrity is maintained through rules enforced by the hub for synchronizing between hub and spokes.
Objectives for this lesson
- See what version control does
- Learn about centralized workflows
- Try out GitHub
Specific achievements
- Make “commits” to a project file with git
- “Push” and “pull” project work to GitHub
- “Merge” your work with a GitHub collaborator’s
Git in the Shell
The namesake of GitHub is the command-line utility git. It performs the clone, push, pull, and merge procedures just mentioned, and many more.
The software has no GUI of its own and works through commands, always beginning with “git ”, given in the shell. The command to turn the “current folder” into a git repo is:
git init
Initialized empty Git repository in ~/handouts/.git/
Changes are committed with a short but descriptive commit message. Before committing, add files to git’s watchlist with the “add” command.
git add README.md
git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: README.md
Untracked files:
“Commit” updates the added files in a newly labeled version of your project’s history.
git commit -m "initial commit"
*** Please tell me who you are.
Run
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
to set your account's default identity.
Omit --global to set the identity only in this repository.
fatal: empty ident name (for <(null)>) not allowed
Every commit needs an author. Follow git’s instructions, using a real email address so your commits can be associated with your GitHub account, and try again.
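As git's message notes, the identity can also be set for a single repository by omitting --global. The following is a minimal sketch of that variant, assuming a scratch repository created just for the demonstration; the name and email address are placeholders to be replaced with your own.

```shell
# Scratch repository so the sketch does not touch a real project.
repo=$(mktemp -d)
git init -q "$repo"

# Set the identity for this repository only (no --global).
# Both values are placeholders; substitute your own.
git -C "$repo" config user.name "Your Name"
git -C "$repo" config user.email "you@example.com"

# Read the values back to confirm what git will record as the author.
git -C "$repo" config user.name
git -C "$repo" config user.email
```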
git commit -m "initial commit"
[master (root-commit) <sha>] initial commit
1 file changed, 10 insertions(+)
create mode 100755 README.md
git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
CONTRIBUTING.md
data/
handouts.Rproj
worksheet-1.R
worksheet-2.R
worksheet-3.R
worksheet-4.R
nothing added to commit but untracked files present (use "git add" to track)
Checkout the Log
Version control gives you access to the state of the repository at any previous commit. View this history in the log.
git log
commit <sha>
Author: <author>
Date: <datetime>
initial commit
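Once a project has more than a handful of commits, a condensed log is often easier to scan. The sketch below builds a throwaway repository with two commits and prints one line per commit. The file contents, commit messages, and identity are placeholders invented for this example.

```shell
# Build a throwaway repository with two commits.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"         # placeholder identity
git -C "$repo" config user.email "you@example.com"  # placeholder identity

echo "first draft" > "$repo/README.md"
git -C "$repo" add README.md
git -C "$repo" commit -q -m "initial commit"

echo "second draft" > "$repo/README.md"
git -C "$repo" commit -q -am "update README"

# One line per commit: abbreviated hash followed by the message.
git -C "$repo" log --oneline
```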
Exercise 1
Introduce a second commit that messes up your README.md or another file. Make sure it shows up in the log.
Revert
Let’s investigate the most recent commit.
git show
commit <sha>
Author: <author>
Date: <datetime>
<message>
<diff>
The revert command undoes an earlier commit by adding a new commit that reverses its changes.
git revert --no-edit <sha>
[master <sha>] Revert <message>
1 file changed, 1 insertion(+), 1 deletion(-)
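To see the effect of a revert from start to finish, the sketch below replays it in a scratch repository. It is only an illustration: the file contents, commit messages, and identity are made up, and HEAD is used in place of an explicit commit hash to refer to the most recent commit.

```shell
# Scratch repository with one good commit and one bad one.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"         # placeholder identity
git -C "$repo" config user.email "you@example.com"  # placeholder identity

echo "good content" > "$repo/README.md"
git -C "$repo" add README.md
git -C "$repo" commit -q -m "initial commit"

echo "oops" > "$repo/README.md"
git -C "$repo" commit -q -am "mess up the README"

# Undo the most recent commit by adding a new commit that reverses it.
git -C "$repo" revert --no-edit HEAD

# The file has its earlier content again; the bad commit stays in the log.
cat "$repo/README.md"
git -C "$repo" log --oneline
```

Because the revert is itself a commit, nothing is erased from the history; the log lists the original commit, the mistake, and the revert.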
What’s a GitHub?
The origin is the central copy of the project, a repository that lives on GitHub. Every member of the team uses a local copy of the entire project, called a clone.
Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to share her own work.
A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is viewing the same commit as the origin.
A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up.
A push copies local commits to the origin and applies them remotely.
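The push and pull cycle can be tried without a GitHub account. The sketch below is a minimal illustration, assuming a bare repository in a scratch directory stands in for the origin. The names hub.git, alice, and bob, along with the identity values, are placeholders invented for this example.

```shell
# Scratch area and a placeholder identity for every commit in this sketch.
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Your Name" GIT_AUTHOR_EMAIL="you@example.com"
export GIT_COMMITTER_NAME="Your Name" GIT_COMMITTER_EMAIL="you@example.com"

# The "origin": a bare repository standing in for GitHub.
git init -q --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master

# Alice starts the project locally and pushes it to the origin.
git init -q "$work/alice"
git -C "$work/alice" symbolic-ref HEAD refs/heads/master
echo "# Project" > "$work/alice/README.md"
git -C "$work/alice" add README.md
git -C "$work/alice" commit -q -m "initial commit"
git -C "$work/alice" remote add origin "$work/hub.git"
git -C "$work/alice" push -q -u origin master

# Bob clones: the initial pull of the whole project and its history.
git clone -q "$work/hub.git" "$work/bob"

# Bob commits and pushes; Alice pulls to sync up.
echo "- Bob" >> "$work/bob/README.md"
git -C "$work/bob" commit -q -am "add Bob"
git -C "$work/bob" push -q origin master
git -C "$work/alice" pull -q --ff-only origin master

# Both clones now point at the same commit as the origin.
git -C "$work/alice" rev-parse HEAD
git -C "$work/bob" rev-parse HEAD
```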
Create a GitHub Repository
- Sign in or create a GitHub account.
- Create a new repository on your GitHub page.
  - Give the repo a name
  - Add a short “tag line” to jog your memory
  - Leave the boxes (including the “README”) un-checked
Empty repository
You have created an empty repository. The quick start information provides clues on how to create your first commit.
Configure your clone
To push and pull from your local repo to GitHub, you must configure your local repo with the URL of the remote repo. By convention, we call the central copy the “origin”.
git remote add origin <URL>
Push your commit up to the origin.
git push -u origin master
Username for 'https://github.com': <username>
Password for 'https://<username>@github.com':
Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 2.10 KiB | 1.05 MiB/s, done.
Total 9 (delta 6), reused 0 (delta 0)
remote: Resolving deltas: 100% (6/6), completed with 6 local objects.
To <url>
<sha>..<sha> master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
Take a look at the repository on GitHub.
- README.md is a Markdown file giving basic information about the repository.
- There is a list of files, including a folder for data.
- You are looking at a branch called “master”.
- The commit history is available from the top bar.
- The “Clone or download” button provides a URL.
GitHub Editor
The online editor is good for quick and easy fixes and for working on documentation. It’s a bad place to modify code, because the code is not tested before reaching the origin. Nevertheless … try it out on README.md.
Merging
An essential component of the centralized workflow is the ability to merge commit histories that have diverged. Each fork in the log has to be re-integrated, and git does this automatically through merging.
git add worksheet*
git commit -m 'feel the learn'
[master <sha>] feel the learn
5 files changed, 955 insertions(+)
Merge commits most commonly arise when a commit shows up on GitHub that isn’t in your local clone, as in the current situation.
Even though these changes do not conflict, GitHub won’t allow you to push. Take a moment to read the message; it gives a good explanation of what has happened.
git push
To https://github.com/<username>/<repo>.git
! [rejected] master -> master (fetch first)
error: failed to push some refs to 'https://github.com/<username>/<repo>.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
Take the Hint!
git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/<username>/<repo>
15bc488..26c2dcd master -> origin/master
Auto-merging README.md
Merge made by the 'recursive' strategy.
README.md | 1 +
1 file changed, 1 insertion(+)
The message tells you about any changes made by this merge commit, which seamlessly integrates changes to the same file by multiple authors.
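The rejected push and the merge that follows can be reproduced offline. The sketch below is an illustration under stated assumptions: a bare repository stands in for GitHub, a second clone named theirs stands in for the edit made in the online editor, and all paths, file contents, and identity values are placeholders. The --no-rebase and --no-edit options ask git to merge without opening an editor.

```shell
# Scratch area and a placeholder identity for every commit in this sketch.
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Your Name" GIT_AUTHOR_EMAIL="you@example.com"
export GIT_COMMITTER_NAME="Your Name" GIT_COMMITTER_EMAIL="you@example.com"

# Origin plus a local repository that has pushed one commit.
git init -q --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master
git init -q "$work/mine"
git -C "$work/mine" symbolic-ref HEAD refs/heads/master
echo "# Project" > "$work/mine/README.md"
git -C "$work/mine" add README.md
git -C "$work/mine" commit -q -m "initial commit"
git -C "$work/mine" remote add origin "$work/hub.git"
git -C "$work/mine" push -q -u origin master

# A second clone stands in for the edit made on GitHub.
git clone -q "$work/hub.git" "$work/theirs"
echo "Edited online" >> "$work/theirs/README.md"
git -C "$work/theirs" commit -q -am "edit README on GitHub"
git -C "$work/theirs" push -q origin master

# Meanwhile, a local commit touches a different file.
echo "x <- 1" > "$work/mine/worksheet-1.R"
git -C "$work/mine" add worksheet-1.R
git -C "$work/mine" commit -q -m "feel the learn"

# The push is rejected because the origin has a commit we lack.
git -C "$work/mine" push -q origin master || echo "push rejected, pulling first"

# Pull fetches the missing commit and merges it; then the push succeeds.
git -C "$work/mine" pull -q --no-rebase --no-edit origin master
git -C "$work/mine" push -q origin master
git -C "$work/mine" log --oneline --graph
```

The graph view of the log shows the two lines of work forking after the initial commit and rejoining at the merge commit.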
Files under version control
A scripted pipeline relies heavily on plain text files (the scripts), but may include different file types for figures or even some data.
- Any file in the directory that is under version control is monitored for differences from the committed state of the project.
- Files must be added to at least one commit before they are tracked.
Not under version control
The most common pipeline integration is shared data storage, which is either too large to include or not a flat file.
- Local area network file share (e.g. Z:\\…)
- Cloud storage (e.g. Dropbox, Google Drive)
- Database (e.g. lab PostgreSQL server)
Path to shared data
The best practice is to create a shortcut to shared data files, and reference the local shortcut in code. For example, use the ln command in the shell to create a symbolic link (a “fake folder”) at a relative path inside your repository.
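# Create the shared folder if it does not exist (the tilde must sit outside the quotes to expand).
mkdir -p ~/"Google Drive"
# From inside the repository, link the shared folder under the relative name "cloud".
ln -s ~/"Google Drive" cloud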
Ignore Exclusions
data/**
cloud/**
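One way to check that exclusion patterns behave as intended is git's check-ignore command. The sketch below assumes the two patterns live in a .gitignore file at the top of the repository, which is the usual location but is not stated above; the file names are placeholders.

```shell
# Scratch repository with a data folder and one script.
repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/data"
touch "$repo/data/big.csv" "$repo/analysis.R"

# Assumed location for the patterns: .gitignore at the repository root.
printf 'data/**\ncloud/**\n' > "$repo/.gitignore"

# check-ignore -v names the pattern that excludes a path.
git -C "$repo" check-ignore -v data/big.csv

# Ignored files are absent from the list of untracked files.
git -C "$repo" status --short
```

Git records a symbolic link as a single file rather than as a folder, so if cloud is a link, it is worth running check-ignore on it as well; a plain cloud pattern may be needed to exclude the link itself.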
Working with Collaborators
True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.
- Be it data, a script, or a write-up, who has the most up-to-date version?
- Will a teammate’s work overwrite any of your own?
- How do I recover the working version of code the PI broke?
Centralized workflows, managed by git, help solve these challenges.
Project Integrity
- The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
- Diverging files are easily reintegrated with a merge algorithm.
- The complete project history is available to checkout.
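As a concrete case of using the project history, the sketch below recovers the working version of a script after a later commit broke it. It is a minimal illustration, assuming the bad change is the most recent commit so that HEAD~1 names the last good state; the file name, contents, and identity are placeholders.

```shell
# Scratch repository: a working script, then a commit that breaks it.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"         # placeholder identity
git -C "$repo" config user.email "you@example.com"  # placeholder identity

echo "working code" > "$repo/analysis.R"
git -C "$repo" add analysis.R
git -C "$repo" commit -q -m "working version"

echo "broken code" > "$repo/analysis.R"
git -C "$repo" commit -q -am "edits that break the script"

# Look at the file as it was one commit ago, without changing anything.
git -C "$repo" show HEAD~1:analysis.R

# Bring that earlier version back into the working tree and commit it.
git -C "$repo" checkout HEAD~1 -- analysis.R
git -C "$repo" commit -q -m "restore working analysis.R"
```

Unlike a revert, this restores a single file from a chosen commit and leaves any other changes in the later commit untouched.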
Note, version control works really well with text. Non-textual components of your project (e.g. large or binary data) rarely live in a repository. Use cloud storage for more static files and a database for dynamic records.
Create a new file
Create a new text file named collaborators.md as below, adding yourself as the first collaborator.
## Project Collaborators
- ...
- My neighbor!
Our aim is to let your project collaborator replace “My neighbor!” with his or her name.
Track it with git
Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) to watch it.
git add collaborators.md
git commit -m 'just me so far!'
Push
Look at the git status and notice that your branch is ahead of origin/master! Push those commit(s) to your GitHub repo.
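The "ahead" count can also be read directly. The sketch below is an illustration, assuming a bare repository in a scratch directory plays the role of the GitHub origin; the paths, file contents, and identity are placeholders, and the variable names ahead_before and ahead_after are invented for this example.

```shell
# Scratch area and a placeholder identity for every commit in this sketch.
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Your Name" GIT_AUTHOR_EMAIL="you@example.com"
export GIT_COMMITTER_NAME="Your Name" GIT_COMMITTER_EMAIL="you@example.com"

# Origin plus a local repository that has pushed one commit.
git init -q --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master
git init -q "$work/mine"
git -C "$work/mine" symbolic-ref HEAD refs/heads/master
echo "# Project" > "$work/mine/README.md"
git -C "$work/mine" add README.md
git -C "$work/mine" commit -q -m "initial commit"
git -C "$work/mine" remote add origin "$work/hub.git"
git -C "$work/mine" push -q -u origin master

# A new local commit that the origin has not seen yet.
printf '## Project Collaborators\n- Me\n- My neighbor!\n' > "$work/mine/collaborators.md"
git -C "$work/mine" add collaborators.md
git -C "$work/mine" commit -q -m "just me so far!"

# Short status: the branch line notes how far ahead of origin/master you are.
git -C "$work/mine" status -sb

# Count the commits that exist locally but not on the origin.
ahead_before=$(git -C "$work/mine" rev-list --count origin/master..HEAD)
git -C "$work/mine" push -q origin master
ahead_after=$(git -C "$work/mine" rev-list --count origin/master..HEAD)
echo "ahead before push: $ahead_before, after push: $ahead_after"
```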
Collaborate!
The first step to collaborative workflows is granting access to the origin of your project. Introduce yourself to your neighbor, and ask for his/her GitHub username.
Add your neighbor as a collaborator, and accept your neighbor’s invitation to collaborate!
As a collaborator on your neighbor’s repository, you have permission to edit their collaborators.md.
The text below shows “My neighbor!” where you should see your neighbor’s name. Edit the file in your neighbor’s repo, replacing the remaining “My neighbor!” with your own name.
## Project Collaborators
- My neighbor!
- ...
Write a meaningful commit message to save your work.
Integrate your Collaborator’s work
If you have no uncommitted work in your tracked files, you can pull down the new commit from your neighbor and “fast-forward” your project.
git pull
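Before pulling, it can help to preview what a collaborator has pushed. The sketch below is a minimal illustration, assuming a bare repository stands in for the origin and a clone named neighbor stands in for your collaborator; all paths, names, and identity values are placeholders. The --ff-only option makes the pull stop rather than merge if the histories have diverged.

```shell
# Scratch area and a placeholder identity for every commit in this sketch.
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Your Name" GIT_AUTHOR_EMAIL="you@example.com"
export GIT_COMMITTER_NAME="Your Name" GIT_COMMITTER_EMAIL="you@example.com"

# Origin plus your repository with collaborators.md pushed.
git init -q --bare "$work/hub.git"
git --git-dir="$work/hub.git" symbolic-ref HEAD refs/heads/master
git init -q "$work/mine"
git -C "$work/mine" symbolic-ref HEAD refs/heads/master
printf '## Project Collaborators\n- Me\n- My neighbor!\n' > "$work/mine/collaborators.md"
git -C "$work/mine" add collaborators.md
git -C "$work/mine" commit -q -m "just me so far!"
git -C "$work/mine" remote add origin "$work/hub.git"
git -C "$work/mine" push -q -u origin master

# The neighbor clones, replaces the placeholder line, and pushes.
git clone -q "$work/hub.git" "$work/neighbor"
printf '## Project Collaborators\n- Me\n- Neighbor Name\n' > "$work/neighbor/collaborators.md"
git -C "$work/neighbor" commit -q -am "add my name"
git -C "$work/neighbor" push -q origin master

# Fetch downloads the new commit without changing your files.
git -C "$work/mine" fetch -q origin

# List the commits that the origin has and you do not.
git -C "$work/mine" log --oneline HEAD..origin/master

# Fast-forward your branch to include them.
git -C "$work/mine" pull -q --ff-only origin master
cat "$work/mine/collaborators.md"
```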
Share and Share Back
- The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only to your collaborators, but also available for review and extension by your research community.
- GitHub is home to much of the open source software, including R and Python packages, that helps research advance. Through GitHub you can track issues with software you use, pitch in on solving problems, and even submit “pull requests” for new features you develop.