Effective at SESYNC's closure in December 2022, this page is no longer maintained. The information may be out of date or inaccurate.

Git and the Centralized Workflow

Note: This lesson is in beta status! It may have issues that have not been addressed.

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.


Centralized Workflow

As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.

This lesson explains a high-level strategy for organizing your collaborative workflow and introduces accompanying software and cloud solutions. This strategy for distributed work on a shared codebase, the centralized workflow, is widespread in collaborative research.

A central hub stores project files and their history. Researchers are spokes on the wheel, working on private copies of the project. Project integrity is maintained through rules enforced by the hub for synchronizing between hub and spokes.



Objectives for this lesson

  • See what version control does
  • Learn about centralized workflows
  • Try out GitHub

Specific achievements

  • Make “commits” to a project file with git
  • “Push” and “pull” project work to GitHub
  • “Merge” your work with a GitHub collaborator’s



Git in the Shell

The namesake of GitHub is the command-line utility git. It performs the clone, push, pull, and merge procedures just mentioned, and many more.

When using git from the command line, you issue commands through the Unix shell. These commands have their own special syntax. If you aren’t familiar with Unix shell commands, you might want to look at this lesson from Software Carpentry. Or check out explainshell.com, which is a handy tool that gives you the help text associated with specific shell commands, including git commands.

Note on terminology and configuration

As of October 1, 2020, all new repositories created on GitHub will have a default branch called main. Previously, the default name was master. The change was made to promote inclusive language in the version control world. SESYNC is planning to update the GitLab server to match this new default. However, the git client will still default to master if you create a repository locally, unless you configure it as described below. You should also be aware that any documentation, tutorial, or StackOverflow post written before 2020 will assume your default branch is called master.

We recommend setting the default branch name for new repositories you create locally to main. Enter the following into your terminal prompt.

git config --global init.defaultBranch main

This option is available for git version 2.28 or later.
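As a minimal sketch of what this setting does, the commands below apply it and then create a repository to confirm the initial branch name. The throwaway HOME directory and the demo-project folder are assumptions made only so the demonstration leaves your real git configuration untouched.

```shell
set -eu
# Assumption for the demo: use a scratch HOME so the real ~/.gitconfig is not modified.
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME
git config --global init.defaultBranch main

# Any repository created from now on starts on a branch called main.
repo="$HOME/demo-project"    # hypothetical project folder
git init -q "$repo"
branch=$(git -C "$repo" symbolic-ref --short HEAD)
echo "default branch: $branch"
```

On your own machine you would only run the git config line; the rest is scaffolding for the check.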

In this lesson git is used without a GUI: it works through shell commands that always begin with git. For example, the command to turn the “current folder” into a git repo is git init. You would run git init locally from an existing folder containing project code.

cd <path to directory>
git init

Add files to git’s watchlist with the “add” command. This action is also known as “staging” files.

git add <path to files>
git status

You can stage all new and modified files in the current directory at once with git add . (the trailing dot means “this directory”).
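To see staging in action, here is a small sketch in a scratch directory. The file names data.csv and analysis.R are made up for illustration. The short status format marks staged new files with A and untracked files with ??.

```shell
set -eu
cd "$(mktemp -d)"                 # scratch directory for the demo
git init -q
echo "x,y" > data.csv             # hypothetical project files
echo "plot(1)" > analysis.R

git add data.csv                  # stage one file only
status=$(git status --short)
echo "$status"

git add .                         # stage everything else in this directory
status_all=$(git status --short)
echo "$status_all"
```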

“Commit” records the added (staged) files as a new, labeled version in your project’s history.

git commit -m "initial commit"
*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <(null)>) not allowed

The above error message appears if you have not yet told git on your local machine who you are, that is, the name and email address to record as the author of your commits.

Every commit needs an author. Follow git’s instructions, using a real email address so your commits can be associated with your GitHub account, and try again.

git commit -m "initial commit"
git status

Now, author information will be associated with any commits you make. This is a one-time configuration for each computer on which you use git.
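As git’s message notes, omitting --global sets the identity for a single repository only. A minimal sketch, using the same placeholder name and email as above, shows the identity being attached to a commit:

```shell
set -eu
cd "$(mktemp -d)"                          # scratch repository for the demo
git init -q
# Placeholder identity, stored in this repository's .git/config only.
git config user.name "Your Name"
git config user.email "you@example.com"

echo "hello" > notes.txt                   # hypothetical file
git add notes.txt
git commit -q -m "initial commit"
author=$(git log -1 --format='%an <%ae>')  # author name and email of the latest commit
echo "$author"
```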

Saving, staging, and committing are each separate steps, none of which implies any of the others. This may seem like a hassle, but it is a good thing! As your project grows larger, you will frequently save changes you don’t want to commit: staging lets you choose which changes get packaged into a commit.
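The sketch below illustrates that separation under made-up file names (model.R and scratch.txt): both files are saved, but only the staged one ends up in the commit, and the other remains a pending modification.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Your Name"; git config user.email "you@example.com"
echo "version 1" > model.R
echo "version 1" > scratch.txt
git add . && git commit -q -m "initial commit"

echo "version 2 (improved)" > model.R        # saved, and ready to commit
echo "version 2 (half done)" > scratch.txt   # saved, but not ready
git add model.R                              # stage only the finished change
git commit -q -m "update model"

committed=$(git diff-tree --no-commit-id --name-only -r HEAD)  # files in the last commit
pending=$(git status --short)                                  # changes still uncommitted
echo "committed: $committed"
echo "pending:  $pending"
```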

Look at the Log

Version control gives you access to the state of the repository at any previous commit. View this history in the log.

git log
commit <sha>
Author: <author>
Date:   <datetime>

    initial commit
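For a longer history, a condensed view is often easier to scan. The following sketch builds a two-commit history in a scratch repository (the file and messages are invented for the demo) and prints one line per commit, newest first.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Your Name"; git config user.email "you@example.com"
echo "a <- 1" > script.R
git add script.R; git commit -q -m "initial commit"
echo "a <- 1; b <- 2" > script.R
git add script.R; git commit -q -m "add b"

git log --oneline                         # abbreviated sha and message for each commit
n_commits=$(git rev-list --count HEAD)    # number of commits reachable from HEAD
latest=$(git log -1 --format=%s)          # subject line of the newest commit
```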

Exercise 1

Edit your committed file with some small, breaking change. Create a second commit that includes this change, and make sure it shows up in the log.

Revert

Let’s investigate the most recent commit.

git show
commit <sha>
Author: <author>
Date:   <datetime>

    <message>

<diff>

The <sha>, or however many digits of it are needed, provides a unique label for each commit. Use “revert” to undo the changes introduced in a specified commit.

git revert --no-edit <sha>
[main <sha>] Revert <message>
 1 file changed, 1 insertion(+), 1 deletion(-)
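Here is a self-contained sketch of the same sequence in a scratch repository. The file contents are invented; the point is that revert adds a new commit that undoes the named one, rather than deleting history.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Your Name"; git config user.email "you@example.com"
echo "x <- 1" > script.R
git add script.R; git commit -q -m "initial commit"
echo "x <- oops" > script.R
git commit -q -am "breaking change"

bad=$(git rev-parse --short HEAD)      # abbreviated sha of the commit to undo
git revert --no-edit "$bad"            # creates a third commit that reverses it

content=$(cat script.R)
n_commits=$(git rev-list --count HEAD)
subject=$(git log -1 --format=%s)
```

The log now holds three commits: the original, the breaking change, and the revert.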



A Plug for Reproducible Research

Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.

The principle applies equally to modeling, analysis, and perhaps most of all to data synthesis.

Hallmarks of reproducible research:

  • Reviewable: All details of the method used are easily accessible for peer review and community review.
  • Auditable: Records exist to document how the methods and conclusions evolved, but may be private.
  • Replicable: Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork.
  • Open: The originator grants permissions for reuse and extension of the research products.

Let your workflow help achieve these same goals:

  • Thoroughly comment scripts and share continuously with collaborators (Reviewable)
  • Maintain project history to correct mistakes when necessary (Auditable)
  • Provide “one-click” file and data sharing of a streamlined analysis “pipeline” (Replicable)
  • Publicly release on GitHub (or similar) with (implied) open licensing (Open)



What’s a GitHub?


Image by Atlassian / CC BY

The origin is the central copy of the project, a repository that lives on GitHub. Every member of the team uses a local copy of the entire project, called a clone.


Image by Atlassian / CC BY

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to share her own work.

A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is viewing the same commit as the origin.

Notice that the local and remote (origin) repos are both on a branch called main in the diagram below. This is the default name given to the primary version of the repository.


Image by Atlassian / CC BY

When the origin has commits that do not exist in the local repo, it has gotten ahead and a pull is required to synchronize state.


Image by Atlassian / CC BY

A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up as if you had created identical commits locally.


Image by Atlassian / CC BY

In the opposite situation, commits created locally are not immediately synchronized to the origin.


Image by Atlassian / CC BY

A push copies local commits to the origin and applies them remotely.


Image by Atlassian / CC BY
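The hub-and-spoke picture can be tried out entirely on one machine. In this sketch a bare repository in a scratch directory stands in for the origin on GitHub, and two hypothetical teammates (ana and ben) are the spokes. It assumes git 2.28 or later for the -b option.

```shell
set -eu
work=$(mktemp -d); cd "$work"
git init -q --bare -b main hub.git        # local stand-in for the origin on GitHub

# First spoke: create the project locally and push it to the hub.
git init -q -b main ana
cd ana
git config user.name "Ana"; git config user.email "ana@example.com"
echo "# Project" > README.md
git add README.md; git commit -q -m "initial commit"
git remote add origin "$work/hub.git"
git push -q -u origin main

# Second spoke: a fresh clone starts at the same commit as the origin.
git clone -q "$work/hub.git" "$work/ben"
cd "$work/ben"
git config user.name "Ben"; git config user.email "ben@example.com"
echo "- Ben" >> README.md
git commit -q -am "add Ben"
ahead=$(git rev-list --count origin/main..HEAD)    # local commits not yet on the origin
git push -q                                        # copy them to the hub

# Back in the first spoke, the origin has now gotten ahead.
cd "$work/ana"
git fetch -q
behind=$(git rev-list --count HEAD..origin/main)   # commits on the origin missing locally
git pull -q --ff-only                              # synchronize
ana_head=$(git rev-parse HEAD)
hub_head=$(git -C "$work/hub.git" rev-parse main)
```

After the final pull, both spokes and the hub point at the same commit.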



Create a GitHub Repository

  1. Sign in or create a GitHub account.

  2. Create a personal access token.

IMPORTANT: As of August 2021, GitHub requires a personal access token, rather than your account password, to authenticate git operations over HTTPS such as pushing to a remote repo. The link above is to a GitHub documentation page with very detailed instructions on how to navigate to the settings page where you can generate a token. When you are prompted to select the scopes (permissions) to give the token, check the box marked repo. After you generate the token, save it in a safe place; you will need it in a moment. The best place to save it long-term is a password manager such as LastPass.

  3. Create a new repository on your GitHub page.

  1. Give the repo a name
  2. Add a short “tag line” to jog your memory
  3. Do not check any boxes or add any files, so that the repository starts out empty

Empty repository

You have created an empty repository. The quick start information provides clues on how to see your first commits.

Configure your clone

To push and pull from your local repo to GitHub, you must configure your local repo with the URL of the remote repo. By convention, we call the central copy the “origin”.

git remote add origin <URL>
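To check that the remote was registered, you can list the configured remotes. In this sketch a local bare repository stands in for the GitHub URL, which is an assumption made so the example runs without a network connection or an account.

```shell
set -eu
work=$(mktemp -d); cd "$work"
git init -q --bare hub.git                 # stand-in; on GitHub this would be an https:// URL
git init -q project                        # hypothetical local repo
cd project

git remote add origin "$work/hub.git"      # register the central copy under the name origin
git remote -v                              # lists the fetch and push URLs for each remote
url=$(git remote get-url origin)
```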

Push your commit up to the origin.

IMPORTANT: When you are prompted to enter your password, paste your personal access token into the prompt, not the password that you use to sign in to GitHub.com in your browser. On Windows you will need to use Shift+Insert or right-click to paste, because Ctrl+V will not work in a terminal window.

git push -u origin main
Username for 'https://github.com': <username>
Password for 'https://<username>@github.com': 
Counting objects: <progress>
Delta compression using up to 4 threads.
Compressing objects: <progress>
Writing objects: <progress>
<stats>
remote: Resolving deltas: <progress>
To 'https://github.com/<username>/<repo>.git'
   <sha>..<sha>  main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.

Take a look at the repository on GitHub.

  • There is a space for files
  • There is a suggestion to create a README.md, a project summary in Markdown.
  • You are looking at a branch called main.
  • The commit history is available from the top bar.
  • The “Clone or download” button provides a URL.

GitHub Editor

The online editor is good for quick-n-easy fixes, and for working on documentation. It’s a bad place to modify code, because the code is not tested before reaching the origin. It’s great for creating a project README.

Exercise 2

Create a new file called “README.md” and add the following content on separate lines with a blank line in between.

  1. A title, preceded by # (the markdown “level 1” heading)
  2. An “About” section, preceded by ## (the markdown “level 2” heading)
  3. A “Contributors” section, preceded by ##
  4. Your name, preceded by - (the markdown bulleted list)

As you go, utilize the Preview tab to see the result of rendering your Markdown to HTML.



Merging

An essential component of the centralized workflow is the ability to merge commit histories that have diverged. Each fork in the log has to be re-integrated, and git does this automatically through merging.

git add <path>
git commit -m 'feel the learn'
[main <sha>] feel the learn
 5 files changed, 955 insertions(+)

Merge commits most commonly arise when a commit shows up on GitHub that isn’t in your local clone, as in the current situation.


Image by Atlassian / CC BY

Even though these changes do not conflict, GitHub won’t allow you to push. Take a moment to read the message; it gives a good explanation of what has happened.

git push
To https://github.com/<username>/<repo>.git
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'https://github.com/<username>/<repo>.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Merge Locally

The origin does not even attempt to reconcile diverging commit histories; it does not matter that the diverging commits affect separate files. In order to preserve the repo, the contributor is always responsible for “overseeing” the merge on a local clone.

Take the Hint!

git pull
remote: Counting objects: <progress>
remote: Compressing objects: <progress>
remote: <stats>
Unpacking objects: <progress>
From https://github.com/<username>/<repo>
   <sha>..<sha>  main     -> origin/main
Auto-merging README.md
Merge made by the 'recursive' strategy.
 README.md | 1 +
 1 file changed, 1 insertion(+)

The message tells you about any changes made by this merge commit, which seamlessly integrates changes to the same file by multiple authors.
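The whole sequence (rejected push, local merge, successful push) can be rehearsed offline. In the sketch below a local bare repository stands in for GitHub, and the owner and collaborator folders are hypothetical. It assumes git 2.28 or later for -b. The pull passes --no-rebase explicitly, because newer versions of git may otherwise ask you to choose how to reconcile diverged branches.

```shell
set -eu
work=$(mktemp -d); cd "$work"
git init -q --bare -b main hub.git                 # local stand-in for the GitHub origin

# Owner creates the project and pushes the first commit.
git init -q -b main owner
cd owner
git config user.name "Owner"; git config user.email "owner@example.com"
echo "# Project" > README.md
git add README.md; git commit -q -m "initial commit"
git remote add origin "$work/hub.git"
git push -q -u origin main

# Collaborator clones, edits README.md, and pushes first.
git clone -q "$work/hub.git" "$work/collaborator"
cd "$work/collaborator"
git config user.name "Collaborator"; git config user.email "collab@example.com"
echo "- My Neighbor" >> README.md
git commit -q -am "add a collaborator"
git push -q

# Owner commits a different file without pulling, so the histories diverge.
cd "$work/owner"
echo "x <- 1" > analysis.R
git add analysis.R; git commit -q -m "feel the learn"
if git push -q 2>/dev/null; then first_push=accepted; else first_push=rejected; fi

# Take the hint: merge locally, then push again.
git pull -q --no-rebase --no-edit
n_parents=$(git log -1 --format=%P | wc -w | tr -d ' ')   # a merge commit has 2 parents
git push -q
local_head=$(git rev-parse HEAD)
hub_head=$(git -C "$work/hub.git" rev-parse main)
echo "first push: $first_push, parents of latest commit: $n_parents"
```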



Working with Collaborators

True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.

  • Be it data, a script, or a write-up, who has the most up-to-date version?
  • Will a teammate’s work overwrite any of your own?
  • How do I recover the working version of code the PI broke?

Centralized workflows, managed by git, help solve these challenges.

Project Integrity

  • The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
  • Diverging files are easily reintegrated with a merge algorithm.
  • The complete project history is available to check out.
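As one way to recover a working version of a file from the history, the sketch below reads an earlier version and then copies it back into the working directory. The script contents and commit messages are invented for the demo.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Your Name"; git config user.email "you@example.com"
echo "x <- 1  # works" > script.R
git add script.R; git commit -q -m "working version"
echo "x <- stop('broken')" > script.R
git commit -q -am "the PI's change"

previous=$(git rev-parse --short HEAD~1)       # the commit before the latest one
old_text=$(git show "$previous:script.R")      # read the old version without changing any files
git checkout -q "$previous" -- script.R        # copy the old version into the working directory
restored=$(cat script.R)
echo "$restored"
```

The restored file is staged but not yet committed, so you can inspect it before recording the fix.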

Note, version control works really well with text. Non-textual components of your project (e.g. large or binary data) rarely live in a repository. Use cloud storage for more static files and a database for dynamic records.

## Collaborators

- <your name>
- My Neighbor

Add a section where you can list collaborators to the README.md file. Our aim is to let your project collaborator replace “My Neighbor” with his or her name.

Commit it with git

Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) what changes to include.

git add README.md
git commit -m 'just me so far!'

Push

Look at the git status and notice that your branch is ahead of origin/main! Push those commit(s) to your GitHub repo.

Collaborate!

The first step to collaborative workflows is granting access to the origin of your project. Introduce yourself to your neighbor, and decide which of you will be the “owner” and which the “collaborator”. The owner will need the collaborator’s GitHub username.

Even on public GitHub repos, only the owner has “push access” by default. The owner can allow any other GitHub user to push by inviting collaborators under the Settings tab (Settings > Manage access > Invite a collaborator).

Add your neighbor as a collaborator!

Exercise 3

As the collaborator on your neighbor’s repository, you have permission to edit his or her README.md. Make sure you accept the invitation to collaborate in your email!

The text below shows where you’ll see the owner’s name if you’re looking at the right repo (not your own). The collaborator should edit the file in the owner’s repo, replacing “My Neighbor” with his or her own name.

## Collaborators

- <the owner's name>
- <your name>

Write a meaningful commit message while “saving” your work. Note that on the GitHub editor, there’s no distinction between save and commit. The owners should then pull the new commit into their local clone of the project.



Merge Conflicts

Diverging commits that do not affect the same files, or affect different lines within a file, can usually be merged automatically. That’s what happened in the previous example where everything happened in sequence. First, the owner committed and pushed, then the collaborator pulled, committed, and pushed, then the owner pulled again. But if both owner and collaborator modify the same lines of the same file before synchronizing, git cannot safely merge the commits because it has no way of knowing which version to use. If git cannot safely merge commits, it guides you through conflict resolution.

A “merge conflict” will arise when two contributors change the same line of text, for example, if you both add a project description.

The owner adds a description under “## About” in the local clone. Meanwhile the collaborator adds a description under “## About” using the GitHub editor in the owner’s repository.

## About

...

The owner commits his or her change, but receives an error message from git when attempting to pull.

git pull
CONFLICT (content): Merge conflict in <path>
Automatic merge failed; fix conflicts and then commit the result.

Any conflicted region is fenced off in the named files with conflict markers and must be manually tidied up.

<<<<<<< indicates the beginning of your version of the conflicted section, then ======= indicates the beginning of your neighbor’s version, which ends with >>>>>>>.

<<<<<<< HEAD:main
 ...
=======
 ...
>>>>>>>

Follow all the instructions in the original message (or ask again with a git status):

git status
You have unmerged paths.
 (fix conflicts and run "git commit")
 
Unmerged paths:
 (use "git add <file>..." to mark resolution)
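The full conflict-and-resolution cycle can be practiced offline before trying it with your neighbor. This sketch uses a local bare repository as a stand-in for GitHub, hypothetical owner and collaborator folders, and made-up descriptions. It assumes git 2.28 or later for -b. How you combine the two versions is your decision; keeping both here is an arbitrary choice for the demo.

```shell
set -eu
work=$(mktemp -d); cd "$work"
git init -q --bare -b main hub.git                  # local stand-in for the GitHub origin

git init -q -b main owner
cd owner
git config user.name "Owner"; git config user.email "owner@example.com"
printf '# Project\n\n## About\n\nTBD\n' > README.md
git add README.md; git commit -q -m "initial commit"
git remote add origin "$work/hub.git"
git push -q -u origin main

# Collaborator replaces the placeholder line and pushes.
git clone -q "$work/hub.git" "$work/collaborator"
cd "$work/collaborator"
git config user.name "Collaborator"; git config user.email "collab@example.com"
printf '# Project\n\n## About\n\nA study of birds.\n' > README.md
git commit -q -am "describe the project"
git push -q

# Owner changes the same line, commits, and then pulls.
cd "$work/owner"
printf '# Project\n\n## About\n\nA study of bees.\n' > README.md
git commit -q -am "describe the project"
if git pull -q --no-rebase --no-edit >/dev/null 2>&1; then pull_result=merged; else pull_result=conflict; fi

n_markers=$(grep -cE '^(<<<<<<<|=======|>>>>>>>)' README.md)   # conflict marker lines
unmerged=$(git diff --name-only --diff-filter=U)               # files awaiting resolution

# Resolve by hand: write the text you want and delete the markers.
printf '# Project\n\n## About\n\nA study of birds and bees.\n' > README.md
git add README.md            # mark the conflict as resolved
git commit -q --no-edit      # conclude the merge with the prepared message
n_parents=$(git log -1 --format=%P | wc -w | tr -d ' ')
git push -q
```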

Important note: If you find resolving merge conflicts confusing, the best way to avoid them is to pull before you push! That means always pull the most recent version of the repo from the remote before making changes. That way, merge conflicts will only occur if you and your collaborator(s) are working on the code at the exact same time.

Exercise 4

Switch roles with your neighbor and repeat both Exercise 3 and the steps above to introduce and resolve a merge conflict.



Share and Contribute

  • The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only by your collaborators, but also available for review and extension by your research community.

  • GitHub is the home of the vast majority of open source software, including R and Python packages, that help research advance. Through GitHub you can track issues with software you use, pitch in on solving problems, and even submit “pull requests” for new features you develop.


