mebioda

Versioning

Version management

Backup

History

Provenance

The problem: how to manage ongoing change

Documents go through many versions, over the course of years
Meanwhile, related files (data, scripts, figures, bibliographies, etc.) are changing as well
You are probably collaborating with others (coauthors, reviewers, editors, etc.)
You probably will not know ahead of time which will be the “final” document version

Part of the solution: work in a project-oriented manner

WS Noble, 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Computational Biology 5(7): e1000424. doi:10.1371/journal.pcbi.1000424

Organize your work in projects (e.g. GTD philosophy)
Follow consistent naming schemes (e.g. ISO-8601 date format)
This allows us to plan for open-ended change by date-stamping folders, but:
- How to deal with mistakes?
- How to “undo”?
- How to collaborate?

We are doing local operations (e.g. writing, editing (code, data), analyzing) on a local copy that is backed up to a remote folder
This is roughly the workflow in dropbox, google drive (where synchronization happens as a background process) as well as in centralized version control systems such as cvs and svn (where synchronization happens after a “commit”).
However, this easily results in conflicts:
- Multiple people work on their local copies of different files at the same time
- When synchronization happens, different files (and their versions) may arrive in the wrong order
- Relationships between files may become scrambled, race conditions may arise, etc.

The distributed approach

Everyone takes individual ownership of the consistency of the entire project
Only after all local changes have been comitted to the local repository do we push the repository “upstream”
If the upstream repository has changed since the last time the local was synchronized, potential conflicts need to be resolved first

`git` - distributed version control

git is an approach to distributed version control that was developed to manage the contributions of thousands of people to the source code of Linux.
There is a standard command line tool for all major operating systems. There are also graphical interfaces (which are not that useful), and plug-ins for development tools (such as RStudio, which is very useful)
Many websites provide remote hosting for git projects. GitHub is a very popular one, but there are others, such as BitBucket or SourceForge

`git` workflow - starting a repository

The first step is to create a repository. This can be:
- A new repository that you start locally (e.g. git init .)
- A repository you start remotely (e.g. “New repository”, top-right menu on GitHub)
- A repository that you fork remotely (e.g. “Fork” button, top-right, on GitHub)
Then, if the repository was started or forked remotely you want a local clone, e.g.:

# using HTTPS:
$ git clone https://github.com/<user name>/<project name>.git

# using SSH:
$ git clone git@github.com:<user name>/<project name>.git

`git` workflow - adding a file

Let’s say I have a file data.tsv that I want to add to a repository:

# I have placed my file in my local repository and check to see if git notices:
$ git status

# response:
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data.tsv

Now add it:

$ git add data.tsv
$ git status

# response:
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	new file:   data.tsv

So now the file is ready to be “committed”. Let’s commit it to our local repository, and add a message (-m 'message text goes here):

$ git commit -m 'adding file data.tsv' data.tsv

# response:
[master d588593] adding data.tsv
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 doc/week2/w2d3/data.tsv

$ git status

# response:
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

Now our file has been incorporated in our local repository. If this is a fork of a remote repository we can “push” the file upstream to the remote repo:

$ git push

# response:
Counting objects: 5, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 455 bytes | 455.00 KiB/s, done.
Total 5 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:naturalis/mebioda.git
   0f031d6..d588593  master -> master

Distributed version control: why?

Infinite undo, all the way to the beginning of the project
Backups on the servers of the version control host (e.g. github)
Explicit history with messages explaining why files were changed and ways to tag specific versions
Distributed collaboration including mechanisms for experimentation (branches) and resolving conflicts

In addition, GitHub provides for a lot of functionality on top of git:

Project management tools, such as an issue tracker
Ways to work in the browser instead of the command line (e.g. to upload or edit and commit without the command line)
Ways to test code automatically
A facility to give a DOI to a specific version

This site is open source. Improve this page.

mebioda

Versioning

The problem: how to manage ongoing change

Part of the solution: work in a project-oriented manner

Backing up and sharing your projects

The distributed approach

git - distributed version control

git workflow - starting a repository

git workflow - adding a file

Distributed version control: why?

`git` - distributed version control

`git` workflow - starting a repository

`git` workflow - adding a file