Bringing Git to data archival

Tue 15 January 2013

I am increasingly excited about distributed version control and how it enables easy collaboration between software developers without technical and social barriers such as synchronisation of work and maintenance of control.

The obvious question is how DVC systems can be applied to scientific collaboration and to my particular specialism -- large-scale data access and archiving for scientists. I'm not alone in asking the question either. There was a flurry of discussion in the blogosphere in 2011 around the idea of a GitHub for Science, and Git has regularly come up when discussing solutions for managing the CMIP5 archive.

So I was really excited when I discovered git-annex. This tool looks like an excellent fit for many of the challenges we have faced in developing the data infrastructure for CMIP5, and it has the potential to bring a more radical git-like workflow to how scientists obtain data. To explain why, I need to describe one aspect of data management for CMIP5.

CMIP5 and drslib

My particular contribution to CMIP5 has been drslib, a library for maintaining the directory structure used to store CMIP5's 1-2 PB of data. To cut a long story short, drslib maintains a tree of thousands of dataset directories, each containing a collection of files ranging from megabytes to tens of gigabytes. Each dataset can go through several versions, and each version is visible on the filesystem as a separate subdirectory. The challenges for drslib are:

  1. Manage changes from one version to another.
  2. De-duplicate the storage so that a file which exists in multiple versions does not need to be stored twice.
  3. Just use the filesystem so that standard data transfer tools like FTP would work.

This is achieved by storing all files in a separate storage subdirectory, inspiringly called "files", and symbolically linking files from there into version subdirectories named v<YYYYMMDD>. For instance, the structure of a dataset with 2 variables and 2 versions looks something like this:

$ tree -Fd .
├── files
│   ├── sbl_20111109
│   ├── sbl_20120105
│   ├── snw_20111109
│   └── snw_20120105
├── latest -> v20120105
├── v20111109
│   ├── sbl
│   └── snw
└── v20120105
    ├── sbl
    └── snw

Each leaf directory in files contains the real data, and the leaf directories under v<YYYYMMDD> contain symbolic links.
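To make the layout concrete, here is a minimal shell sketch that builds a toy dataset of the same shape in a temporary directory. The file names (sbl_a.nc, sbl_b.nc) and their contents are made up for illustration, and drslib itself does considerably more bookkeeping than this.

```shell
#!/bin/sh
# Toy reconstruction of a drslib-style dataset: real files live under files/,
# version directories hold only symbolic links. File names are hypothetical.
set -eu

root=$(mktemp -d)
cd "$root"

mkdir -p files/sbl_20111109 files/sbl_20120105 v20111109/sbl v20120105/sbl

# One file published in the first version, one added in the second.
echo "first"  > files/sbl_20111109/sbl_a.nc
echo "second" > files/sbl_20120105/sbl_b.nc

# v20111109 exposes only the original file.
ln -s ../../files/sbl_20111109/sbl_a.nc v20111109/sbl/sbl_a.nc

# v20120105 re-uses the unchanged file and adds the new one.
ln -s ../../files/sbl_20111109/sbl_a.nc v20120105/sbl/sbl_a.nc
ln -s ../../files/sbl_20120105/sbl_b.nc v20120105/sbl/sbl_b.nc

# "latest" always points at the newest version directory.
ln -s v20120105 latest

ls -l latest/sbl
```

Only two real files exist under files, yet v20120105 exposes both of them, and sbl_a.nc is stored once even though it appears in two versions.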

Symbolic linking was very controversial in the project but appeared to be the only way of avoiding storing a file twice whilst supporting FTP-like services. As you will discover, I feel somewhat vindicated that this was a sensible design.

git-annex in a nutshell

Git-annex is one of several tools that solve the problem of Git not working well with large files. The git-annex website has an excellent explanation of the other options and why git-annex is distinctive. I think it is a particularly good fit for our use cases.

Git-annex stores large files in the subdirectory .git/annex and checks only metadata about each file into git. Each clone of the repository keeps a complete history of where an annexed file can be found, whether in a remote's annex, on the web or in something called a special remote, but the clone only downloads the file itself if requested. It then symbolically links the file into the working tree. The result is remarkably similar to what drslib does, only much better engineered of course!
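The idea of keeping content in one place and linking it into the tree can be sketched with plain shell tools. This is a toy illustration, not git-annex's actual implementation: the store directory, the toy_annex_add function and the use of a bare SHA-256 as the object name are all assumptions made for the example, and sha256sum assumes GNU coreutils.

```shell
#!/bin/sh
# Toy content-addressed store: NOT how git-annex lays out .git/annex,
# just the same principle with hypothetical names.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir -p store v1 v2

# toy_annex_add FILE: move FILE into store/ under its checksum and
# leave a symbolic link behind at the original path.
toy_annex_add() {
    file=$1
    key=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ -e "store/$key" ]; then
        rm "$file"                 # content already stored once
    else
        mv "$file" "store/$key"
    fi
    # Both example files sit one directory below $work, hence "../".
    ln -s "../store/$key" "$file"
}

# The same bytes published in two "versions".
echo "same bytes" > v1/data.nc
echo "same bytes" > v2/data.nc

toy_annex_add v1/data.nc
toy_annex_add v2/data.nc

# Two paths in the tree, one object in the store.
ls store | wc -l
```

Because the object name is derived from the content, adding the same file under a second version costs only a symlink, which is the property drslib gets from its files directory.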

An example

I will demonstrate by showing how you might migrate a CMIP5 dataset directory from drslib's structure to one managed by git-annex without changing the versioned file paths and without sacrificing deduplication.
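A minimal sketch of one way such a migration could go, not necessarily the exact steps used for CMIP5. The migrate_drs function and its arguments are hypothetical. It assumes git-annex is installed, that a git identity is configured, and that git-annex is using a checksum-based backend (such as the current default, SHA256E) so that identical files under different versions map to the same key.

```shell
#!/bin/sh
# Sketch: copy the version directories of a drslib dataset into a fresh
# git-annex repository. migrate_drs is a hypothetical helper, not part of
# drslib or git-annex.
set -eu

# migrate_drs SRC DEST
#   SRC  - drslib dataset directory containing files/ and v<YYYYMMDD>/
#   DEST - directory to create the new git-annex repository in
migrate_drs() {
    src=$1
    dest=$2
    mkdir -p "$dest"

    # cp -RL follows drslib's symlinks, so DEST temporarily holds a full
    # copy of every version; the files/ directory itself is not copied.
    for version in "$src"/v*; do
        cp -RL "$version" "$dest/"
    done

    (
        cd "$dest"
        git init -q
        git annex init "migrated from drslib"
        # Moves each file's content into .git/annex/objects and replaces
        # it with a symlink; identical content shares one object.
        git annex add .
        git commit -q -m "Import drslib versions"
    )
}

# Run only when called with both paths, e.g.
#   sh migrate.sh /path/to/drslib/dataset /path/to/new/repo
if [ "$#" -eq 2 ]; then
    migrate_drs "$1" "$2"
fi
```

The versioned paths (v20111109/sbl/... and so on) are unchanged in the new repository, and a file that appears in several versions ends up as several symlinks to a single annexed object.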

The resulting annex structure is a little more complex than drslib's files directory but manageably so. The annex has a similar structure to git's object database only with configurable object naming. See git-annex internals for details.

$ find .git/annex/objects -type d | head -n 20


So git-annex is remarkably similar to drslib in the way it de-duplicates large files on the file system. It could replace drslib's de-duplication and version transition logic without having any impact on what the end-user sees. These features come with the full advantage of git for robust version tracking and cloning. I will be investigating this further as we prepare to take on data from the CORDEX project.

After a few days of investigating git-annex, the software seems remarkably robust and worth pursuing further. Development is active and there are RPM and DEB packages available. If we can make it work, many possibilities open up beyond this narrow use case, such as replication via cloned repositories or allowing advanced users to clone a repository to get a traceable version history.

Category: Data Science Tagged: esgf cmip5