Storing Data On Archive

From Storrs HPC Wiki
Revision as of 13:21, 9 August 2017 by Lwm14001

There are several ways to generate and maintain backups, which must be stored on /archive/, and depending on what you'd like to use the backups for, you'll prefer some options over others. Three of the main factors to weigh are the time to generate an archive, the time to transfer it to permanent storage, and the time to restore from it. You must also consider whether you'd like only the most recent backup of the current state, or backups from, say, a day, a week, and a month ago. If you would like help automating the backup process, please let us know.

Method   Required Effort           Full Transfer   Partial Transfer   Restoring 1 file         Checkpointed
tar      Moderate BASH scripting   3x              3x                 1/2x                     no
rsync    One BASH command          20x             1x                 ~0x (depends on file)    yes
globus   Web Interface             10x             1x                 ~0x (depends on file)    yes

Runtimes are proportional to one `rsync -avhu src/ dest/`, where the src and dest directories already match.

Rsync Backups

The goal of an rsync-based backup is the appearance of a mirrored file system, i.e. the source and destination directories look the same.

Assuming you have a directory on /archive/ named after your username, to mirror your home directory to /archive/ you would do:

mkdir -p /archive/$USER/backups/home/
rsync -avhu ~/ /archive/$USER/backups/home/

The -u flag makes rsync update only those files at the destination which are older than their source counterparts, meaning that even though your first run of rsync will be as slow as (or slower than) a tar backup, future updates with rsync will be faster.

Restoring from an rsync backup is simple: just swap the source and destination arguments. If you only want to grab a couple of files, you can copy them back over the way you normally copy files.


Globus Backups

Select /archive/your_netid/ as the destination in a Globus transfer and follow the guide here, noting that you can select multiple files to transfer at once.

Disk Image Style Backups (tar)

While you will not be creating an actual disk image, the idea here is to generate an exact copy of the state of all files within a directory at a certain time. The following example script will back up your home directory and write the backup to /scratch/$USER/backups/, where $USER evaluates to your username. If you use this script, be aware that you must move the backup file from /scratch/ to /archive/; otherwise the backup will be deleted by our scratch clean-up scripts.

#!/bin/bash
#SBATCH -p debug
#SBATCH -c 2

# The goal of this example script is to generate a backup of our home directory,
# so we set TARGET equal to our home directory (~/). If you re-use this script,
# all you have to do is change ~/ to whatever directory you want to back up.
TARGET=~/

# This creates a backup directory on scratch; the compute nodes do not have
# access to /archive/, so we'll have to copy the resulting archive after the
# script is done.
mkdir -p /scratch/$USER/backups/

# The following line creates an archive of the path assigned to TARGET in the
# new directory above, named in the year-month-day_hours_minutes_seconds
# format. This format keeps each backup separate from all future and
# previous backups.
tar -cvzf /scratch/$USER/backups/$(date "+%Y-%m-%d_%Hh_%Mm_%Ss").tar.gz $TARGET

The main issue with generating backups this way is that you have to regenerate the .tar.gz file every time, which involves copying every file into the archive; it also means you cannot easily cancel and restart the process. However, it works particularly well with SLURM, as you can just submit a job once a week to do the backup, then copy it over to /archive/ once the backup is done. Assuming you have a directory on /archive/ named after yourself (/archive/$USER/), you would then do the following to copy your backup to the archive directory.

mkdir -p /archive/$USER/backups/
cp -vu /scratch/$USER/backups/* /archive/$USER/backups/
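The weekly submission mentioned above can be automated by having the job re-queue itself with sbatch's --begin option. This is only a sketch under assumptions: the script is saved as weekly_backup.sh (a hypothetical name) in the directory you submit from, and the cp to /archive/ still has to run from a login node, since compute nodes cannot reach /archive/.

```shell
#!/bin/bash
#SBATCH -p debug
#SBATCH -c 2

# Directory to back up; change to whatever you want archived
TARGET=~/

mkdir -p /scratch/$USER/backups/
tar -czf /scratch/$USER/backups/$(date "+%Y-%m-%d_%Hh_%Mm_%Ss").tar.gz $TARGET

# Re-queue this same script to run again one week from now.
# "weekly_backup.sh" is a hypothetical name -- use whatever you saved it as.
sbatch --begin=now+7days weekly_backup.sh
```

Submit it once with `sbatch weekly_backup.sh`; each run then schedules the next. You still copy the finished archives from /scratch/ to /archive/ yourself, as shown above.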

While other methods re-index on each run, they do not have to copy every file over every time; with a tar ball you are moving a complete extra copy to /archive/ every time you do a backup. This speed loss from redundancy is offset by compression: the compressed tar ball passes more easily through the relatively low-bandwidth connection to /archive/.

Move the archive to /archive/ for storage, and copy it back onto scratch when you'd like to unpack it. You can extract the archive with:

tar -xvf backup_name.tar.gz

This will extract the archive into the current working directory, overwriting files in the directory with their versions from the backup. Check the tar man page for more options.

This does imply a significant amount of overhead if you just want to recover a single file, as you have to copy and unpack the entire backup, as opposed to just copying a single file with globus.
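If you only need one file, tar can list the archive's contents and extract just that member, so you avoid unpacking everything (you still have to copy the whole .tar.gz back from /archive/ first; the member path below is hypothetical):

```shell
# List the archive to find the path of the file you need
tar -tzf backup_name.tar.gz

# Extract only that one member into the current directory
tar -xzf backup_name.tar.gz home/your_netid/project/results.csv
```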