Storing Data On Archive


There are several ways to generate and maintain backups, and depending on what you'd like to use the backups for, you'll prefer some options over others. The three main performance factors to weigh are the time to generate an archive, the time to transfer that archive to permanent storage, and the time to restore from it. You must also consider whether you only want a backup of the current state, or distinct backups from, say, a day, a week, and a month ago. If you would like help automating the backup process, please let us know.

Method    Required Effort            Full Transfer time    Partial Transfer time    Restoring 1 file time    Checkpointed
tar       Moderate BASH scripting    3x                    3x                       1/2x                     no
rsync     One BASH command           20x                   1x                       ~0x (depends on file)    yes
globus    Web Interface              10x                   1x                       ~0x (depends on file)    yes

Runtimes are proportional to one `rsync -avhu src/ dest/`, where the src and dest directories already match.

Rsync Backups

The goal of an rsync-based backup is to get the appearance of a mirrored file system, i.e., the source and destination directories look the same.

Assuming you have a directory on /archive/ named after your username, to mirror your home directory to /archive/ you would do:

mkdir -p /archive/$USER/backups/home/
rsync -avhu ~/ /archive/$USER/backups/home/

The -u flag means rsync only updates files at the destination that are older than their source counterparts. Even though your first run of rsync will be as slow as, or slower than, a tar backup, future updates with rsync will be faster because only changed files are transferred.
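If you want to preview what a subsequent run would transfer without actually copying anything, you can add rsync's dry-run flag to the same command (this is an optional check, not part of the backup itself):

rsync -avhun ~/ /archive/$USER/backups/home/

The -n (--dry-run) flag lists the files that would be updated and then exits without transferring them.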

Restoring from an rsync backup is simple: swap the source and destination arguments. If you only want to grab a couple of files, you can copy them back over just as you would normally copy files.
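For example, to restore the mirrored home directory from above you would run the following; the single file copied in the second command is only an illustration, so substitute a path that actually exists in your backup:

rsync -avhu /archive/$USER/backups/home/ ~/
cp /archive/$USER/backups/home/some_file.txt ~/some_file.txt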

Globus

Just select /archive/your_netid/ as the destination in the Globus transfer interface, and follow the guide here, noting that you can select multiple files to transfer at once.

Disk Image Style Backups (tar)

While you will not be creating an actual disk image, the idea here is to generate an exact copy of the state of all files within a directory at a certain time. The following example script will back up your home directory and write the backup to /scratch/$USER/backups/, where $USER evaluates to your user name. Be aware that if you use this script you must move the backup file from /scratch/ to /archive/; otherwise the backup will be deleted by our scratch clean-up scripts.

#!/bin/bash 
#SBATCH -p debug
#SBATCH -c 2

# The goal of this example script is to generate a backup of our home directory.
# So, we begin by setting TARGET equal to our home directory i.e. ~/ . 
# If you re-use this script you should change ~/ to whatever directory you want to backup.
TARGET=~/

# The following generates a backup directory on /scratch/. As the compute nodes do not have
# access to /archive/, we'll have to copy the backup to /archive/ after the script is done.
mkdir -p /scratch/$USER/backups/

# The following line will create a backup of the path assigned to TARGET
# in the directory created above, named with a year-month-day_hours_minutes_seconds
# timestamp.  This format keeps each backup separate from all future and
# previous backups.
tar -cvzf /scratch/$USER/backups/`date "+%Y-%m-%d_%Hh_%Mm_%Ss"`.tar.gz $TARGET

The main issue with tarball backups is that you have to regenerate the .tar.gz file every time, which implies copying every file into the archive. It also means that you cannot easily cancel and restart the process. However, this method works particularly well with SLURM, as you can simply submit a job once a week to do the backup. Once the job is done, copy the backup in /scratch/$USER/backups/ over to /archive/. Assuming you have a directory on /archive/ named after your username (/archive/$USER/), you would then do the following to copy your backup to the archive directory.

mkdir -p /archive/$USER/backups/
cp -vu /scratch/$USER/backups/* /archive/$USER/backups/
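Assuming the backup script above is saved as backup_home.sh (the file name is just an example), the weekly backup job can be submitted from a login node with:

sbatch backup_home.sh

sbatch also accepts a deferred start time, e.g. sbatch --begin=now+7days backup_home.sh, if you'd like to queue next week's backup right after copying this one to /archive/.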


While the other methods also re-index every file, they do not have to copy every file into a tarball each time. This speed loss is partly offset by the fact that the tarball is compressed: a compressed tarball travels more easily through the relatively low-bandwidth connection to /archive/.

In total, you have to generate a tarball, move the tarball to /archive/ for storage, and then copy it back onto /scratch/ when you'd like to unpack it. This implies a significant amount of overhead if you just want to recover a single file, as you have to copy and unpack the entire backup, as opposed to copying a single file with Globus.

You can extract the archive into your current directory with:

tar -xvf backup_name.tar.gz

This will extract the archive into the current working directory, overwriting files in the directory with their versions from the backup. Check the tar man page for more options, particularly the --skip-old-files option, which leaves files that already exist in the destination untouched instead of overwriting them when unpacking.
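If you only need a single file back, tar can extract one path from the archive instead of everything; the file path below is just an illustration and must match how the file is listed inside the tarball, which you can check with the -t option:

tar -tzf backup_name.tar.gz
tar -xvzf backup_name.tar.gz home/your_netid/some_file.txt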