Storing Data On Archive

There are several ways to generate and maintain backups, which must be stored on /archive/, and depending on what you'd like to use the backups for, you'll prefer some options over others. Three of the main factors to weigh are the time to generate an archive, the time to transfer it to permanent storage, and the time to restore from it. You must also consider whether you only want the most recent backup of the current state, or whether you'd like backups from, say, a day, a week, and a month ago. If you would like help automating the backup process, please let us know.

Method    Required effort           Full transfer time    Incremental transfer time    Single-file restore time    Checkpointed
tar       Moderate Bash scripting   3x                    3x                           1/2x                        no
rsync     One Bash command          20x                   1x                           ~0x (depends on file)       yes
globus    Web interface             10x                   1x                           ~0x (depends on file)       yes

Runtimes are relative to a single `rsync -avhu src/ dest/` run where the src and dest directories already match.

Rsync Backups

The goal of an rsync-based backup is to get the appearance of a mirrored file system, i.e., the source and destination directories look the same.

Assuming you have a directory on /archive/ named after your username, you would run the following to mirror your home directory to /archive/:

mkdir -p /archive/$USER/backups/home/
rsync -avhu ~/ /archive/$USER/backups/home/

This will only update files at the destination that are older than their source counterparts, meaning that even though your first run of rsync will be as slow as (or slower than) a tar backup, future updates with rsync will be much faster.
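
If you want to preview what a run would transfer before committing to it, you can add rsync's dry-run flag (-n); a minimal sketch using the same paths as above:

# Preview which files rsync would copy, without transferring anything
rsync -avhun ~/ /archive/$USER/backups/home/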

Restoring from an rsync backup is simple: just swap the source and destination arguments. If you only want to grab a couple of files, you can copy them back over the way you would normally copy files.
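
For example, to restore the whole mirror back into your home directory, or to pull back a single file (the file path below is purely hypothetical; substitute one of your own), you could run something like:

# Restore the entire mirrored backup into your home directory
rsync -avhu /archive/$USER/backups/home/ ~/

# Or copy back just one file (hypothetical path, for illustration only)
cp /archive/$USER/backups/home/results/run1.csv ~/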

Globus

Just select /archive/your_netid/ as the destination in the Globus transfer interface and follow the Globus Connect guide, noting that you can select multiple files to transfer at once.

Disk Image Style Backups (tar)

While you will not be creating an actual disk image, the idea here is to generate an exact copy of the state of all files within a directory at a certain time. The following example script will back up your home directory and write the backup to /scratch/$USER/backups/, where $USER evaluates to your username. Be aware that if you use this script you must move the backup file from /scratch/ to /archive/ afterwards, otherwise the backup will get deleted by our scratch clean-up scripts.

#!/bin/bash
#SBATCH -p debug
#SBATCH -c 2

# The goal of this example script is to generate a backup of our home directory,
# so we set TARGET equal to our home directory (~/). If you re-use this script,
# all you have to do is change ~/ to whatever directory you want to back up.
TARGET=~/

# Create a backup directory on scratch; the compute nodes do not have access to
# /archive/, so we'll have to copy the resulting archive over after the script
# is done.
mkdir -p /scratch/$USER/backups/

# The following line creates an archive of the path assigned to TARGET in the
# new directory above, named with a year-month-day_hours_minutes_seconds
# timestamp. This format keeps each backup separate from all future and
# previous backups.
tar -cvzf /scratch/$USER/backups/$(date "+%Y-%m-%d_%Hh_%Mm_%Ss").tar.gz $TARGET

The main issue with generating backups this way is that you have to regenerate the .tar.gz file every time, which involves copying every file into the archive. This also means that you cannot easily cancel and restart the process. However, it works particularly well with SLURM, as you can simply submit a job once a week to do the backup and then copy the result over to /archive/ once the backup is done. Assuming you have a directory on /archive/ named after yourself (/archive/$USER/), you would then do the following to copy your backup to the archive directory:

mkdir -p /archive/$USER/backups/
cp -vu /scratch/$USER/backups/* /archive/$USER/backups/
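
As a sketch of the weekly routine, assuming you have saved the script above as backup_home.sh (a hypothetical file name), you could submit it with sbatch; the --begin option is one way to delay the next run, and the sbatch man page documents the exact time formats it accepts:

# Submit the backup job now
sbatch backup_home.sh

# Optionally queue another run roughly a week from now (see man sbatch for --begin formats)
sbatch --begin=now+7days backup_home.sh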


While other methods also re-index, they do not have to copy every file over each time; with a tarball you are moving a complete extra copy to /archive/ every time you do a backup. This extra copying is partly offset by the fact that the tarball is compressed, which makes it easier to move through the relatively low-bandwidth connection to /archive/.

Move the archive to /archive/ for storage, and copy it back onto scratch when you'd like to unpack it. You can extract the archive with:

tar -xvf backup_name.tar.gz

This will extract the archive into the current working directory, overwriting files in the directory with their versions from the backup. It does imply a significant amount of overhead if you just want to recover a single file, as you have to copy and unpack the entire backup, as opposed to just copying a single file over with Globus. Check the tar man page for more options.
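
If the tarball is all you have, tar can also list and extract individual members; the member path below is hypothetical, so list the archive first to find the real path stored in your backup:

# List the contents of the backup to locate the file you need
tar -tzf backup_name.tar.gz

# Extract just that one member into the current directory (hypothetical path)
tar -xzf backup_name.tar.gz home/your_netid/results/run1.csv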