Storing Data On Archive

There are several ways to generate and maintain backups, and depending on what you'd like to use the backups for, you'll prefer some options over others. The three main performance factors to weigh are the time to generate an archive, the time to transfer it to permanent storage, and the time to restore from it. You must also consider whether you only want a backup of the current state, or distinct backups from, say, a day, a week, and a month ago. If you would like help automating the backup process, please let us know.

Method   Required Effort           Full Transfer Time   Partial Transfer Time   Restoring 1 File Time   Checkpointed
tar      Moderate BASH scripting   3x                   3x                      1/2x                    no
rsync    One BASH command          20x                  1x                      ~0x (depends on file)   yes
globus   Web interface             10x                  1x                      ~0x (depends on file)   yes

Runtimes are proportional to one `rsync -avhu src/ dest/`, where the src and dest directories already match.

Rsync Backups

The goal of an rsync-based backup is to get the appearance of a mirrored file system, i.e. the source and destination directories look the same.

Assuming you have a directory on /archive/ named after your username, to mirror your home directory to /archive/ you would do:

mkdir -p /archive/$USER/backups/home/
rsync -avhu ~/ /archive/$USER/backups/home/

The -u flag makes rsync only update files at the destination that are older than their source counterparts, meaning that even though your first run of rsync will be as slow as (or slower than) a tar backup, future updates with rsync will be faster.
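
If you would like to preview what rsync would transfer before committing to it, the -n (--dry-run) flag lists the planned updates without performing them; for example:

rsync -avhun ~/ /archive/$USER/backups/home/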

Restoring from an rsync backup is simple: you just swap the source and destination arguments. If you only want to grab a couple of files, you can copy them back over the way you normally copy files.
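
For example, a minimal sketch of a full restore, which simply reverses the mirror command above (path/to/file is a hypothetical placeholder for a file you want back):

# Mirror the backup back into your home directory
rsync -avhu /archive/$USER/backups/home/ ~/
# Or copy back a single file
cp /archive/$USER/backups/home/path/to/file ~/path/to/file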

Globus

Just select /archive/your_netid/ as the destination in the Globus transfer and follow the guide here, noting that you can select multiple files to transfer at once.

Disk Image Style Backups (tar)

While you will not be creating an actual disk image, the idea here is to generate an exact copy of the state of all files within a directory at a certain time. The following example script will back up your home directory and write the backup to /scratch/$USER/backups/, where $USER evaluates to your username. Be aware that if you use this script, you must move the backup file from /scratch/ to /archive/; otherwise the backup will be deleted by our scratch clean-up scripts.

#!/bin/bash
# Run on the debug partition with two CPU cores.
#SBATCH -p debug
#SBATCH -c 2

# The goal of this example script is to generate a backup of our home directory,
# so we set TARGET to our home directory (~/). If you re-use this script,
# change ~/ to whatever directory you want to back up.
TARGET=~/

# Create a backup directory on /scratch/. The compute nodes do not have
# access to /archive/, so we'll have to copy the backup to /archive/ after
# the script is done.
mkdir -p /scratch/$USER/backups/

# The following line archives the path assigned to TARGET into the new
# directory, naming the file with a year-month-day_hours_minutes_seconds
# timestamp. The timestamp keeps each backup separate from all future and
# previous backups.
tar -cvzf /scratch/$USER/backups/$(date "+%Y-%m-%d_%Hh_%Mm_%Ss").tar.gz "$TARGET"

The main issue with tarball backups is that you have to regenerate the .tar.gz file every time, which means copying every file into the archive! It also means that you cannot easily cancel and restart the process. However, this approach works particularly well with SLURM, as you can just submit a job once a week to do the backup. Once the job is done, you copy the backup in /scratch/$USER/backups/ over to /archive/. If we assume you have a directory on /archive/ named after yourself (/archive/$USER/), you would then do the following to copy your backup to the archive directory.

mkdir -p /archive/$USER/backups/
cp -vu /scratch/$USER/backups/* /archive/$USER/backups/
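
If you would like the backup to recur weekly, one option is to queue the next run with a deferred start time via sbatch's --begin flag; a minimal sketch, assuming the script above is saved as backup.sh (a hypothetical file name):

# Queue backup.sh to start roughly one week from now
sbatch --begin=now+7days backup.sh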


While other methods still re-index every file, they do not have to copy every file into a tarball each time. This speed loss is offset by the fact that the tarball is compressed: a compressed archive has an easier time travelling through the relatively low-bandwidth connection to /archive/.

In total, you have to generate a tarball, move it to /archive/ for storage, and then copy it back onto scratch when you'd like to unpack it. This does imply a significant amount of overhead if you just want to recover a single file, as you have to copy and unpack the entire backup, as opposed to copying a single file with Globus.
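
That said, once the tarball is back on scratch, tar can extract a single member without unpacking everything; a sketch, where the member path is hypothetical (use tar -tf to see the actual paths stored in your archive):

# List the archive's contents to find the stored path of the file you want
tar -tf backup_name.tar.gz
# Extract just that one member into the current directory
tar -xvf backup_name.tar.gz home/your_netid/path/to/file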

You can extract the entire archive into your current directory with:

tar -xvf backup_name.tar.gz

This will extract the archive into the current working directory, overwriting files in the directory with their versions from the backup. Check the tar man page for more options; in particular, --skip-old-files will skip files that already exist in the directory you unpack into, and --keep-newer-files will only skip existing files that are newer than their archived copies.
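
For example, to unpack without overwriting anything already present:

tar -xvf backup_name.tar.gz --skip-old-files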