Storing Data On Archive

From Storrs HPC Wiki

Latest revision as of 14:32, 1 December 2017

There are several ways to generate and maintain backups, and depending on what you'd like to use the backups for, you'll prefer some options over others. The three main performance factors to weigh are the time to generate an archive, the time to transfer that archive to permanent storage, and the time to restore from it. You must also consider whether you only want a backup of the current state, or distinct backups from, say, a day, a week, and a month ago. If you would like help automating the backup process, please let us know.

{| border=1
|+ ''Runtimes are proportional to one `rsync -avhu src/ dest/` where the src and dest directories already match.''
|-
! Method
! Required Effort
! Full Transfer Time
! Partial Transfer Time
! Time to Restore 1 File
! Checkpointed
|-
| tar
| Moderate BASH scripting
| 3x
| 3x
| 1/2x
| no
|-
| rsync
| One BASH command
| 20x
| 1x
| ~0x (depends on file)
| yes
|-
| globus
| Web interface
| 10x
| 1x
| ~0x (depends on file)
| yes
|}

= Rsync Backups =

The goal of an rsync-based backup is the appearance of a mirrored file system, i.e. the source and destination directories look the same.

Assuming you have a directory on /archive/ named after your username, to mirror your home directory to /archive/ you would do:

 mkdir -p /archive/$USER/backups/home/
 rsync -avhu ~/ /archive/$USER/backups/home/

rsync will only update files at the destination that are older than their source counterparts, so even though your first run of rsync will be as slow as (or slower than) a tar backup, subsequent updates with rsync will be faster.

Restoring from an rsync backup is simple: just swap the source and destination arguments. If you only want to grab a couple of files, you can copy them back over the way you normally copy files.

= Globus =

Just select /archive/''your_netid''/ as the destination in the Globus transfer, and follow the guide [[Globus_Connect|here]], noting that you can select multiple files to transfer at once.

= Disk Image Style Backups (tar) =

While you will not be creating an actual disk image, the idea here is to generate an exact copy of the state of all files within a directory at a certain time. The following example script will back up your home directory and write the backup to /scratch/$USER/backups/, where $USER evaluates to your user name. Be aware that if you use this script, you have to move the backup file from /scratch/ to /archive/; otherwise the backup will be deleted by our scratch clean-up scripts.

 #!/bin/bash
 #SBATCH -p debug
 #SBATCH -c 2
 
 # The goal of this example script is to generate a backup of our home directory,
 # so we begin by setting TARGET equal to our home directory, i.e. ~/ .
 # If you re-use this script, change ~/ to whatever directory you want to back up.
 TARGET=~/
 
 # Create a backup directory on /scratch/. As the compute nodes do not have
 # access to /archive/, we'll have to copy the backup to /archive/ after the job is done.
 mkdir -p /scratch/$USER/backups/
 
 # Create a backup of the path assigned to TARGET in the above directory, named
 # with a year-month-day_hours_minutes_seconds timestamp. This format keeps each
 # backup separate from all future and previous backups.
 tar -cvzf /scratch/$USER/backups/$(date "+%Y-%m-%d_%Hh_%Mm_%Ss").tar.gz "$TARGET"
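For reference, the `$(date ...)` expression embeds a timestamp in the backup's file name; you can run it on its own to see the format it produces (the exact value depends on when you run it):

```shell
# Print the timestamp format used in the backup file name above.
date "+%Y-%m-%d_%Hh_%Mm_%Ss"
```

A name like 2017-12-01_14h_32m_05s.tar.gz sorts chronologically, so `ls` shows backups oldest-first.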

The main issue with tarball backups is that you have to regenerate the .tar.gz file every time, which means copying every file into the archive. It also means you cannot easily cancel and restart the process. However, this approach works particularly well with SLURM, as you can simply submit a job once a week to do the backup. Once the job is done, copy the backup in /scratch/$USER/backups/ over to /archive/. Assuming you have a directory on /archive/ named after yourself (/archive/$USER/), you would do the following to copy your backup to the archive directory:

 mkdir -p /archive/$USER/backups/
 cp -vu /scratch/$USER/backups/* /archive/$USER/backups/


Other backup methods will still re-index every file, but they will not have to copy every file into a tarball each time. tar's speed loss is partly offset by compression: a compressed tarball travels more easily through the relatively low-bandwidth connection to /archive/.

In total, you have to generate a tarball, move it to /archive/ for storage, and then copy it back onto scratch when you'd like to unpack it. This implies significant overhead if you just want to recover a single file, as you have to copy and unpack the entire backup, as opposed to copying a single file with globus.

You can extract the archive into your current directory with:

 tar -xvf backup_name.tar.gz

This will extract the archive into the current working directory, '''overwriting files in the directory with their versions in the backup'''. Check the tar man page for more options, particularly --skip-old-files, which, when unpacking, will not overwrite a file if the copy in the tarball is older than the one in the directory you unpack into.
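If you only need one file, you can also list the archive's contents and extract just that path instead of unpacking everything. A small self-contained sketch (the /tmp paths and file names here are illustrative, not real backup locations):

```shell
# Build a tiny throwaway archive to demonstrate on.
mkdir -p /tmp/tar_demo/data
echo "hello" > /tmp/tar_demo/data/notes.txt
tar -czf /tmp/tar_demo/backup_name.tar.gz -C /tmp/tar_demo data/

# List the archive's contents without extracting anything.
tar -tzf /tmp/tar_demo/backup_name.tar.gz

# Extract only a single member, using the path shown by the listing above.
cd /tmp/tar_demo && tar -xzf backup_name.tar.gz data/notes.txt
```

Listing first (`-t`) is worthwhile because member paths in the archive must be matched exactly when extracting a single file.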