Data Storage Guide

The Storrs HPC cluster has a number of data storage options to meet various needs. There is a high-speed scratch file system, which allows parallel file writing from all compute nodes. All users also get a persistent home directory, and groups of users can request private shared folders. Once data is no longer needed for computation, it should be transferred off of the cluster to a permanent data storage location. To meet this need, the university offers a data archival service that features over three petabytes of capacity. Data transfer to permanent locations should be done via the web-based [[Globus_Connect|Globus]] service.

= HPC Storage (short term) =

The Storrs HPC cluster has a number of local high-performance data storage options available for use during job execution and for short-term storage of job results. None of the cluster storage options listed below should be considered permanent, and they should not be used for long-term archival of data. Please see the Long Term Data Storage section below for permanent data storage options that offer greater resiliency.
 
{| class="wikitable"
! Name !! Path !! Size !! Relative Performance !! Persistence !! Backed up? !! Purpose
|-
| Scratch || <code>/scratch</code> || 1PB shared || Fastest || '''None, deleted after 60 days''' || No || Fast parallel storage for use during computation
|-
| Node-local || <code>/work</code> || 40GB || Fast || '''None, deleted after 5 days''' || No || Fast storage local to each compute node, globally accessible from <code>/misc/cnXX</code>
|-
| Home || <code>~</code> || 300GB || Fast || Yes || Once per week || Personal storage, available on every node
|-
| Group || <code>/shared</code> || [[:Category:Help|By request]] || Fast || Yes || '''No''' || Short term group storage for collaborative work
|}
  
= Notes =

* Data deletion inside the scratch folder is based on directory modification time. You will get three warnings by email before deletion.
* Certain directories are only mounted on demand by <code>autofs</code>. These directories are: <code>/home</code>, <code>/shared</code>, and <code>/misc/cnXX</code>. Shell commands like <code>ls</code> may fail on them; the directories are mounted only when you access a file under them or <code>cd</code> into the directory structure.
* You can [[recover deleted files|recover files on your own from our backed-up directories]] using snapshots within 2 weeks.
* You can check your [[Cannot write to home directory|home directory quota]]; see the sketch after this list.
* There are read-only datasets available at <code>/scratch/shareddata</code>. More information is available [[Shared_Datasets|on this page]].
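A minimal sketch of how to check your quota and trigger an on-demand mount from a login node. It assumes the standard Linux <code>quota</code> and <code>df</code> utilities are available, and <code><your_group></code> is a placeholder for your group's folder name; the exact output on the cluster may differ.

 # Report your disk usage and quota in human-readable units
 quota -s
 # df shows usage for the file system backing your home directory
 df -h ~
 # autofs paths appear only once you access them; cd into the directory first, then list it
 cd /shared/<your_group>
 ls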
  
= Long Term Data Storage =

Once data is no longer needed for computation, it should be transferred off of the cluster to a permanent data storage location. ''Do not use the scratch file system (/scratch) for long-term storage''; it is optimized for fast parallel access from multiple computers, and is too scarce and too expensive for long-term storage.

If you need more storage than is provided by your /home directory (or /shared directory, for those groups that use them), then use the /archive file system. This is a relatively slow but reliable file system. It is protected by being ''geo-spread'' between three locations (one in Storrs and two in Farmington), and your data can survive the loss of any one location.

You must request /archive storage before you use it. To do so, send an email to [mailto:hpc@uconn.edu?Subject=Request%20for%20HPC%20Archive%20Storage hpc@uconn.edu] requesting an archive folder for yourself; if you need an /archive folder for your group, include the group name.

Once you have obtained access to the archive system, you can transfer data in two ways. The slow way uses the standard Unix utilities (such as cp, tar, etc.) run on the HPC nodes, and is suitable only for small transfers; a sketch is shown below. The fast way uses the [[Globus_Connect|Globus]] service, which is about two to five times faster (depending on system traffic) and should be used for large transfers.
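A minimal sketch of the slow way, using only standard Unix utilities. The paths are placeholders: <code><NetID></code> stands in for your own folder names, and the actual layout of your /archive folder is assigned when access is granted.

 # Copy a finished results directory from scratch into your archive folder
 cp -r /scratch/<NetID>/project_results /archive/<NetID>/
 # Or bundle it first, so fewer (and compressed) files travel over the slow link
 tar -czf project_results.tar.gz /scratch/<NetID>/project_results/
 mv project_results.tar.gz /archive/<NetID>/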
 
 
 
The permanent storage options are summarized below.

{| class="wikitable"
! Name !! Path !! Size !! Relative Performance !! Resiliency !! Purpose
|-
| [http://hpc.uchc.edu/cloud-storage/ Archival cloud storage] || <code>/archive</code> || 3PB shared || Low || Data is distributed across three datacenters between the Storrs and Farmington campuses || This storage is best for permanent archival of data without frequent access. NOTE: Users must request access to this resource by either [http://becat.uconn.edu/support/hpc-support/ creating a ticket] or emailing [mailto:hpc@uconn.edu hpc@uconn.edu].
|-
| [http://uits.uconn.edu/disk-storage-info UITS Research Storage] || [[File_transfer_via_SMB|Use smbclient to transfer files]] || [http://uits.uconn.edu/disk-storage-info By request to UITS] || Moderate || Data is replicated between two datacenters on the Storrs campus || This storage is best used for long-term data storage requiring good performance, such as data that will be accessed frequently for post-analysis.
|-
| Departmental/individual storage || [[File_transfer_via_SMB|Use smbclient to transfer files]] or [[File_transfer_between_hosts|SCP utilities]] || - || - || - || Some departments and individual researchers have their own local network storage options. These can be accessed using [[File_transfer_via_SMB|SMB Client]] or [[File_transfer_between_hosts|SCP utilities]].
|}
 
 
 
= Generating Backups for Storage on /archive/ =
 
 
 
There are several ways to generate and maintain backups, all of which should be stored on /archive/; depending on what you'd like to use the backups for, some options will suit you better than others.

Three of the main factors to weigh are the time needed to generate an archive, to transfer that archive to permanent storage, and to restore from it.

You must also consider whether you only want the most recent backup of the current state, which is enough to make data on /scratch resistant to file system failure, or whether you also want older versions to exist, so that if you accidentally destroy some of your results you can restore them from an earlier backup. We keep backups of the latter kind for the system configuration directories and for your home directories.
 
 
 
The backup process can be automated; please contact us if you need help automating any of the following options.
 
 
 
=== Disk Image Style Backups ===
 
 
 
While you will not be creating an actual disk image, the idea here is to generate an exact copy of the state of all files within a directory at a certain time.
 
 
 
 tar -czf ''backup_name.tar.gz'' ''directory_to_make_backup_of/''
 
 
 
The main issue with generating backups this way is that you have to regenerate the .tar.gz file every time, which involves re-indexing and copying every file. Other methods will still re-index, but they will likely not have to copy all of the files over again.

Even though you are moving extra copies to /archive/ with this technique, it can be useful when cluster usage is low, since you can generate the tar ball in an sbatch script, as sketched below. And, because the tar ball is compressed, it will have an easier time travelling through the low-bandwidth link to /archive/.
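A minimal sketch of such a job script, assuming Slurm's <code>sbatch</code> is used to submit it; the job name, time limit, and paths are placeholders to adjust for your own data.

 #!/bin/bash
 #SBATCH --job-name=backup       # illustrative job name
 #SBATCH --time=02:00:00         # adjust to the size of your data
 #SBATCH --ntasks=1
 # Bundle and compress the directory (paths are placeholders)
 tar -czf results_backup.tar.gz /scratch/<NetID>/project_results/
 # Move the finished tar ball into your archive folder
 mv results_backup.tar.gz /archive/<NetID>/

You would submit it with <code>sbatch backup.sh</code> (the file name is illustrative).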
 
 
 
You should move the archive to /archive/ for storage, and copy it back onto scratch when you'd like to unpack it.
 
You can extract the archive with:
 
 
 tar -xvf ''backup_name.tar.gz''
 
 
 
This will extract the archive into the current working directory, '''overwriting files in the directory with their versions from the backup'''.
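If you want to see what is inside an archive before anything gets overwritten, you can list its contents first. A small example; the member path in the second command is illustrative.

 # List the contents of the archive without extracting anything
 tar -tzf backup_name.tar.gz
 # Extract only a single file or directory from the archive
 tar -xzf backup_name.tar.gz path/inside/archive/results.csv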
 
Check the tar man page for more options.
 
 
 
=== Rsync Backups ===
 
  
The goal of an rsync-based backup is to get the appearance of a mirrored file system, i.e. the source and destination directories look the same.
  
  rsync -avhu ''source_dir/'' ''destination_dir/''
  
Rsync will only update files at the destination that are older than their source counterparts, meaning that even though your first run of rsync will be as slow as (or slower than) the tar command above, future updates with rsync will be faster.
  
Restoring from an rsync backup is simple: you just swap the source and destination arguments, as in the sketch below.
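A minimal sketch of a backup and a restore with rsync, assuming your archive folder lives under /archive; the directory names are placeholders.

 # Mirror a scratch directory into your archive folder
 rsync -avhu /scratch/<NetID>/project_results/ /archive/<NetID>/project_results/
 # Restore by swapping the source and destination arguments
 rsync -avhu /archive/<NetID>/project_results/ /scratch/<NetID>/project_results/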
  
For information on how best to organize your backups, see our page on [[Backing_Up_Your_Data|Backing Up Your Data]].

= Data Transfers using Globus Connect =

You can make large data transfers with a service called [[Globus Connect|Globus Connect]]. This allows you to transfer large data sets between the Storrs HPC cluster and your workstation, or any computer set up as a Globus endpoint. The Globus system is optimized for long-distance data transfers and is particularly useful for sharing data with your collaborators at other institutions.
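Most transfers are driven from the Globus web interface described on the [[Globus_Connect|Globus]] page, but Globus also provides a command-line client. A hedged sketch, assuming the <code>globus-cli</code> package is installed and that you already know the UUIDs of the two endpoints (both are assumptions, not cluster-specific instructions):

 # Log in to Globus (opens a browser-based authentication flow)
 globus login
 # Transfer a directory between two endpoints; UUIDs and paths are placeholders
 globus transfer <source_endpoint_uuid>:/scratch/<NetID>/project_results/ \
     <destination_endpoint_uuid>:/backups/project_results/ --recursive --label "HPC backup"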
