Glossary

From Storrs HPC Wiki

Throughout our documentation wiki you will find many terms related to the use of High Performance Computing (HPC). This glossary will help you better understand the concepts that distinguish HPC from other technologies and scientific instrumentation.

General Definitions

cluster – A collection of technology components – including servers, networks, and storage – deployed together to form a platform for scientific computation.

processor – A server in the HPC environment contains two physical processors. Each processor is sometimes referred to as a socket, chip, or Central Processing Unit (CPU). Each of the two processors contains many individual cores. The two processors are connected to each other by a high-speed bus, as well as to other system components (memory, data storage, networks) that are local to that server.

core – A processor contains multiple cores, each of which can be used to execute instructions from a computational job.

compute node – A compute node has two physical processors, each of which contains a finite number of cores. These cores are the unit of measurement used by the cluster's job scheduler for allocating resources. A node also contains a finite amount of random access memory (RAM), high-speed flash storage, and network interconnects.

GPU – A graphics processing unit (GPU) is a type of processor specially designed to handle computational tasks common in graphically intensive applications, such as highly parallelized matrix and vector operations.

InfiniBand – A high-speed, low-latency network that connects all compute nodes to each other and to data storage. This network is sometimes referred to as a fabric. It allows independent compute nodes to communicate with each other much faster than a traditional network, so computational jobs that span multiple servers can operate more efficiently, often through a technology known as the Message Passing Interface (MPI).


Condo Model

The Storrs High Performance Computing (HPC) cluster is delivered under a business model known as the "condo model". The definitions below explain the terms related to this business model so that researchers have a clear understanding of its implications.

condo model – A financial model for managing HPC resources where faculty researchers fund the capital equipment costs of individual compute nodes, while the university funds the operating costs of running these nodes for five years.

priority compute node – The smallest unit of resources that can be purchased by a researcher is an individual server, known as a compute node, as described above in the General Definitions. The cost of a compute node includes the capital equipment costs associated with deploying that node, including external network and fabric interconnects, power delivery, software licenses, etc.

operating costs – The university centrally operates all compute nodes through University Information Technology Services (UITS) so that researchers can focus on science, not systems management. The costs to run the cluster are split between operating costs and capital costs. Operating costs refer to all non-equipment costs, such as staff salaries and contracts for equipment maintenance.

capital costs – The actual physical equipment that the university purchases to operate HPC, including servers, switches, storage, etc. Most capital equipment is deployed for a period of five years.

priority users – Faculty who purchase compute nodes receive access to equivalent resources at a higher priority than other researchers. Faculty can designate others, such as their graduate students and postdoctoral researchers, to receive access at the same priority level. With priority access, computational jobs are moved to the front of the queuing system and are guaranteed to begin execution within twelve hours. A priority user can utilize their resources indefinitely. All access to resources is managed through the cluster's job scheduler; users do not receive direct access to compute nodes or privileged ("root") access.
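The terminology used here (partitions, fair-share, requeue) matches the Slurm workload manager. Assuming Slurm is the scheduler in use, submitting a job through it might look like the following sketch; the partition name and program are placeholders, not actual cluster values:

```shell
#!/bin/bash
# Hypothetical Slurm batch script -- all names below are illustrative.
#SBATCH --partition=priority_example   # a partition reserved for priority users
#SBATCH --ntasks=24                    # cores requested from the scheduler
#SBATCH --time=12:00:00                # wall-clock time limit

# The scheduler, not the user, decides which compute nodes run this job;
# users never log in to compute nodes directly.
srun ./my_simulation
```

The script would be submitted with `sbatch`, and the scheduler places it in the queue according to the partition's priority.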

open access users – Any user utilizing HPC resources who has not contributed to the purchase of those resources. A computational job submitted by an open access user is placed into a queue until sufficient resources are available. This queue is prioritized by multiple factors, including a "fair-share" score. A user who has priority access to a subset of resources is considered an open access user on all other resources to which they did not contribute.

job prioritization – Multiple factors determine the priority assigned to any given computational job. The factor with the most weight is the partition to which the job is submitted; partitions used by priority users receive the highest priority. The second most important factor is a user's fair-share score, which is assigned to each user by the job scheduler. This score helps to ensure that HPC resources are used equitably by all non-priority users: if a user has executed a disproportionately large number of jobs recently, they may temporarily receive a lower priority than a user who has not executed jobs recently. The third factor is the age of the job; the longer a job sits in the queue, the higher its priority grows. If every job has the same priority, then the scheduling method becomes "first in, first out".
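The weighting described above can be sketched as a simple weighted sum, in the spirit of Slurm's multifactor priority plugin. The weights and normalized factor values below are illustrative only, not the cluster's actual configuration:

```python
# Hypothetical sketch of multifactor job priority.
# Each factor is normalized to [0.0, 1.0]; a higher result is scheduled sooner.
# The weights are illustrative: partition dominates, then fair-share, then age.

def job_priority(partition_factor, fairshare_factor, age_factor,
                 w_partition=1000, w_fairshare=100, w_age=10):
    return (w_partition * partition_factor
            + w_fairshare * fairshare_factor
            + w_age * age_factor)

# A job in a priority partition outranks an open access job even when the
# open access user has a perfect fair-share score and a long queue wait.
priority_job = job_priority(partition_factor=1.0, fairshare_factor=0.5, age_factor=0.0)
open_job = job_priority(partition_factor=0.0, fairshare_factor=1.0, age_factor=1.0)
```

When all other factors are equal, only age differs between jobs, which is what produces the "first in, first out" behavior described above.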

job preemption – There is one partition, named general_requeue, in which jobs may be canceled at any time. If a job in this partition is canceled, the job scheduler will automatically resubmit it. Jobs using this partition should be short-running, or should frequently save their state so that they can resume where they left off when requeued.
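Saving state so a requeued job can resume is usually called checkpointing. A minimal sketch of the idea, with an illustrative file name and a placeholder work loop, might look like this:

```python
# Minimal checkpoint/restart sketch for a requeue-able job.
# The checkpoint file name and the work loop are illustrative placeholders.
import json
import os

CHECKPOINT = "checkpoint.json"

def load_checkpoint():
    """Resume from the last saved step, or start fresh if no checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def run(total_steps):
    step = load_checkpoint()
    while step < total_steps:
        # ... one unit of real work would go here ...
        step += 1
        # Save state after every unit of work so a preempted job loses
        # little progress when the scheduler cancels and requeues it.
        with open(CHECKPOINT, "w") as f:
            json.dump({"step": step}, f)
    return step

run(5)
```

If the job is canceled mid-run and requeued, the rerun reads the checkpoint file and continues from the last completed step instead of starting over.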