SLURM Guide

To optimally and fairly use the cluster, all application programs must be run using the job scheduler, SLURM. When you use SLURM's sbatch command, your application program gets submitted as a "job". Please do not run application programs directly from the command-line when you connect to the cluster. Doing so may slow down performance for other users, and your commands will be automatically throttled or terminated. To better understand how applications get submitted as jobs, let's review the difference between login nodes and compute nodes.

Login node
When you connect to the cluster and see [<YourNetID>@cn01 ~], you are on a single computer shared with all your fellow users, known as the "login node". The purpose of the login node is for you to submit jobs, copy data, edit programs, etc. The programs that are allowed to run on the login nodes are listed in our usage policy.
Compute nodes
These computers do the heavy lifting of running your programs. However, you do not interact with compute nodes directly. Instead, you ask the scheduler (SLURM) for compute nodes to run your application program, and SLURM finds available compute nodes and runs your application program on them.

Job Submission

Your applications are submitted to SLURM using the sbatch command. This is known as submitting a job. The sbatch command takes as an argument a script describing the resources to be allocated and the actual executable to be run on the cluster. The script can be as specific or general as desired, with each argument on a new line preceded by #SBATCH. The script must start with #!/bin/bash.

First, login to the cluster:

$ ssh NetID@login.storrs.hpc.uconn.edu

Use nano or your favorite text editor to create your job submission script. Here is a very simple job example:

[NetID@login1 ~]$ nano myJob.sh
#!/bin/bash
#SBATCH --ntasks=1    # Job only requires 1 CPU core
#SBATCH --time=5      # Job should run for no more than 5 minutes
echo "Hello, World"   # The actual command to run

Save your submission script and then submit your job:

[NetID@login1 ~]$ sbatch myJob.sh 
Submitted batch job 279934

You can view the status of your job with the sjobs command, as described later in this guide. The output of your job will be in the current working directory in a file named slurm-JobID.out, where JobID is the number returned by sbatch in the example above.

[NetID@login1 ~]$ ls *.out
slurm-279934.out

[NetID@login1 ~]$ cat slurm-279934.out 
Hello, World

Job Examples

The HPC cluster is segmented into groups of identical resources called Partitions. All jobs submitted to the cluster run within one of these Partitions. If you do not select a Partition explicitly the scheduler will put your job into the default Partition, which is called general. Each Partition has defined limits for job runtime and core usage, with specific details available on the usage policy page. You can view a list of all partitions and their status by running the sinfo command.
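For example, passing sinfo the -s flag prints a one-line summary of each partition, including its availability, time limit, and node counts:

$ sinfo -s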

Below are multiple examples for how to submit a job in different scenarios.

Default Partition (general)

The default Partition is named general and is meant for broad use cases (MPI jobs, serial jobs, interactive jobs, etc.). It spans multiple generations of compute nodes. So, please be advised that if you request more than one node, you may receive nodes with different configurations (cores, memory, etc.). Please read the Advanced SLURM Guide for examples of how to ensure your jobs run on a specific node architecture. This partition allows you to request up to 192 cores, and run for up to 12 hours.

The example job below requests 48 CPU cores for two hours, and emails the specified address upon completion:

#!/bin/bash
#SBATCH --partition=general                   # Name of partition
#SBATCH --ntasks=48                           # Request 48 CPU cores
#SBATCH --time=02:00:00                       # Job should run for up to 2 hours (for example)
#SBATCH --mail-type=END                       # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu      # Destination email address

myapp --some-options path/to/app/parameters   # Replace with your application's commands
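As noted above, the general partition spans several node generations. If your job is sensitive to which generation it lands on, sbatch's --constraint option can restrict it to nodes tagged with a particular feature. The feature name below is only a placeholder; the names valid on this cluster are covered in the Advanced SLURM Guide:

#SBATCH --constraint=haswell                  # Placeholder feature name; see the Advanced SLURM Guide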

General Use Partition, with Requeue (general_requeue)

The general_requeue partition makes available the unused compute cycles of partitions reserved for specialty use cases. It currently features 13 nodes from the Haswell architecture (312 cores). Jobs in this partition may be killed at any time, but they are then automatically requeued. As a result, this partition often has available resources when others are full. If your job is designed to be able to restart at any point in its execution, this partition is a great way to access available resources. This partition allows you to request up to 192 cores, and run for up to 12 hours.

The example job below requests 48 CPU cores for two hours, and emails the specified address upon completion or upon being requeued:

#!/bin/bash
#SBATCH --partition=general_requeue           # Name of Partition
#SBATCH --requeue                             # Allow the job to be requeued, if preempted
#SBATCH --ntasks=48                           # Maximum CPU cores for job
#SBATCH --nodes=2                             # Ensure all cores are from whole nodes
#SBATCH --time=02:00:00                       # Job should run for up to 2 hours (for example)
#SBATCH --mail-type=END,REQUEUE               # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu      # Destination email address

myapp --some-options path/to/app/parameters   # Replace with your application's commands
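A requeued job starts again from the top of its submission script, so the script itself should detect any prior progress. Here is a minimal sketch of that pattern; myapp, its --resume flag, and checkpoint.dat are hypothetical stand-ins for whatever checkpoint mechanism your application actually provides:

if [ -f checkpoint.dat ]; then
    myapp --resume checkpoint.dat                  # Hypothetical: continue from the last checkpoint
else
    myapp --some-options path/to/app/parameters    # First run: start from the beginning
fi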

Test/debug Partition (debug)

The Partition named debug is optimized for very short running jobs for the purpose of testing, debugging, or compiling. It spans multiple generations of compute nodes. Please read the Advanced SLURM Guide for examples of how to ensure your jobs run on a specific node architecture. This Partition allows you to request up to 24 cores, and run for up to 30 minutes.

#!/bin/bash
#SBATCH --partition=debug                     # Name of Partition
#SBATCH --ntasks=4                            # Maximum CPU cores for job
#SBATCH --nodes=1                             # Ensure all cores are from the same node
#SBATCH --time=5                              # Job should run for up to 5 minutes (for example)
#SBATCH --mail-type=END                       # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu      # Destination email address

myapp --some-options path/to/app/parameters   # Replace with your application's commands

MPI-optimized Partition (parallel)

The Partition named parallel is optimized for jobs requiring multiple nodes that communicate with MPI. All of the nodes in this Partition are configured with identical resources (cores, memory, etc.). This Partition allows you to request up to 384 cores, and run for up to 6 hours.

#!/bin/bash
#SBATCH --partition=parallel                         # Name of Partition
#SBATCH --ntasks=240                                 # Request 240 CPU cores
#SBATCH --time=01:30:00                              # Job should run for up to 1.5 hours (for example)
#SBATCH --mail-type=END                              # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu             # Destination email address

mpirun myapp --some-options path/to/app/parameters   # Replace with your application's commands
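With most MPI installations, mpirun detects the number of ranks to launch from the SLURM allocation. If yours does not, you can pass the count explicitly using the SLURM_NTASKS environment variable that SLURM sets for every job:

mpirun -np $SLURM_NTASKS myapp --some-options path/to/app/parameters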

Single Node Partitions (serial)

The Partition named serial is optimized for small jobs requiring from 1 to 24 cores. It is not appropriate for MPI jobs. This partition may span multiple generations of compute nodes. Please read the Advanced SLURM Guide for examples of how to ensure your jobs run on a specific node architecture.

The serial partition allows you to request up to 24 cores, and run for up to 7 days.

The example below uses the serial partition:

#!/bin/bash
#SBATCH --partition=serial                    # Name of Partition
#SBATCH --ntasks=24                           # Maximum CPU cores for job
#SBATCH --nodes=1                             # Ensure all cores are from the same node
#SBATCH --time=02-00:00:00                    # Job should run for up to 2 days (for example)
#SBATCH --mail-type=END                       # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu      # Destination email address

myapp --some-options path/to/app/parameters   # Replace with your application's commands

Run one multi-threaded program per node

In some cases you need to spawn one multi-threaded program on each node and give it access to all of the CPU cores on that node. You can use srun to launch one task per node, and the --cpu_bind=boards option ensures that each task is bound to all of the CPUs of its node:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=30

srun \
    --nodes $SLURM_NNODES \
    --ntasks $SLURM_NNODES \
    --cpu_bind=boards \
    sh -c 'echo $(hostname -s):$(nproc)'

The value reported by nproc is the number of CPUs that cgroups makes available to the job on that node.

Of course, replace the sh -c ... segment with your multi-threaded program's command.
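As a concrete sketch, the same srun invocation running a hypothetical multi-threaded program looks like the following; myapp and its --threads option are placeholders for your application's own command, and the sh -c wrapper is kept so that $(nproc) is evaluated on each node:

srun \
    --nodes $SLURM_NNODES \
    --ntasks $SLURM_NNODES \
    --cpu_bind=boards \
    sh -c 'myapp --threads $(nproc) path/to/app/parameters'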

Interactive Jobs

If you require an interactive job, use the fisbatch command instead of sbatch. This command does not use a submission script. Instead, all of the options from the submission script are given on the command line, without the #SBATCH keyword.

For example:

$ fisbatch --ntasks=12 --nodes=1        # Starts an interactive shell using 12 CPUs on 1 node

NOTE: please don't forget to exit when you finish your job. And, although many programs have a graphical interface, we recommend that all jobs use a command-line interface if supported by the application.

To use a custom partition with interactive jobs, specify the --partition parameter:

$ fisbatch --partition=serial ...

Re-attaching an Interactive Job

If you suddenly lose the connection to your interactive screen, you can try the following commands to re-attach.

NOTE: re-attaching is not guaranteed to succeed every time. If it fails, the interactive job is no longer accessible; please cancel it with scancel.

First you need to get the JobID of the FISBATCH job:

$ sjobs
       JobID  Partition        QOS    JobName      User              Submit      State    Elapsed   NNodes      NCPUS        NodeList               Start
------------ ---------- ---------- ---------- --------- ------------------- ---------- ---------- -------- ---------- --------------- -------------------
890246          general    general   FISBATCH  tix11001 2017-04-23T14:42:29    RUNNING   00:04:37        1         16            cn69 2017-04-23T14:42:30  None

Then, you can re-attach the job by JobID:

$ fisattach 890246
FISBATCH -- attaching for JOBID 890246
!
FISBATCH -- Connecting to head node (cn69)

Specify a Job Runtime

It is always recommended to specify a job time limit using the --time parameter in your submission script. If you do not specify this parameter, the scheduler assumes that your job will take the maximum allowable time for the Partition on which you're running. For example, if SLURM calculates that a given compute node will be idle for four hours, and your job specifies --time=02:00:00, then your job will be allowed to run. If you didn't specify this parameter with your submission, then your job would continue to wait for available resources. If you do specify this parameter but your job doesn't complete within the specified time, your job will be cancelled with a state of TIMEOUT.

Here are some examples for using the --time parameter:

#SBATCH --time=01-02:03:04    # Job will run at most 1 day, 2 hours, 3 minutes, and 4 seconds
#SBATCH --time=01:02:03       # Job will run at most 1 hour, 2 minutes, and 3 seconds
#SBATCH --time=01:02          # Job will run at most 1 minute and 2 seconds
#SBATCH --time=1              # Job will run at most 1 minute
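If you are not sure of a Partition's maximum allowed runtime, sinfo can print it alongside each partition name using an output format string:

$ sinfo -o "%P %l"    # Partition name and its time limit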

You can learn more details about this feature in the sbatch manual:

$ man sbatch

Then search for --time.

Checking the Status of a Job

To view your active jobs:

sjobs              # View all jobs
sjobs -j {JOBID}   # View a specific job

Alternatively, the squeue command may be more descriptive for jobs in a PENDING state:

squeue -u `whoami`
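For jobs that are still PENDING, squeue can also report the scheduler's estimated start time (the estimate may be blank if one has not been computed yet):

squeue -u `whoami` --start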

To view your job history:

shist              # View all jobs
shist -j {JOBID}   # View a specific job

To view all the jobs in the cluster:

squeue -a

To review the hosts usage:

sinfo

To review all the job logs:

slogs

How to Terminate a Job

To terminate a single job:

scancel {JOBID}

To terminate all of your jobs:

scancel -u `whoami`
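To terminate only your jobs in a particular state, for example those still pending, scancel accepts a state filter:

scancel -u `whoami` --state=PENDING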

Priority Users

If you have been granted access to priority resources through our condo model, then you need to submit your jobs to a custom Partition in order to avoid the standard resource limits.

For example:

#SBATCH --partition=HaswellPriority

The table below lists the different priority Partitions, and the faculty who have access.

Partition Name        Faculty
WestmerePriority      Wagstrom
SandyBridgePriority   Astitha, Dongare, Nakhmanson
IvyBridgePriority     Astitha, May
HaswellPriority       Brown, Li, Anagnostou, Zhao, Rajasekaran, Ramprasad, Alpay, Dept. of Economics
Haswell192Priority    Dongare
Broadwell44Priority   Mellor