Advanced SLURM

From Storrs HPC Wiki
Revision as of 10:42, 20 June 2017 by Lwm14001 (talk | contribs) (Ivy Bridge: cn321-324 lack entries in /etc/hosts causing previous exclusion not to work)

Job arrays

Job arrays allow you to run a single submission file up to 1000 times. You can check the exact limit using scontrol:

 scontrol show config | grep MaxArraySize
 # MaxArraySize            = 1001

Using a job array is equivalent to submitting your job many times, with each run able to see which iteration it is. To use arrays, add one of the following lines to your job script:

# Run only 4 jobs at a time
#SBATCH --array=0-50%4
# Run all possible jobs at once
#SBATCH --array=0-50

SLURM generates an environment variable, $SLURM_ARRAY_TASK_ID, that you can pass to your program to, say, select a different set of inputs. This means your program should have a saved list of input files, and be able to choose a file or set of variables using the number in $SLURM_ARRAY_TASK_ID. You can read more about job arrays in the sbatch manual.
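As a minimal sketch of that idea, the job script below picks one line from a list of input files according to the task ID. The names inputs.txt and my_program are hypothetical placeholders, and the array starts at 1 rather than 0 so that task IDs match line numbers in the list.

```shell
#!/bin/bash
#SBATCH --array=1-50

# Print line number $2 of the list file $1 (lines are numbered from 1).
pick_input() {
    sed -n "${2}p" "$1"
}

# Only act when SLURM has set the task ID, i.e. inside an array job.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    # inputs.txt and my_program are hypothetical placeholders.
    input_file=$(pick_input inputs.txt "$SLURM_ARRAY_TASK_ID")
    ./my_program "$input_file"
fi
```

With this layout, array task 7 would process whatever file is named on line 7 of inputs.txt.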

If you are not running your job in a priority partition, you will find that our limit of 8 concurrent jobs makes job arrays a poor fit for situations where each job has low resource usage. For example, if each of your jobs only needs one CPU core, using job arrays limits you to 8 CPU cores as opposed to your 392-core allowance. In such high-throughput situations you can use GNU Parallel, which we have integrated with SLURM.

If you have a priority account, you have unlimited concurrent jobs and can make full use of job arrays. If you are competing with other users in your group for compute time, you may be better off submitting a job with a larger resource allocation and using GNU Parallel to run your program multiple times across all nodes in a single job. An additional advantage of GNU Parallel is that it allows you to resume your job where you left off using the --joblog option.
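The resume behaviour can be sketched roughly as follows. This is only an illustration that stays on a single node: it does not show our cluster's SLURM integration for GNU Parallel, and inputs.txt, tasks.log and my_program are hypothetical names. The --joblog option records each finished task, and --resume skips tasks already present in that log when the job is run again.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Run my_program (hypothetical) once per line of the list file $1,
# recording finished tasks in the job log $2 so a rerun can skip them.
run_tasks() {
    parallel --joblog "$2" --resume \
        -j "${SLURM_CPUS_PER_TASK:-1}" \
        ./my_program {} :::: "$1"
}

# Only start the work when running inside a SLURM job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_tasks inputs.txt tasks.log
fi
```

Resubmitting the same script after an interruption would then continue with the inputs that are not yet listed in tasks.log.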

Job dependencies

There are several reasons for using job dependencies:

  • Run the same job multiple times, if the program can resume "checkpointed" computations.
  • Run jobs that require data from previous jobs.

Generally, you will find the --dependency=singleton option most useful, which requires the submitted jobs to have the same name:

#SBATCH --job-name=singleton  # Change this name to something more appropriate to your computation
#SBATCH --dependency=singleton

If you really want to use other types of dependencies, you need an additional script that runs sbatch to submit your job, reads the job ID of the submitted job, and applies the dependency to the subsequent job submission. This is more tedious than in other job schedulers, which can use the job name, but SLURM's way is more robust. You can read more about job dependencies in the sbatch manual.
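Such a wrapper script might look like the sketch below. It assumes the afterok dependency type (start only if the previous job succeeded) and uses sbatch's --parsable option so that only the job ID is printed; the function name submit_chain is made up for this example.

```shell
#!/bin/bash
# Submit each script in turn so that it starts only after the previous
# one has finished successfully. Prints the job ID of each submitted job.
# The body runs in a subshell, so its variables do not leak to the caller.
submit_chain() (
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || exit 1
        else
            jobid=$(sbatch --parsable --dependency="afterok:${prev}" "$script") || exit 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
        prev=${jobid%%;*}
        echo "$prev"
    done
)
```

You would call it from a login node as, for example, submit_chain step1.sh step2.sh step3.sh, where the step scripts are your own submission files.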

Resuming jobs

SLURM has a checkpoint/restart feature, which is intended to save a job's state to disk as a checkpoint and later resume from a saved checkpoint. This is done using the BLCR library, which is installed on all our nodes. Intel MPI versions 2013 and later support the BLCR checkpoint/restart library. Intel MPI is available on our cluster as the intelics module; see the Checkpoint/Restart options listed by mpiexec.hydra -h. Checkpointing needs I_MPI_FABRICS=ofa.

We don't have a lot of personal experience using checkpoint/restart. You may need to apply the feature only at the mpirun/mpiexec level, for example, with:


...or instead apply them at the SLURM level with:

#SBATCH --checkpoint

Or apply the options in both places. You will have to see what works for your program. Definitely get in touch with us if you are experimenting with this feature and need help, as the BLCR kernel module saves log files only visible to administrators.

Jobs in the Wild West

If the cluster is particularly busy, and your job can finish in less than a day, you can try running your job in the general_requeue partition. Daily jobs that run in this special partition will kill (or, more accurately, "preempt") any existing jobs there. To re-run your job if it is killed, add the --requeue flag. The --requeue option is also useful for long jobs where a node might fail. If your job fails for any other reason, it will not be requeued; for that type of situation, you would need to set up a job dependency as explained in the Job dependencies section above.
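A job script for this partition could be sketched as below. The time limit is only an example value, and run_mode is a made-up helper; it relies on the SLURM_RESTART_COUNT variable, which SLURM sets once a job has been requeued, so that your program can tell a first run from a rerun.

```shell
#!/bin/bash
#SBATCH --partition=general_requeue
#SBATCH --requeue
# Example value only; choose your own limit of less than a day.
#SBATCH --time=12:00:00

# Print "resume" if SLURM has restarted this job at least once, else "fresh".
run_mode() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo resume
    else
        echo fresh
    fi
}

# Pass the mode on to your own program here, so that a requeued job
# can pick up from its saved state instead of starting over.
echo "Starting in $(run_mode) mode"
```

Whether a rerun can actually continue from where it stopped depends on your program saving its own progress.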

Running jobs at particular times

We don't allow crontab on the login nodes. Instead, you can use the --begin option of sbatch to set the earliest time (and date) at which you want the job to start.
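A few forms that --begin accepts, per the sbatch manual, are sketched here; the dates and times are arbitrary examples, and you would keep only one of these lines in your job script.

```shell
# Start no earlier than an absolute date and time
#SBATCH --begin=2017-07-01T02:00:00
# Start no earlier than 16:00 (today, or tomorrow if 16:00 has passed)
#SBATCH --begin=16:00
# Start no earlier than one hour after submission
#SBATCH --begin=now+1hour
```

The job still waits in the queue as usual once its begin time has passed, so it may start later than the time you give.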

Remotely running commands

If you need to execute a command at a specific time on a login node of the cluster, you can set up SSH keys to log in to the cluster from another computer and run the command from there.

Cancel All of Your Jobs

If you have a large number of jobs to cancel, it would be tedious to cancel them individually with scancel. This short command will cancel all of your jobs, so use it carefully.

scancel -u $USER

Targeting Specific Node Architectures

Some of the SLURM partitions span multiple generations of hardware architecture, specifically: general, serial, and debug. In some circumstances you may want to ensure that your jobs run on a specific node architecture, such as:

  • MPI jobs may perform significantly better using homogeneous nodes which are tightly coupled with InfiniBand
  • Some applications may have compiler optimization flags specific to a CPU's built-in instructions, such as hardware AES encryption
  • You may experience non-deterministic runtimes from one job to another if each runs on a different architecture


  • None of the priority partitions span multiple hardware architectures.
  • We are frequently adding new nodes, and occasionally removing old nodes, so the instructions below may change frequently. If you are experiencing an issue with job targeting, please refer back to this page and ensure that your specific command is still accurate.
  • A comparison chart of the different CPU types on our compute nodes is also available.

SLURM's --exclude parameter is used to target a job at a specific hardware architecture by excluding the nodes of every other architecture.


Broadwell

To target your job at the Broadwell architecture, which has 4 nodes, each with two Intel E5-2699 V4 CPUs and 256GB of RAM:

#SBATCH --exclude=cn[01-324]


Haswell

To target your job at the Haswell architecture, which has 175 nodes, each with two Intel E5-2690 V3 CPUs and 128GB of RAM:

#SBATCH --exclude=cn[01-136,325-328]

Eight of the Haswell nodes have a higher amount of RAM, 192GB. These are available in the general_requeue partition. To target these nodes:

#SBATCH --partition=general_requeue
#SBATCH --exclude=cn[01-256,265-328]

Ivy Bridge

To target your job at the Ivy Bridge architecture, which has 32 nodes, each with two Intel E5-2680 V2 CPUs and 128GB of RAM:

#SBATCH --exclude=cn[01-104,137-320,325-328]

Sandy Bridge

To target your job at the Sandy Bridge architecture, which has 40 nodes, each with two Intel E5-2650 CPUs and 64GB of RAM:

#SBATCH --exclude=cn[01-64,105-320,325-328]