Advanced SLURM

Job arrays

Job arrays allow you to run a single submission file many times; the exact limit is set by the scheduler's MaxArraySize, which you can check using scontrol:

 scontrol show config | grep MaxArraySize
 # MaxArraySize            = 10001

Using job arrays is equivalent to submitting your job many times, with each submission knowing which iteration it is running. To use arrays, add one line to your job:

# Run only 4 jobs at a time
#SBATCH --array=0-50%4
# Run all possible jobs at once
#SBATCH --array=0-50

SLURM generates an environment variable $SLURM_ARRAY_TASK_ID that you can pass on to your program to, say, use different sets of inputs. This means your program should have a saved list of input files, and be able to choose a file or set of variables based on the number in $SLURM_ARRAY_TASK_ID. You can read more about job arrays in the sbatch manual.
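
As a rough sketch (the file names and program below are placeholders, not part of our setup), a job script might index into a list of input files with $SLURM_ARRAY_TASK_ID:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --array=0-50%4
# inputs.txt is a hypothetical file listing one input file name per line.
# Task IDs start at 0, so add 1 to get the matching line number.
INPUT=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" inputs.txt)
./my_program "$INPUT"    # my_program stands in for your own executable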

If you are not running your job in a priority partition, you will find that our 8 concurrent job limit makes job arrays a poor fit for situations where each job has low resource usage. For example, if each of your jobs only needs one CPU core, using job arrays limits you to 8 CPUs at a time as opposed to your 392 core allowance. In such high throughput situations you can use GNU Parallel, which we have integrated with SLURM.

If you have a priority account, you have unlimited concurrent jobs and can make full use of job arrays. If you are competing with other users in your group for compute time, you may be better off submitting a job with a larger resource allocation and using GNU Parallel to run your program multiple times across all nodes in a single job. An additional advantage of GNU Parallel is that it allows you to resume your job where you left off using the --joblog option.
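
A minimal single-node sketch of the GNU Parallel approach (the module name and tasks.txt file are assumptions, not part of our documented setup); the --joblog and --resume options let a resubmitted job skip commands that already finished:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
module load parallel    # the module name may differ; check module avail
# tasks.txt holds one command per line. --joblog records each completed
# command, and --resume makes a resubmitted job skip those entries.
parallel -j "$SLURM_CPUS_PER_TASK" --joblog runtask.log --resume < tasks.txt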

Job dependencies

There are several reasons for using job dependencies:

  • Run the same job multiple times, if the program can resume "checkpointed" computations.
  • Run jobs that require data from previous jobs.

Generally, you will find the --dependency=singleton option most useful, which requires the submitted jobs to have the same name:

#SBATCH --job-name=singleton  # Change this name to something more appropriate to your computation
#SBATCH --dependency=singleton

If you really want to use other types of dependencies, you need an additional script that runs sbatch to submit your job, reads the job ID of the submitted job, and applies the dependency to the subsequent job submission (a sketch is shown below). This is more tedious than with other job schedulers, which can use the job name, but SLURM's way is more robust. You can read more about job dependencies in the sbatch manual.
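
As a minimal sketch of such a script (first.sh and second.sh are placeholder job scripts), sbatch's --parsable option prints just the job ID so it can be captured and passed to --dependency:

#!/bin/bash
# Submit the first job and capture its job ID (--parsable prints only the ID).
jobid=$(sbatch --parsable first.sh)
# Submit the second job; afterok means it starts only if the first one succeeds.
sbatch --dependency=afterok:"$jobid" second.sh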

Resuming jobs

SLURM has a checkpoint/restart feature that saves a job's state to disk as a checkpoint and resumes from a saved checkpoint. This is done using the BLCR library, which is installed on all our nodes. Intel MPI versions 2013 and later support the BLCR checkpoint/restart library. Intel MPI is available on our cluster as the intelics module, so see the Checkpoint/Restart options in the output of mpiexec.hydra -h; note that this requires I_MPI_FABRICS=ofa.

We don't have a lot of personal experience using checkpoint/restart. You may need to apply the feature only at the mpirun/mpiexec level, for example, with:

 mpiexec.hydra --YOUR_OPTIONS_HERE YOUR_PROGRAM

...or instead apply them at the SLURM level with:

#SBATCH --checkpoint
#SBATCH --checkpoint-dir

Or apply the options in both places. You will have to see what works for your program. Definitely get in touch with us if you are experimenting with this feature and need help, as the BLCR kernel module saves log files that are only visible to administrators.

Jobs in the Wild West

If the cluster is particularly busy, and your job can finish in less than a day, you can try running it in the general_requeue partition. Jobs run daily in this special partition will kill (or, more accurately, "preempt") any existing jobs there. To have your job re-run if it is killed, add the --requeue flag. The --requeue option is also useful for long jobs where a node might fail. If your job fails for any other reason, it will not be requeued; for that type of situation, you would need to set up a job dependency as explained in the previous section.
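
A minimal sketch of such a submission (the wall time and program name are illustrative):

#!/bin/bash
#SBATCH --partition=general_requeue
#SBATCH --requeue               # put the job back in the queue if it is preempted
#SBATCH --time=12:00:00         # illustrative limit; keep it under a day
./my_program                    # placeholder for your own executable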

Running jobs at particular times

We don't allow crontab on the login nodes. Instead, you can use the --begin option in sbatch to set the specific time (and date) at which you want the job to start.
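
For example (the times shown are placeholders), --begin accepts both absolute and relative times:

# Start at a specific date and time:
#SBATCH --begin=2025-01-15T03:00:00
# Or start a fixed delay after submission:
#SBATCH --begin=now+1hour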

Remotely running commands

If you need to execute a command at a specific time on a login node of the cluster, you can set up SSH keys to log in to the cluster from another computer and run the command.
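
For example, assuming your SSH keys are in place (the login hostname and script path here are placeholders), a cron job on your own machine could submit work remotely:

# Submit a job on the cluster from another machine at a time of your choosing
ssh netid@login.storrs.hpc.uconn.edu "sbatch ~/jobs/nightly.sh"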

Cancel All of Your Jobs

If you have a large number of jobs to cancel, it would be tedious to cancel them individually with scancel. This short command will cancel all of your jobs, so use it carefully.

scancel -u $USER
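
scancel can also filter by job state; for example, to clear only your jobs that have not started yet:

scancel -u $USER --state=PENDING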

Submitting a Collection of Programs to Run

If you would like to submit a job that is a collection of completely discrete program runs, you can use srun. By discrete, we mean that the runs can execute simultaneously without interfering with one another. srun will run the program once for each task, meaning that if your tasks are distributed over multiple nodes, your program will run on those other nodes as well as the main node.

If you have a single threaded/non-MPI simulation which you'd like to run some number of times you can do:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=30
srun -lE hostname

Here the -E tells srun to use the ntasks and nnodes specification from the sbatch allocation, and the -l prepends each output line with its task number. To run your own program, replace hostname with the program you wish to run.

Another common use case is to run multiple instances of a program, but with different arguments. For this, srun provides the --multi-prog flag, which you can use to provide essentially any argument and program combination. --multi-prog takes a mapping file as its argument, not a command:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=30
echo "__MP1__"
srun -E -l --multi-prog MP1
echo "__MP2__"
srun -E -l --multi-prog MP2

The mapping files MP1 and MP2 for the above commands are:

# MP1
*    echo %t is the task number
# MP2
0    hostname
1    echo task number %t
2-9  echo offset: %o
10-29 echo "This is just another program to run"

Within the upper block, which is the sbatch script, you can see that both multi-prog files are called from within sbatch's resource allocation. In the multi-prog files, the first column is used to match task numbers. When you run in multi-prog mode, srun generates ntasks tasks, each with its own task number, so if you specify 30 tasks you'll get 30 tasks with numbers 0, 1, 2, ..., 29. If a task's number matches the number in the left column, it will run the command in the right column.

In the first mapping file, MP1, the * is a special character which matches all task numbers, so every task runs the command echo. echo is passed the argument %t, which is replaced with the task number at run time. Here we are only printing the task number, but you could use it to select an item from a list of jobs to do, a file name, or even a starting value for a variable. Using the task number as a variable's value can be extremely useful, as it allows you to take a for loop and execute each step independently by assigning the loop variable from a command line argument. In bash this would look something like:

#!/bin/bash
# Original script
for i in {0..23}
do
    echo $i
done

#!/bin/bash
# Modified script, named new.bash
i=$1     # $1 is the first command line argument
echo $i

# Mapping file MP.new
* ./new.bash %t

Then, after making new.bash executable (chmod +x new.bash), you can deploy it with:

srun -n 24 -N 2 -p general --multi-prog MP.new

Be aware that doing things in parallel results in out-of-order execution, so the loop iterations cannot depend on values from previous iterations.

Another thing to be aware of with multi-prog is that you should avoid scheduling programs with significantly different run times together, as doing so delays your own and others' start times: scheduling a short program and a long program together forces the short one to hold onto its resource allocation until the long one finishes, leaving resources idle.

Targeting Specific Node Architectures

Some of the SLURM partitions span multiple generations of hardware architecture, specifically: general, serial, and debug. In some circumstances you may want to ensure that your jobs run on a specific node architecture, such as:

  • MPI jobs may perform significantly better using homogeneous nodes which are tightly coupled with Infiniband
  • Some applications may have compiler optimization flags specific to a CPU's built-in instructions, such as hardware AES encryption
  • You may experience non-deterministic runtimes from one job to another if each runs on a different architecture

Notes:

  • None of the priority partitions span multiple hardware architectures.
  • We are frequently adding new nodes, and occasionally removing old nodes, so the instructions below may change frequently. If you are experiencing an issue with job targeting, please refer back to this page and ensure that your specific command is still accurate.
  • See the comparison chart of the different CPU types on our compute nodes.

SLURM's --exclude parameter is used to target a given job to a specific hardware architecture.

Broadwell

To target your job at the Broadwell architecture, which has 4 nodes, each with two Intel E5-2699 V4 CPUs and 256GB of RAM:

#SBATCH --exclude=cn[65-320,329-343,345-353,355-358,360-364,369-398,400],gpu[07-10]

Haswell

To target your job at the Haswell architecture, which has 175 nodes, each with two Intel E5-2690 V3 CPUs and 128GB of RAM:

#SBATCH --exclude=cn[65-136,325-343,345-353,355-358,360-364,369-398,400],gpu[07-10]

Eight of the Haswell nodes have a higher amount of RAM, 192GB. These are available in the general_requeue partition. To target these nodes:

#SBATCH --partition=general_requeue
#SBATCH --exclude=cn[65-256,265-320,325-343,345-353,355-358,360-364,369-398,400],gpu[07-10]

Ivy Bridge

To target your job at the Ivy Bridge architecture, which has 32 nodes, each with two Intel E5-2680 V2 CPUs and 128GB of RAM:

#SBATCH --exclude=cn[65-104,137-320,325-343,345-353,355-358,360-364,369-398,400],gpu[07-10]

Sandy Bridge

To target your job at the Sandy Bridge architecture, which has 40 nodes, each with two Intel E5-2650 CPUs and 64GB of RAM:

#SBATCH --exclude=cn[105-320,325-343,345-353,355-358,360-364,369-398,400],gpu[07-10]

Skylake

To target your job at the Skylake architecture, which has 14 nodes, each with two Intel Xeon Gold-6150 CPUs and 192GB of RAM:

#SBATCH --exclude=cn[65-136,153-256,265-320,325-328]

Viewing specific job ranks and fair share scores

sranks

To view specific job rank details for a single user:

sranks | grep netidofuser