Advanced SLURM
Job arrays
Job arrays allow you to run a single submission file many times, up to the cluster's configured limit.
You can check the exact limit using scontrol:
scontrol show config | grep MaxArraySize # MaxArraySize = 10001
Using a job array is equivalent to submitting your job many times and checking which iteration is currently running. To use arrays, add one line to your job:
# Run only 4 jobs at a time
#SBATCH --array=0-50%4

# Run all possible jobs at once
#SBATCH --array=0-50
SLURM generates an environment variable, $SLURM_ARRAY_TASK_ID, that you can pass on to your program to, say, use a different set of inputs for each task. This means your program should have a saved list of input files, and be able to choose a file or set of variables based on the number set in $SLURM_ARRAY_TASK_ID.
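As a minimal sketch (the input file names and my_program are placeholders, not part of our setup), an array job that uses the task ID to pick one input file per task might look like:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --array=0-50%4
# Each array task picks the input file matching its own index,
# e.g. input_0.txt for task 0, input_1.txt for task 1, and so on.
INPUT="input_${SLURM_ARRAY_TASK_ID}.txt"
./my_program "$INPUT"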
You can read more about job arrays in the sbatch manual.
If you are not running your job in a priority partition, you will find that our 8 concurrent job limit makes job arrays a poor fit for situations where each job has low resource usage. For example, if each of your jobs only needs one CPU core, using job arrays limits you to only using 8 CPUs as opposed to your 392 core allowance. In such high throughput situations you can use GNU Parallel, which we have integrated with SLURM.
If you have a priority account,
you have unlimited concurrent jobs
and can make full use of job arrays.
If you are competing with other users in your group for compute time,
you may be better off submitting a job with a larger resource allocation and
using GNU Parallel to run your program multiple times across all nodes in a single job.
An additional advantage of GNU Parallel is that it allows you to resume your job where you left off using the --joblog option.
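As a minimal sketch of plain GNU Parallel (not our site-specific integration; my_program and inputs.txt are placeholders), a resumable run might look like:

# Run my_program once per line of inputs.txt, recording progress in runs.log.
# If the job is resubmitted, --resume skips the lines that already completed.
parallel --joblog runs.log --resume ./my_program {} :::: inputs.txt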
Job dependencies
There are several reasons for using job dependencies:
- Run the same job multiple times, if the program can resume "checkpointed" computations.
- Run jobs that require data from previous jobs.
Generally, you will find the --dependency=singleton
option most useful,
which requires the submitted jobs to have the same name:
# Change this name to something more appropriate to your computation
#SBATCH --job-name=singleton
#SBATCH --dependency=singleton
If you really want to use other types of dependencies,
you would need to use an additional script to run sbatch
to submit your job,
read the job ID from the submitted job,
and apply the dependency to the subsequent job submission.
This is more tedious than other job schedulers, which can use the job name,
but SLURM's way is more robust.
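As a minimal sketch (first_step.sh and second_step.sh are placeholder job scripts), such a wrapper could capture the job ID with --parsable and use it in an afterok dependency:

#!/bin/bash
# Submit the first job and capture its numeric job ID.
JOBID=$(sbatch --parsable first_step.sh)
# Submit the second job, which will only start after the first finishes successfully.
sbatch --dependency=afterok:${JOBID} second_step.sh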
You can read more about job dependencies in the sbatch manual.
Resuming jobs
SLURM has a checkpoint/restart feature
which is intended to save a job state to disk as a checkpoint
and resume from a saved checkpoint.
This is done using the BLCR library
which is installed on all our nodes.
Intel MPI versions 2013 and later support the BLCR checkpoint/restart library. Intel MPI is available on our cluster as the intelics module; see the Checkpoint/Restart options listed by mpiexec.hydra -h. Note that this requires I_MPI_FABRICS=ofa.
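For example, a minimal sketch for browsing those options (assuming the intelics module is the one providing mpiexec.hydra):

module load intelics
export I_MPI_FABRICS=ofa
mpiexec.hydra -h | less   # look for the Checkpoint/Restart section of the help output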
We don't have a lot of personal experience using checkpoint/restart. You may need to apply the feature only at the mpirun/mpiexec level, for example, with:
mpiexec.hydra --YOUR_OPTIONS_HERE YOUR_PROGRAM
...or instead apply them at the SLURM level with:
#SBATCH --checkpoint
#SBATCH --checkpoint-dir
Or apply the options in both places; you will have to see what works for your program. Definitely get in touch with us if you are experimenting with this feature and need help, as the BLCR kernel module saves log files that are only visible to administrators.
Jobs in the Wild West
If the cluster is particularly busy, and your job can finish in less than a day, you can try running your job in the general_requeue partition. Daily jobs are run in this special partition, and they will kill (or, more accurately, "preempt") any existing jobs there. To re-run your job if it is killed, you can add the --requeue flag. The --requeue option is also useful for long jobs where a node might fail.
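For example, a minimal requeueable submission to this partition might include:

#SBATCH --partition=general_requeue
#SBATCH --requeue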
If your job fails for any other reason, it will not be requeued;
for that type of situation,
you would need to set up a job dependency as explained in the previous section.
Running jobs at particular times
We don't allow crontab on the login nodes.
You can use the --begin option in sbatch to specify the date and time at which you want the job to start.
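For example (the times and the job script name here are only placeholders):

# Start the job no earlier than four hours from now
sbatch --begin=now+4hours my_job.sh
# Or, inside the job script, request a specific date and time
#SBATCH --begin=2019-06-01T03:00:00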
Remotely running commands
If you need to execute a command at a specific time on a login node of the cluster, you can set up SSH keys to log in to the cluster from another computer and run the command.
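As a minimal sketch (the host name and job script are placeholders), such a command, run for example from a crontab on your own computer, could look like:

# Submit a job on the cluster from another machine over SSH (requires SSH keys)
ssh netid@login.cluster.example.edu "sbatch ~/jobs/nightly_job.sh"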
Cancel All of Your Jobs
If you have a large number of jobs to cancel, it would be tedious to cancel them individually with scancel. This short command will cancel all of your jobs, so use it carefully.
scancel -u $USER
Submitting a Collection of Programs to Run
If you would like to submit a job which is a collection of completely discrete program runs, you can use srun. By discrete, we mean that they could run simultaneously without any problems. Using srun will run the program once for each task. This means that if your tasks are distributed over multiple nodes, your program will be run on those other nodes as well as the main node.
If you have a single threaded/non-MPI simulation which you'd like to run some number of times you can do:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=30
srun -lE hostname
The -E flag tells srun to use the ntasks and nnodes specification from the sbatch allocation, and the -l flag prepends each output line with its task number. To run your own program, replace hostname with the program you wish to run.
Another common use case is to run multiple instances of a program, but with different arguments. For this, srun provides the --multi-prog flag, which you can use to provide essentially any argument and program combination. --multi-prog takes a mapping file as its argument, not a command:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=30
echo "__MP1__"
srun -E -l --multi-prog MP1
echo "__MP2__"
srun -E -l --multi-prog MP2
The mapping files MP1 and MP2 for the above commands are:
# MP1
*      echo %t is the task number
# MP2
0      hostname
1      echo task number %t
2-9    echo offset: %o
10-29  echo "This is just another program to run"
Within the upper block, which is the sbatch script, you can see both multi-prog files are called from within sbatch's resource allocation. In the multi-prog files, the first column is used to match task numbers. When you run in multi-prog mode, it will generate ntasks tasks, each with its own task number, so if you specify 30 tasks you'll get 30 tasks with numbers 0,1,2...29. If an individual task matches the number in the left column, it will run the command in the right column.
In the first mapping file, MP1, the * is a special character which matches all task numbers and then assigns all tasks to run the command echo. echo is passed the argument %t, which is replaced with the task number at run time. Here we are only printing the task number, but you could use it to select an item from a list of jobs to do, a file name, or even a starting value for a variable. Using the task number as a variable's value can be extremely useful, as it allows you to take for loops and execute each step independently by assigning the loop variable to one of the command line arguments. In bash this would look something like:
#!/bin/bash
# Original script
for i in {0..23}
do
    echo $i
done
#!/bin/bash
# Modified script named: new.bash
i=$1   # The $1 means first command line argument
echo $i
# MP.new
*      ./new.bash %t
Then, after making new.bash executable, you can deploy it with:
srun -n 24 -N 2 -p general --multi-prog MP.new
You should be aware by now that doing things in parallel will result in out of order execution, so the for loop cannot depend on previous values.
Another thing to be aware of with multi-prog is that you should avoid scheduling programs with significantly different run times together, as it will delay your start times and other users' start times: scheduling a short program and a long program together forces the short one to hold onto its resource allocation until the long one finishes, leaving resources idle.
Targeting Specific Node Architectures
Some of the SLURM partitions span multiple generations of hardware architecture, specifically: general, serial, and debug. In some circumstances you may want to ensure that your jobs run on a specific node architecture, such as:
- MPI jobs may perform significantly better using homogeneous nodes which are tightly coupled with Infiniband
- Some applications may have compiler optimization flags specific to a CPU's built-in instructions, such as hardware AES encryption
- You may experience non-deterministic runtimes from one job to another if each runs on a different architecture
Notes:
- None of the priority partitions span multiple hardware architectures.
- We are frequently adding new nodes, and occasionally removing old nodes, so the instructions below may change frequently. If you are experiencing an issue with job targeting, please refer back to this page and ensure that your specific command is still accurate.
- Click here to view a comparison chart of the different CPU types on our compute nodes.
SLURM's --exclude parameter is used to target a given job to a specific hardware architecture.
Broadwell
To target your job at the Broadwell architecture, which has 4 nodes, each with two Intel E5-2699 V4 CPUs and 256GB of RAM:
#SBATCH --exclude=cn[65-320,329-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
Haswell
To target your job at the Haswell architecture, which has 175 nodes, each with two Intel E5-2690 V3 CPUs and 128GB of RAM:
#SBATCH --exclude=cn[65-69,71-136,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
Eight of the Haswell nodes have a higher amount of RAM, 192GB. These are available in the general_requeue
partition. To target these nodes:
#SBATCH --partition=general_requeue
#SBATCH --exclude=cn[65-69,71-256,265-320,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
Ivy Bridge
To target your job at the Ivy Bridge architecture, which has 32 nodes, each with two Intel E5-2680 V2 CPUs and 128GB of RAM:
#SBATCH --exclude=cn[65-69,71-104,137-320,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
Sandy Bridge
To target your job at the Sandy Bridge architecture, which has 40 nodes, each with two Intel E5-2650 CPUs and 64GB of RAM:
#SBATCH --exclude=cn[105-320,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
Skylake
To target your job at the Skylake architecture, which has 14 nodes, each with two Intel Xeon Gold-6150 CPUs and 192GB of RAM:
#SBATCH --exclude=cn[65-136,153-256,265-320,325-328]
sranks
To view specific job rank details for a single user:
sranks | grep netidofuser