Difference between revisions of "Advanced SLURM"
Revision as of 10:44, 20 June 2017
Job arrays allow you to run a single submission file up to 1000 times.
You can check the exact limit using
scontrol show config | grep MaxArraySize # MaxArraySize = 1001
Using a job array is equivalent to submitting your job many times, with each run able to see which iteration it is. To use arrays, add one line to your job:
# Run only 4 jobs at a time
#SBATCH --array=0-50%4
# Run all possible jobs at once
#SBATCH --array=0-50
SLURM generates an environment variable,
SLURM_ARRAY_TASK_ID,
that you can pass on to your program to, say, use a different set of inputs.
This means your program should have a saved list of inputs or files,
and be able to choose a file or set of variables with the number set by
SLURM_ARRAY_TASK_ID.
You can read more about job arrays in the sbatch manual.
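As a minimal sketch of the idea above, the job script below picks one entry from a saved list using the array index in SLURM_ARRAY_TASK_ID. The file names and the commented-out my_program call are hypothetical placeholders, and the default of 0 is only there so the script can be tried outside of SLURM.

```shell
#!/bin/bash
#SBATCH --array=0-2

# Hypothetical list of inputs, one per line (this could equally be a file).
input_list="sample_a.dat
sample_b.dat
sample_c.dat"

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; fall back to 0
# so the script can be tested on a login node without submitting it.
task_id=${SLURM_ARRAY_TASK_ID:-0}

# Array indices start at 0 here, while sed counts lines from 1.
input_file=$(printf '%s\n' "$input_list" | sed -n "$((task_id + 1))p")

if [ -z "$input_file" ]; then
    echo "No input defined for array index $task_id" >&2
    exit 1
fi

echo "Array task $task_id will process $input_file"
# ./my_program "$input_file"   # hypothetical program invocation
```

Keeping the array range in step with the number of entries in the list avoids tasks that start with nothing to do.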
If you are not running your job in a priority partition, you will find that our 8 concurrent job limit makes job arrays a poor fit for situations where each job has low resource usage. For example, if each of your jobs only needs one CPU core, using job arrays limits you to 8 CPUs as opposed to your 392 core allowance. In such high-throughput situations you can use GNU Parallel, which we have integrated with SLURM.
If you have a priority account,
you have unlimited concurrent jobs
and can make full use of job arrays.
If you are competing with other users in your group for compute time,
you may be better off submitting a job with a larger resource allocation and
using GNU Parallel to run your program multiple times across all nodes in a single job.
An additional advantage of GNU Parallel is that it allows you to resume your job where you left off.
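The resume behaviour can be sketched with GNU Parallel's --joblog and --resume options. This is a generic illustration that does not use our SLURM integration; the input names and the echo command are placeholders, and the script simply skips the run if GNU Parallel is not installed where you try it.

```shell
#!/bin/bash
# Scratch directory with a hypothetical list of inputs.
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma > "$workdir/inputs.txt"
joblog="$workdir/parallel.log"

run_all() {
    # --joblog records each finished task; --resume skips tasks that are
    # already recorded there. -j 2 runs two tasks at a time.
    parallel --joblog "$joblog" --resume -j 2 'echo processing {}' :::: "$workdir/inputs.txt"
}

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    run_all   # first pass: all three inputs are processed
    run_all   # second pass: nothing left to do, as the job log is complete
else
    echo "GNU Parallel not available here; skipping the demonstration"
fi
```

If a job is killed partway through, rerunning the same command with the same job log picks up at the first unfinished input.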
There are several reasons for using job dependencies:
- Run the same job multiple times, if the program can resume "checkpointed" computations.
- Run jobs that require data from previous jobs.
Generally, you will find the
--dependency=singleton option most useful,
which requires the submitted jobs to have the same name:
# Change this name to something more appropriate to your computation
#SBATCH --job-name=singleton
#SBATCH --dependency=singleton
If you really want to use other types of dependencies,
you would need to use an additional script to run
sbatch to submit your job,
read the job ID from the submitted job,
and apply the dependency to the subsequent job submission.
This is more tedious than with other job schedulers, which can use the job name,
but SLURM's way is more robust.
You can read more about job dependencies in the sbatch manual.
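A helper script along the lines described above might look like the following sketch. It relies on sbatch's --parsable option to print the job ID and on the afterok dependency type; the function name submit_chain and the script names in the usage note are hypothetical.

```shell
#!/bin/bash
# Submit the given job scripts in order. Each job after the first is held
# until the previous job has finished successfully (afterok).
submit_chain() {
    local prev="" script out
    for script in "$@"; do
        if [ -z "$prev" ]; then
            out=$(sbatch --parsable "$script") || return 1
        else
            out=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
        prev=${out%%;*}
        echo "Submitted $script as job $prev"
    done
}

# Only submit when job scripts are named on the command line.
if [ "$#" -gt 0 ]; then
    submit_chain "$@"
fi
```

Saved as, say, submit_chain.sh, it could be run as `bash submit_chain.sh step1.sh step2.sh`. Other dependency types from the sbatch manual, such as afterany, can be substituted for afterok.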
SLURM has a checkpoint/restart feature
which is intended to save a job state to disk as a checkpoint
and resume from a saved checkpoint.
This is done using the BLCR library
which is installed on all our nodes.
Intel MPI versions 2013 and later support the BLCR checkpoint/restart library.
Intel MPI is available on our cluster as the
therefore see the Checkpoint/Restart options from
We don't have much hands-on experience with checkpoint/restart. You may need to apply the feature only at the mpirun/mpiexec level, for example, with:
mpiexec.hydra --YOUR_OPTIONS_HERE YOUR_PROGRAM
...or instead apply them at the SLURM level with:
#SBATCH --checkpoint
#SBATCH --checkpoint-dir
Or apply the options in both places. You will have to see what works for your program. Definitely get in touch with us if you are experimenting with this feature and need help, as the BLCR kernel module saves log files only visible to administrators.
Jobs in the Wild West
If the cluster is particularly busy,
and your job can finish in less than a day,
you can try running your job in the
Daily jobs are run in this special partition that will kill (or, more accurately, "preempt") any existing jobs there.
To re-run your job if it is killed you can add the
--requeue option. The
--requeue option is also useful for long jobs where a node might fail.
If your job fails for any other reason, it will not be requeued;
for that type of situation,
you would need to set up a job dependency as explained in the previous section.
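A requeued job starts again from the top of its script, so it helps if the script can tell a restart from a first run. The sketch below uses the SLURM_RESTART_COUNT variable that SLURM sets for requeued jobs; the partition line is left for you to add, and the resume logic is only a placeholder for whatever your program supports.

```shell
#!/bin/bash
#SBATCH --requeue
# Add your --partition line here.

# SLURM_RESTART_COUNT is set once the job has been requeued or restarted;
# it is unset on the first run, so default to 0.
restarts=${SLURM_RESTART_COUNT:-0}

if [ "$restarts" -gt 0 ]; then
    mode="resume"   # e.g. continue from your program's own checkpoint files
else
    mode="fresh"    # first attempt: start from the beginning
fi

echo "Run mode: $mode (restart count: $restarts)"
```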
Running jobs at particular times
We don't allow crontab on the login nodes.
You can use the
--begin option in
sbatch to set a specific start time (and date) you want the job to start.
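For instance, a job script could delay its start as in this sketch. The one-hour delay is only an illustrative value, and the alternative forms in the comments follow the time formats listed in the sbatch manual.

```shell
#!/bin/bash
# Do not start this job until one hour after it is submitted.
#SBATCH --begin=now+1hour
#
# Other forms accepted by --begin (use one at a time):
#   --begin=16:00                  next occurrence of 4 pm
#   --begin=2017-06-21T08:30:00    a specific date and time

started=$(date)
echo "Job started at $started"
```

The same option can be given on the command line instead, for example `sbatch --begin=16:00 myjob.sh`, where myjob.sh stands in for your own job script.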
Remotely running commands
If you need to execute a command at a specific time on a login node of the cluster, you can set up SSH keys to log in to the cluster from another computer and run the command.
Cancel All of Your Jobs
If you have a large number of jobs to cancel it would be tedious to individually cancel them with
scancel. This short command will cancel all of your jobs, so use it carefully.
scancel -u $USER
Targeting Specific Node Architectures
Some of the SLURM partitions span multiple generations of hardware architecture, specifically:
debug. In some circumstances you may want to ensure that your jobs run on a specific node architecture, such as:
- MPI jobs may perform significantly better using homogeneous nodes which are tightly coupled with InfiniBand
- Some applications may have compiler optimization flags specific to a CPU's built-in instructions, such as hardware AES encryption
- You may experience non-deterministic runtimes from one job to another if each runs on a different architecture
- None of the priority Partitions span multiple hardware architectures.
- We are frequently adding new nodes, and occasionally removing old nodes, so the instructions below may change frequently. If you are experiencing an issue with job targeting, please refer back to this page and ensure that your specific command is still accurate.
- Click here to view a comparison chart of the different CPU types on our compute nodes.
The --exclude parameter is used to target a given job to a specific hardware architecture.
To target your job at the Broadwell architecture, which has 4 nodes, each with two Intel E5-2699 V4 CPUs and 256GB of RAM:
To target your job at the Haswell architecture, which has 175 nodes, each with two Intel E5-2690 V3 CPUs and 128GB of RAM:
Eight of the Haswell nodes have a higher amount of RAM, 192GB. These are available in the
general_requeue partition. To target these nodes:
#SBATCH --partition=general_requeue
#SBATCH --exclude=cn[01-256,265-320,325-328]
To target your job at the Ivy Bridge architecture, which has 32 nodes, each with two Intel E5-2680 V2 CPUs and 128GB of RAM:
To target your job at the Sandy Bridge architecture, which has 40 nodes, each with two Intel E5-2650 CPUs and 64GB of RAM: