SLURM Guide
Simple Linux Utility for Resource Management

Author | Lawrence Livermore National Laboratory
Website | http://slurm.schedmd.com
Source | GitHub
Category | Job scheduler
Help | Man pages, Mailing list, Tutorials
To use the cluster optimally and fairly, all application programs must be run through the job scheduler, SLURM. When you use SLURM's sbatch command, your application program gets submitted as a "job". Please do not run application programs directly from the command line when you connect to the cluster. Doing so may slow down performance for other users, and your commands will be automatically throttled or terminated. To better understand how applications get submitted as jobs, let's review the difference between login nodes and compute nodes.

- Login node: When you connect to the cluster and see the prompt [<YourNetID>@cn01 ~], you are connected to a single computer shared with all your fellow users, known as the "login node". The purpose of the login node is for you to submit jobs, copy data, edit programs, etc. The programs that are allowed to run on login nodes are listed in our usage policy.
- Compute nodes: These computers do the heavy lifting of running your programs. However, you do not interact with compute nodes directly. You ask the scheduler for compute nodes to run your application program using SLURM, and SLURM will find available compute nodes and run your application program on them.
Job Submission
Your applications are submitted to SLURM using the sbatch command. This is known as submitting a job. The sbatch command takes as an argument a script describing the resources to be allocated and the actual executable to be run on the cluster. The script can be as specific or general as desired, with each option on a new line preceded by #SBATCH. The script must start with #!/bin/bash.
First, login to the cluster:
$ ssh NetID@login.storrs.hpc.uconn.edu
Use nano or your favorite text editor to create your job submission script. Here is a very simple job example:
[NetID@login1 ~]$ nano myJob.sh
#!/bin/bash
#SBATCH --ntasks=1    # Job only requires 1 CPU core
#SBATCH --time=5      # Job should run for no more than 5 minutes

echo "Hello, World"   # The actual command to run
Save your submission script and then submit your job:
[NetID@login1 ~]$ sbatch myJob.sh
Submitted batch job 279934
You can view the status of your job with the sjobs command, as described later in this guide. The output of your job will be in the current working directory in a file named slurm-JobID.out, where JobID is the number returned by sbatch in the example above.
[NetID@login1 ~]$ ls *.out
slurm-279934.out
[NetID@login1 ~]$ cat slurm-279934.out
Hello, World
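If you prefer a custom output file name instead of the default slurm-JobID.out, you can set one in your submission script with the --output directive; the %j pattern expands to the job ID. A minimal illustration (the file name here is just an example):

#SBATCH --output=myJob_%j.out   # write output to myJob_<JobID>.out instead of slurm-<JobID>.out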
Job Examples
The HPC cluster is segmented into groups of identical resources called Partitions. All jobs submitted to the cluster run within one of these Partitions. If you do not select a Partition explicitly, the scheduler will put your job into the default Partition, which is called general. Each Partition has defined limits for job runtime and core usage, with specific details available on the usage policy page. You can view a list of all partitions and their status by running the sinfo command.
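For example, sinfo can summarize all partitions or show a single one (illustrative invocations; the node states you see will depend on the cluster's current load):

sinfo --summarize            # one summary line per partition
sinfo --partition=general    # node states within the general partition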
Below are multiple examples for how to submit a job in different scenarios.
Default Partition (general)
The default Partition is named general and is meant for broad use cases (MPI jobs, serial jobs, interactive jobs, etc.). It spans multiple generations of compute nodes, so please be advised that if you request more than one node, you may receive nodes with different configurations (cores, memory, etc.). Please read the Advanced SLURM Guide for examples of how to ensure your jobs run on a specific node architecture. This partition allows you to request up to 192 cores and run for up to 12 hours.
The example job below requests 48 CPU cores for two hours, and emails the specified address upon completion:
#!/bin/bash
#SBATCH --partition=general               # Name of partition
#SBATCH --ntasks=48                       # Request 48 CPU cores
#SBATCH --time=02:00:00                   # Job should run for up to 2 hours (for example)
#SBATCH --mail-type=END                   # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu  # Destination email address

myapp --some-options path/to/app/parameters  # Replace with your application's commands
General Use Partition, with Requeue (general_requeue)
The general_requeue partition makes available unused compute cycles from partitions dedicated to specialty use cases. It currently features 13 nodes from the Haswell architecture (312 cores). Jobs in this partition may be killed at any time, but are then automatically requeued. As a result, there are often available resources in this partition when others are full. If your job is designed to be able to restart at any point in its execution, this partition is a great way to access available resources. This partition allows you to request up to 192 cores and run for up to 12 hours.
The example job below requests 48 CPU cores for two hours, and emails the specified address upon completion or upon being requeued:
#!/bin/bash
#SBATCH --partition=general_requeue       # Name of Partition
#SBATCH --requeue                         # Allow the job to be requeued, if preempted
#SBATCH --ntasks=48                       # Maximum CPU cores for job
#SBATCH --nodes=2                         # Ensure all cores are from whole nodes
#SBATCH --time=02:00:00                   # Job should run for up to 2 hours (for example)
#SBATCH --mail-type=END,REQUEUE           # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu  # Destination email address

myapp --some-options path/to/app/parameters  # Replace with your application's commands
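Because jobs in this partition can be preempted and requeued, the application itself should be able to pick up where it left off. A minimal sketch of such a pattern, assuming a hypothetical myapp that can write and resume from a checkpoint file:

# Hypothetical restart logic: resume from a checkpoint if a previous run left one behind.
if [ -f checkpoint.dat ]; then
    myapp --resume checkpoint.dat                # --resume and checkpoint.dat are placeholders
else
    myapp --some-options path/to/app/parameters  # fresh start
fi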
Test/debug Partition (debug)
The Partition named debug is optimized for very short-running jobs for the purpose of testing, debugging, or compiling. It spans multiple generations of compute nodes. Please read the Advanced SLURM Guide for examples of how to ensure your jobs run on a specific node architecture. This Partition allows you to request up to 24 cores and run for up to 30 minutes.
#!/bin/bash
#SBATCH --partition=debug                 # Name of Partition
#SBATCH --ntasks=4                        # Maximum CPU cores for job
#SBATCH --nodes=1                         # Ensure all cores are from the same node
#SBATCH --time=5                          # Job should run for up to 5 minutes (for example)
#SBATCH --mail-type=END                   # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu  # Destination email address

myapp --some-options path/to/app/parameters  # Replace with your application's commands
MPI-optimized Partition (parallel)
The Partition named parallel is optimized for jobs requiring multiple nodes that communicate with MPI. All of the nodes in this Partition are configured with identical resources (cores, memory, etc.). This Partition allows you to request up to 384 cores and run for up to 6 hours.
#!/bin/bash
#SBATCH --partition=parallel              # Name of Partition
#SBATCH --ntasks=240                      # Request 240 CPU cores
#SBATCH --time=01:30:00                   # Job should run for up to 1.5 hours (for example)
#SBATCH --mail-type=END                   # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu  # Destination email address

mpirun myapp --some-options path/to/app/parameters  # Replace with your application's commands
Single Node Partitions (serial)
The Partition named serial is optimized for small jobs requiring from 1 to 24 cores. It is not appropriate for MPI jobs. This partition may span multiple generations of compute nodes. Please read the Advanced SLURM Guide for examples of how to ensure your jobs run on a specific node architecture.

The serial partition allows you to request up to 24 cores and run for up to 7 days.

The example below uses the serial partition:
#!/bin/bash
#SBATCH --partition=serial                # Name of Partition
#SBATCH --ntasks=24                       # Maximum CPU cores for job
#SBATCH --nodes=1                         # Ensure all cores are from the same node
#SBATCH --time=02-00:00:00                # Job should run for up to 2 days (for example)
#SBATCH --mail-type=END                   # Event(s) that triggers email notification (BEGIN,END,FAIL,ALL)
#SBATCH --mail-user=first.last@uconn.edu  # Destination email address

myapp --some-options path/to/app/parameters  # Replace with your application's commands
Run one multi-threaded program per node
In some cases, you are running a multi-threaded program that needs to be spawned on each node, and the program needs access to all available CPU cores on that node. We can use the srun process manager to launch our program once per node, and to ensure all the CPUs of a given node are allocated to the program on that node, we use the --cpu_bind=boards option:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=30

srun \
    --nodes $SLURM_NNODES \
    --ntasks $SLURM_NNODES \
    --cpu_bind=boards \
    sh -c 'echo $(hostname -s):$(nproc)'
The nproc value reported here is the number of CPUs made available to the job by cgroups. Of course, replace the sh -c ... segment with your multi-threaded program's command.
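As a concrete, hypothetical illustration, an OpenMP program could be launched this way, using nproc to size its thread pool; my_openmp_program is a placeholder for your own binary:

# One task per node, bound to all boards, running a hypothetical OpenMP binary.
srun \
    --nodes $SLURM_NNODES \
    --ntasks $SLURM_NNODES \
    --cpu_bind=boards \
    sh -c 'OMP_NUM_THREADS=$(nproc) ./my_openmp_program'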
Run one multi-core program and ensure all cores are on the same node
In this example, our R script is run 4 times with different inputs, 25 through 28, and each run gets 3 CPU cores. We confirm the number of CPU cores per task using the nproc command.
#!/bin/bash
#SBATCH --partition debug
#SBATCH --time 0:05
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 3
#SBATCH --output out.multicore

# Overwrite output file instead of appending.
echo -n > out.multicore

# Dump useful information about the job.
scontrol show job $SLURM_JOB_ID

# Check the number of CPUs per task is set up correctly for SLURM.
srun --label --ntasks $SLURM_NTASKS \
    sh -c 'echo $(hostname -s):$(nproc)'

# Always put a short delay between 2 subsequent srun commands,
# otherwise the second srun command can fail to launch with an error like
#   slurmstepd: execve(): : No such file or directory
sleep 2s

# Run all the RStan models.
#
# SLURM_PROCID will be a number from 0 to the number of tasks minus 1.
# Therefore we add this number to our first rep; rep 25 in this case.
module purge
module load gcc/5.4.0-alt r/3.5.1-gcc540 mpi/openmpi/1.10.7-gcc540

srun --label --ntasks $SLURM_NTASKS \
    sh -c 'Rscript --vanilla my_rstan_script.R $(( 25 + $SLURM_PROCID ))'
The my_rstan_script.R script is kept trivial here; it simply echoes the argument it receives:
options(echo = TRUE)
commandArgs(trailingOnly = TRUE)
The output:
cat out.multicore
---8<---8<---8<---
0: gpu01:3
2: gpu01:3
1: gpu01:3
3: gpu01:3
1: > commandArgs(trailingOnly = TRUE)
2: > commandArgs(trailingOnly = TRUE)
0: > commandArgs(trailingOnly = TRUE)
3: > commandArgs(trailingOnly = TRUE)
1: [1]
1: "26"
2: [1]
1: >
2: "27"
2: >
2:
0: [1]
0: "25"
0: >
0:
3: [1]
3: "28"
3: >
3:
Interactive Jobs
If you require an interactive job, use the fisbatch command instead of sbatch. This command does not use a submission script. Instead, all of the options from the submission script are given on the command line, without the #SBATCH keyword.
For example:
$ fisbatch --ntasks=12 --nodes=1 #Starts an interactive shell using 12 CPUs on 1 node
NOTE: please don't forget to exit when you finish your job. And, although many programs have a graphical interface, we recommend that all jobs use a command-line interface if supported by the application.
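Once the interactive shell starts on a compute node, you work as you would in any terminal session and then release the allocation. A sketch of a typical session (the module name is taken from examples earlier in this guide; my_analysis.R is a hypothetical workload):

module load r/3.5.1-gcc540   # load whatever software you need
Rscript my_analysis.R        # run your interactive work
exit                         # exit when finished so the allocation is released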
To use a custom partition with interactive jobs, specify the --partition parameter:
$ fisbatch --partition=serial ...
Re-attach an Interactive Job
If you suddenly lose the connection to the interactive screen, you can try the following command to attach to it again.

NOTE: re-attaching is not guaranteed to succeed every time. If it fails, the interactive job is no longer accessible; please scancel it.
First you need to get the JobID of the FISBATCH job:
$ sjobs
JobID    Partition  QOS      JobName   User      Submit               State    Elapsed   NNodes  NCPUS  NodeList  Start
-------- ---------- -------- --------- --------- -------------------- -------- --------- ------- ------ --------- -------------------
890246   general    general  FISBATCH  tix11001  2017-04-23T14:42:29  RUNNING  00:04:37  1       16     cn69      2017-04-23T14:42:30 None
Then, you can re-attach the job by JobID:
$ fisattach 890246
FISBATCH -- attaching for JOBID 890246 !
FISBATCH -- Connecting to head node (cn69)
Specify a Job Runtime
It is always recommended to specify a job time limit using the --time parameter in your submission script. If you do not specify this parameter, the scheduler assumes that your job will take the maximum allowable time for the Partition on which you're running. For example, if SLURM calculates that a given compute node will be idle for four hours, and your job specifies --time=02:00:00, then your job will be allowed to run. If you didn't specify this parameter with your submission, then your job would continue to wait for available resources. If you do specify this parameter but your job doesn't complete within the specified time, your job will be cancelled with a state of TIMEOUT.

Here are some examples of using the --time parameter:
#SBATCH --time=01-02:03:04   # Job will run at most 1 day, 2 hours, 3 minutes and 4 seconds
#SBATCH --time=01:02:03      # Job will run at most 1 hour, 2 minutes and 3 seconds
#SBATCH --time=01:02         # Job will run at most 1 minute and 2 seconds
#SBATCH --time=1             # Job will run at most 1 minute
You can learn more details about this feature in the sbatch manual:

$ man sbatch

Then search for --time.
Checking the Status of a Job
To view your active jobs:
sjobs              # View all jobs
sjobs -j {JOBID}   # View a specific job
Alternatively, the squeue command may be more descriptive for jobs in a PENDING state:
squeue -u `whoami`
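squeue also accepts a custom output format, which can be handy for checking how much time a running job has left. An illustrative example (%i is the job ID, %M the elapsed time, and %L the time remaining):

squeue -u `whoami` -o "%.18i %.10M %.10L"   # job ID, elapsed time, time left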
To view your job history:
shist              # View all jobs
shist -j {JOBID}   # View a specific job
To view all the jobs in the cluster:
squeue -a
To review host usage:
sinfo
To review all the job logs:
slogs
How to Terminate a Job
To terminate a single job:
scancel {JOBID}
To terminate all of your jobs:
scancel -u `whoami`
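scancel can also filter by job state; for example (illustrative), to cancel only your jobs that are still pending:

scancel --state=PENDING -u `whoami`   # cancel only your PENDING jobs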
Priority Users
If you have been granted access to priority resources through our condo model, then you need to submit your jobs using a custom Partition in order to avoid resource limits.
For example:
#SBATCH --partition=HaswellPriority
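A complete submission script for a priority Partition looks like any other; only the partition name changes. A minimal sketch, reusing the placeholders from the examples above:

#!/bin/bash
#SBATCH --partition=HaswellPriority       # Your priority Partition (see the table below)
#SBATCH --ntasks=48                       # Request 48 CPU cores (for example)
#SBATCH --time=02:00:00                   # Job should run for up to 2 hours (for example)

myapp --some-options path/to/app/parameters  # Replace with your application's commands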
The table below lists the different priority Partitions, and the faculty who have access.
Partition Name | Faculty
---|---
WestmerePriority | Wagstrom
SandyBridgePriority | Astitha, Dongare, Nakhmanson
IvyBridgePriority | Astitha, May
HaswellPriority | Brown, Li, Anagnostou, Zhao, Rajasekaran, Ramprasad, Alpay, Dept. of Economics, Matheou
Haswell192Priority | Dongare
Broadwell44Priority | Mellor