SLURM Guide

From Storrs HPC Wiki
Revision as of 09:50, 28 August 2014 by Mpz13001

Platform LSF is our job scheduler. All jobs on the cluster must be submitted and managed using LSF. Jobs that are not submitted through LSF will be killed.

These directions assume that you are connected to the HORNET cluster via SSH.

Job Submission - bsub (docs)

Commands are submitted to LSF using the bsub command. This is known as submitting a job.

  • To submit a simple job (one that uses a single CPU core):
bsub {COMMAND}
  • To submit a job that uses an arbitrary number of cores:
bsub -n {NUM_CORES} {COMMAND}

If you specify an email address, LSF will send the job's output (stdout) to that address when the job completes. To do this:

bsub -u {EMAIL_ADDRESS} {COMMAND}

Alternatively, you can write stdout to a file:

bsub -o {OUTPUT_FILE} {COMMAND}
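
The options above can be combined in a single submission. As a minimal sketch, the shell function below assembles a bsub command line from a core count, an output file, and a command, and prints it instead of running it so it can be checked first. The function name build_bsub is a hypothetical helper, not part of LSF, and the file and program names in the example call are placeholders.

```shell
#!/bin/sh
# build_bsub: hypothetical helper (not an LSF command) that prints a bsub
# command line using the -n and -o options described above.
# Usage: build_bsub NUM_CORES OUTPUT_FILE COMMAND [ARGS...]
build_bsub() {
    cores="$1"
    outfile="$2"
    shift 2
    # Print the command rather than executing it, so it can be reviewed.
    echo "bsub -n ${cores} -o ${outfile} $*"
}

# Placeholder values: 4 cores, stdout written to result.out.
build_bsub 4 result.out ./my_program input.dat
```

The call above prints `bsub -n 4 -o result.out ./my_program input.dat`. Once the line looks right, it can be run as-is on the cluster.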

Interactive

  • If you require an interactive job, use -I, -Ip (for a pseudo-terminal), or -Is (for a pseudo-terminal with shell mode support).
bsub -I {COMMAND}

For example,

bsub -Is R


All jobs should be run via the command-line interface. Many programs that you know through their graphical interface also have a command-line interface.

Excluding a node

To exclude a node (say, cn01), run:

bsub -R "select[hname!='cn01']" {COMMAND}
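
To exclude more than one node, the conditions inside select[] can be joined with &&. The sketch below builds such a string from a list of host names. The function name exclude_hosts is a hypothetical helper, cn02 is a made-up second node, and the sketch assumes that the cluster's LSF version accepts && in resource requirement strings.

```shell
#!/bin/sh
# exclude_hosts: hypothetical helper that prints a select[] string
# excluding every host name passed as an argument.
exclude_hosts() {
    expr=""
    for host in "$@"; do
        # Join the per-host conditions with && (assumed to be supported
        # by the installed LSF version).
        if [ -n "$expr" ]; then
            expr="$expr && "
        fi
        expr="${expr}hname!='${host}'"
    done
    echo "select[${expr}]"
}

# cn02 is a made-up node name used only for illustration.
exclude_hosts cn01 cn02
```

This prints `select[hname!='cn01' && hname!='cn02']`, which can then be passed to bsub as `bsub -R "$(exclude_hosts cn01 cn02)" {COMMAND}`.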

Submitting to Different Queues

Different queues allow access to different nodes: to access the new Sandy Bridge nodes or the GPU nodes you will have to submit to a separate queue. Documentation for submitting jobs to different queues can be found here.

Job status - bjobs (docs)

To view the jobs you've submitted:

bjobs -u {USERNAME}

This will tell you the JOBID, its status (STAT), which QUEUE it is in, the host you submitted the job from (FROM_HOST), which host it is executing on (EXEC_HOST), the JOB_NAME, and the SUBMIT_TIME.
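
Because bjobs prints one job per line, its output can be filtered with standard tools. The following sketch prints the JOBID of every running job. It assumes the default column layout, with JOBID in the first column and STAT in the third (after a USER column), and the sample listing is made up for illustration rather than taken from the cluster. The function name running_jobs is hypothetical.

```shell
#!/bin/sh
# running_jobs: reads bjobs output on stdin and prints the JOBID of each
# job whose STAT column is RUN. Assumes JOBID is column 1 and STAT is
# column 3; the header line is skipped.
running_jobs() {
    awk 'NR > 1 && $3 == "RUN" { print $1 }'
}

# Made-up sample listing, for illustration only.
running_jobs <<'EOF'
JOBID   USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
1001    jdoe   RUN   normal  login1     cn05       sim_a     Aug 28 09:12
1002    jdoe   PEND  normal  login1                sim_b     Aug 28 09:15
1003    jdoe   RUN   normal  login1     cn07       sim_c     Aug 28 09:20
EOF
```

On the cluster, the same filter would be applied as `bjobs -u {USERNAME} | running_jobs`.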

For detailed information about a specific job:

bjobs -l {JOBID}

Other useful LSF tricks

To kill a job,

bkill {JOBID}

To remove a job from the LSF system without waiting for the job to terminate in the operating system:

bkill -r {JOBID}
  • To close your script, use Ctrl-C.
  • Remember that the admin team is not responsible for your program being disrupted. Failing to adhere to this protocol can result in loss of data, being kicked off the cluster, and/or a temporary ban.

For more information, check the command reference.

The Platform LSF Knowledge Center is also available here.