SLURM Job Array Migration Guide

From Storrs HPC Wiki
Revision as of 10:58, 22 June 2020 by Pan14001 (talk | contribs) (Add introduction and skeleton for sections)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

This guide describes how to bypass the 8 job limit by lightly restructuring your code to effectively use multiple CPUs within a single job. As a side benefit, your code will also become more resilient to failure by gaining the ability to resume where it left off.

If you have used SLURM on other clusters, you may be surprised by the 8 job limit; the reason the limit was put in place is to reduce the time between submitting your your job and it starting to run. The limit was added by the request of our users to share the cluster more fairly.

The goal if the guide is to explain concepts several underlying job parallelism, starting with SLURM Job Arrays, taking a detour to using shell job parallelism, and finally describing sophisticated parallelism and job control using GNU Parallel.

Method Multiple CPUs Multiple Nodes Resumable Max CPUs
Job Arrays Yes Yes Manual 8
Bash Jobs Yes No No 24
xargs Yes No No 24
GNU Parallel Yes Yes Yes 192
MPI Yes Yes Maybe 192

Let's get started!

Job Array script

Consider this simple job array script

Bash jobs


GNU Parallel