SLURM Job Array Migration Guide
This guide describes how to work around the 8-job limit by lightly restructuring your code so that a single job makes effective use of multiple CPUs. As a side benefit, your code will also become more resilient to failure, because it gains the ability to resume where it left off.
If you have used SLURM on other clusters, you may be surprised by the 8-job limit. The limit was put in place to reduce the time between submitting your job and it starting to run. It was added at the request of our users so that the cluster is shared more fairly.
The goal of this guide is to explain several concepts underlying job parallelism, starting with SLURM Job Arrays, taking a detour through shell job parallelism, and finally describing more sophisticated parallelism and job control using GNU Parallel.
|Method||Multiple CPUs||Multiple Nodes||Resumable||Max CPUs|
Let's get started!
Job Array script
Consider this simple job array script: