SLURM Job Array Migration Guide

From Storrs HPC Wiki
Revision as of 17:32, 22 June 2020 by Pan14001 (talk | contribs) (Use LDFLAGS to avoid loading gcc)

This guide describes how to bypass the 8 job limit by lightly restructuring your code to effectively use multiple CPUs within a single job. As a side benefit, your code will also become more resilient to failure by gaining the ability to resume where it left off.

If you have used SLURM on other clusters, you may be surprised by the 8 job limit. The limit was put in place at the request of our users to reduce the time between submitting your job and it starting to run, and to share the cluster more fairly.

The goal of this guide is to explain several concepts underlying job parallelism, starting with SLURM job arrays, taking a detour through shell background jobs and xargs, and finally describing sophisticated parallelism and job control using GNU Parallel.

Method        Multiple CPUs   Multiple Nodes   Resumable           Max CPUs
Job Arrays    Yes             Yes              Manual (see note)   8
Bash Jobs     Yes             No               No                  24
xargs         Yes             No               No                  24
GNU Parallel  Yes             Yes              Yes                 192
MPI           Yes             Yes              Maybe               192

Note: assumes each job step uses 1 CPU.

Let's get started!


Let's solve an authentic task: Bayesian inference using the Stan language.

The command-line version of Stan (CmdStan) cannot be shared as a module and is instead meant to be compiled in your home directory, because Stan works by compiling model programs before running them. The setup should take about 10 minutes. Run these commands in your shell:

tar -xf cmdstan-2.23.0.tar.gz
cd cmdstan-2.23.0/
module purge
module load gcc/9.2.0
# Note that we unset the RTM_KEY of TBB because RTM instructions are only available on Skylake,
# but we want to be able to run Stan on older CPUs.
make -j RTM_KEY= build

Build and run the example model as described in the output of make. Compiling a model only uses one CPU core at a time, so there is no need for -j here.

# Set LDFLAGS so that we can run the model without loading gcc.
make LDFLAGS=-Wl,-rpath,/apps2/gcc/9.2.0/lib64 examples/bernoulli/bernoulli
examples/bernoulli/bernoulli sample data file=examples/bernoulli/
bin/stansummary output.csv

Job Array script

Consider this simple job array script, which we will save as submit.slurm:

#!/bin/bash
#SBATCH --partition debug
#SBATCH --ntasks 1
#SBATCH --array 1-5

# Load only required modules.
module purge
module load gcc/9.2.0

# Run the parameter set selected by the SLURM array index.
Rscript model_fit.R ${SLURM_ARRAY_TASK_ID}
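Each array task sees its own value of SLURM_ARRAY_TASK_ID, which is how a single script fans out over the parameter table. A minimal sketch of the selection logic, setting the variable by hand here only for illustration (SLURM sets it for you inside a job):

```shell
# Simulate what array task 3 sees; SLURM exports this automatically.
SLURM_ARRAY_TASK_ID=3

# The parameter table from this guide, written without column padding.
cat > parameters.csv <<'EOF'
idx,niter,tol
1,100,1e-4
2,1000,1e-8
3,3000,1e-8
4,2000,1e-8
5,3000,1e-4
EOF

# Select the row whose idx matches the task id, as model_fit.R does in R.
awk -F, -v id="$SLURM_ARRAY_TASK_ID" '$1 == id' parameters.csv
# prints: 3,3000,1e-8
```

Submitting with sbatch submit.slurm creates five tasks, each performing this selection with its own index.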

The R script is model_fit.R:

# Read the index into the list of parameters from the command line.
args <- commandArgs(trailingOnly = TRUE)
param_idx <- as.integer(args[[1]])

# Read in the parameters for this index.
params <- read.csv("parameters.csv", strip.white = TRUE)
params <- subset(params, params$idx == param_idx)

# Load data from the builtin "datasets" package.
# (mtcars is an arbitrary example; rename its columns to match the formula.)
dat <- datasets::mtcars[, 1:3]
names(dat) <- c("y", "x_0", "x_1")

# Fit the model with this run's parameters and save the result.
formula <- y ~ x_0 + x_1
fit <- glm(formula, data = dat,
           control = glm.control(epsilon = params$tol, maxit = params$niter))
saveRDS(fit, file = sprintf("fit-%d.rds", param_idx))

This is what our parameters.csv file looks like:

idx, niter, tol
  1,   100, 1e-4
  2,  1000, 1e-8
  3,  3000, 1e-8
  4,  2000, 1e-8
  5,  3000, 1e-4

We use R in this example because base R can run all of our statistical computing without loading any additional libraries, so it should "just work" without any additional setup on your part.

Bash jobs

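Bash can run several job steps at once by putting each command in the background with & and then waiting for all of them with wait. A minimal sketch, with echo standing in for the real Rscript call:

```shell
# Launch one model fit per parameter index in the background.
for idx in 1 2 3 4 5; do
    echo "fitting parameter set $idx" &   # stand-in for: Rscript model_fit.R $idx &
done

# Block until every background job has finished.
wait
echo "all fits done"
```

All five steps run on the CPUs of one node; there is no job log, so an interrupted run starts over from the beginning.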

GNU Parallel
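As a preview of this section, GNU Parallel can spread job steps across CPUs (and, with its --sshlogin option, across nodes), and its --joblog with --resume skips steps that already completed. A sketch, assuming the parallel binary is available, with echo standing in for the real Rscript call:

```shell
# Run the five parameter sets, a few at a time; the job log lets an
# interrupted run resume where it left off instead of starting over.
parallel --jobs 4 --joblog fits.log --resume \
    'echo "would run: Rscript model_fit.R {}"' ::: 1 2 3 4 5
```

Rerunning the same command after a crash consults fits.log and only launches the indices that have not yet finished.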