Difference between revisions of "SLURM Job Array Migration Guide"

From Storrs HPC Wiki

Revision as of 17:15, 22 June 2020

This guide describes how to work around the 8-job limit by lightly restructuring your code to use multiple CPUs effectively within a single job. As a side benefit, your code will also become more resilient to failure by gaining the ability to resume where it left off.

If you have used SLURM on other clusters, you may be surprised by the 8-job limit. The limit was put in place to reduce the time between submitting your job and it starting to run; it was added at the request of our users to share the cluster more fairly.

The goal of this guide is to explain several concepts underlying job parallelism, starting with SLURM Job Arrays, taking a detour through shell job parallelism and xargs, and finally describing more sophisticated parallelism and job control using GNU Parallel.
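
As a rough preview of the two shell-based approaches named above, the sketch below runs five placeholder steps first as Bash background jobs and then through xargs. It is only an illustration: run_step and the "step N done" message are hypothetical stand-ins for real work, and the limit of 4 concurrent commands is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical stand-in for one unit of work, e.g. fitting one model.
run_step() {
    echo "step $1 done"
}

# Bash jobs: start every step in the background, then wait for all of them.
run_with_bash_jobs() {
    for i in 1 2 3 4 5; do
        run_step "$i" &
    done
    wait
}

# xargs: read the indices from stdin, one per command (-n 1),
# and run at most 4 commands at a time (-P 4).
run_with_xargs() {
    printf '%s\n' 1 2 3 4 5 |
        xargs -n 1 -P 4 sh -c 'echo "step $1 done"' _
}

# Steps may finish in any order, so sort before comparing the two runs.
run_with_bash_jobs | sort
run_with_xargs | sort
```

Both variants run on a single node. The main practical difference in this sketch is that the Bash loop starts every step at once, while xargs caps how many run at the same time.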

Method        Multiple CPUs  Multiple Nodes  Resumable          Max CPUs
Job Arrays    Yes            Yes             Manual (See note)  8
Bash Jobs     Yes            No              No                 24
xargs         Yes            No              No                 24
GNU Parallel  Yes            Yes             Yes                192
MPI           Yes            Yes             Maybe              192
Note: Assume each job step uses 1 CPU
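
One way to read the Resumable column: a run is resumable if starting it again skips the steps that already finished. The sketch below shows the idea in plain Bash. It assumes a hypothetical convention of one empty "N.done" marker file per finished step; the directory, the file names, and the echo placeholder for the real work are all made up for illustration.

```shell
#!/bin/bash
# Run each listed step unless its marker file already exists.
# Usage: run_resumable MARKER_DIR STEP...
run_resumable() {
    done_dir=$1
    shift
    mkdir -p "$done_dir"
    for i in "$@"; do
        if [ -e "$done_dir/$i.done" ]; then
            echo "skip $i"
            continue
        fi
        # Placeholder for the real work, e.g. Rscript model_fit.R "$i".
        echo "run $i"
        # Record success only after the work has finished.
        touch "$done_dir/$i.done"
    done
}

workdir=$(mktemp -d)
run_resumable "$workdir" 1 2 3     # first pass runs every step
run_resumable "$workdir" 1 2 3 4   # second pass only runs step 4
rm -rf "$workdir"
```

Writing the marker after the work, not before, means a step that is interrupted halfway is run again on the next attempt.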

Let's get started!


Let's solve an authentic task of Bayesian inference using the Stan language.

The command-line version of Stan cannot be shared as a module and is instead meant to be compiled in your home directory, because Stan works by compiling model programs before running them. The setup should take about 10 minutes. Run these commands in your shell:

wget https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/cmdstan-2.23.0.tar.gz
tar -xf cmdstan-2.23.0.tar.gz
cd cmdstan-2.23.0/
module purge
module load gcc/9.2.0
# Note that we unset tbb's RTM_KEY because RTM instructions are only available on Skylake,
# but we want to be able to run Stan on older CPUs.
make -j RTM_KEY= build

Build and run the example model as suggested by the output of make. Compiling a model uses only one CPU core at a time, so there is no need to use -j here.

make examples/bernoulli/bernoulli
examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R
bin/stansummary output.csv

Job Array script

Consider this simple job array script, which we will save as submit.slurm:

#!/bin/bash
#SBATCH --partition debug
#SBATCH --ntasks 1
#SBATCH --array 1-5

# Load only required modules.
module purge
module load \
       gcc/9.2.0 \

# Run parameter index set by the SLURM array index.
Rscript model_fit.R ${SLURM_ARRAY_TASK_ID}
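
To see what each array task would run without submitting anything, the array index can be emulated locally. This is only a dry-run sketch: dry_run_array is a hypothetical helper, and on the cluster SLURM itself sets SLURM_ARRAY_TASK_ID for each task.

```shell
#!/bin/bash
# Print the command that each array task in the range FIRST..LAST would run.
# Here the index is set by hand instead of by SLURM.
dry_run_array() {
    first=$1
    last=$2
    for SLURM_ARRAY_TASK_ID in $(seq "$first" "$last"); do
        echo "Rscript model_fit.R ${SLURM_ARRAY_TASK_ID}"
    done
}

# Matches the "--array 1-5" range of the script above.
dry_run_array 1 5
```

Each printed line corresponds to one array task, which SLURM schedules as a separate job.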

The R script is model_fit.R:

# Read the index for the list of parameters from the command line.
args <- commandArgs(trailingOnly = TRUE)
param_idx <- as.integer(args[[1]])

# Read in the parameters for this index.
params <- read.csv("")
params <- subset(params, params$idx == param_idx)

# Load data from builtin "datasets" package.

# Fit the model to the parameters and save the result.
formula <- y ~ x_0 + x_1

This is what our parameters.csv looks like:

idx, niter, tol
  1,   100, 1e-4
  2,  1000, 1e-8
  3,  3000, 1e-8
  4,  2000, 1e-8
  5,  3000, 1e-4
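
The same row lookup can also be done from the shell, which is handy for checking which parameters a given index maps to. The sketch below is one possible approach, not part of the original workflow: lookup_params is a hypothetical helper, and it trims the padding spaces around each field before comparing.

```shell
#!/bin/bash
# Print "niter tol" for the row whose idx column equals the given index.
# Usage: lookup_params IDX [FILE]   (FILE defaults to parameters.csv)
lookup_params() {
    awk -F',' -v idx="$1" '
        NR > 1 {
            # Strip the alignment spaces around every field.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($1 == idx) print $2, $3
        }' "${2:-parameters.csv}"
}

# Demo on a temporary copy of the first rows of the table above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
idx, niter, tol
  1,   100, 1e-4
  2,  1000, 1e-8
EOF
lookup_params 2 "$demo"   # prints: 1000 1e-8
rm -f "$demo"
```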

We use R in this example because we can run all our statistical computing without loading any additional libraries and so it should "just work" without needing any additional setup on your part.

Bash jobs


GNU Parallel