Difference between revisions of "HPC Intermediate"

From Storrs HPC Wiki

Revision as of 00:32, 20 March 2017

This lesson assumes you have some experience with:

  • The Linux command-line
  • Loading software using the module command
  • Using sbatch to submit jobs

If you are in the workshop classroom, sign in as a student to UCONNHPC on socrative.com.

Unix Shell automation

Commands below assume you are using the bash shell. Equivalent commands are given after the bash commands in case you are using the csh shell instead.

Check for the shell you are using with:

echo $0

Aliases

Save yourself typing; create aliases:

# the command we want to alias
htop
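# one way: single quotes, so $USER is expanded each time the alias runs
alias htop='htop -u $USER' # for csh, no =
type htop                  # for csh, use `where` instead
# another way: double quotes, so $USER is expanded once, when the alias is defined
alias htop="htop -u $USER" # for csh, no =
type htop                  # for csh, use `where` instead
# bypass the alias
command htop               # for csh, use \htop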
# remove alias
unalias htop
type htop                  # for csh, use `where` instead
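
The two quoting styles differ in when the variable is expanded. A minimal sketch, assuming bash; the variable WHO and the alias names early and late are made up for this example so the effect is visible without switching users:

```shell
#!/bin/bash
# WHO, early and late are illustrative names, not part of the lesson files.
WHO=alice
alias early="echo $WHO"   # double quotes: $WHO is expanded now, the alias stores "echo alice"
alias late='echo $WHO'    # single quotes: the alias stores the literal text "echo $WHO"
WHO=bob

# Print the stored definitions to compare them.
alias early
alias late
```

The stored definition of early contains alice, while late still contains the unexpanded $WHO, so only the single-quoted alias would pick up a later change to the variable.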

Our sjobs command is an alias!

type sjobs                 # for csh, use `where` instead

Challenge 1: Create an alias to `cd` to the scratch directory by typing `scratch`

Functions

cp -r /scratch/lesson-intermediate ~
cd ~/lesson-intermediate
# for bash
cat watch-sbatch.sh
source watch-sbatch.sh
# for csh
cat stail
set path = (`pwd` $path)
stail
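
The contents of watch-sbatch.sh are not reproduced on this page. As a rough idea of what a function like stail could do, here is a hypothetical sketch; the name stail_sketch and its logic are assumptions, not the lesson's actual code. It relies on sbatch --parsable printing the job ID and on SLURM's default output file name slurm-<jobid>.out:

```shell
#!/bin/bash
# Hypothetical sketch; the real stail in watch-sbatch.sh may differ.
stail_sketch() {
    local jobid outfile
    # --parsable makes sbatch print only the job ID (optionally followed by ;cluster).
    jobid=$(sbatch --parsable "$@") || return 1
    jobid=${jobid%%;*}
    # Assumes the job script does not override the default output file name.
    outfile="slurm-${jobid}.out"
    # Wait for the output file to appear, then follow it until Ctrl + C.
    until [ -e "$outfile" ]; do sleep 1; done
    tail -f "$outfile"
}
```

Unlike an alias, a function can take arguments, so stail_sketch bash-fail.slurm would pass the file name straight through to sbatch.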

Job troubleshooting

cd test-stail
# Let's see what nodes are free
sinfo -s
# Try out stail
stail bash-fail.slurm

Challenge 2: Why did the job fail?

# What does success look like?
stail bash-success.slurm

There are two useful debugging features enabled in "bash-debug.slurm"

# Getting debug information
stail bash-debug.slurm
# Why is this stuck in pending?
stail bash-pending.slurm
# Use Control + C to cancel out.
# To find why, see the output of:
squeue -u $USER
stimes -u $USER
# Related and also useful
sprio -u $USER
# Cancel our job
sjobs
scancel -u $USER
sjobs
make clean
srun --partition=phi hostname

History-fu

Ctrl + R: reverse search through shell history.

Ctrl + G: exit search mode.

mkdir my-descriptive-long-name
cd 

Retype last word:

  • Bash: Alt + . (Use ESC on OSX for Alt)
  • csh: Alt + Shift + _
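
In scripts, or when the keyboard shortcut is unavailable, bash also keeps the last argument of the previous command in the special parameter $_. A minimal sketch, assuming bash; the temporary parent directory is only there so the example does not clutter your home directory:

```shell
#!/bin/bash
base=$(mktemp -d)                        # throwaway parent directory for the demo
mkdir "$base/my-descriptive-long-name"
cd "$_"                                  # $_ holds the last argument of the previous command
pwd
```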

Multiple windows using tmux and screen

Environment variables

Quiz!

TERM=xterm

Job optimization

$ sacct
$ sacct -l
# Lots of info!  Let's break that down.
$ man sacct
$ type sjobs
$ cat /scratch/lesson-intermediate/job-aliases.sh # or .csh
$ source /scratch/lesson-intermediate/job-aliases.sh # or .csh
# CPU usage
$ job-cpu
$ job-cpu 761842
       JobID      NCPUS    Elapsed     MinCPU      CPUTime 
------------ ---------- ---------- ---------- ------------ 
761842               96   05:50:33             23-08:52:48 
761842.batch         24   05:50:33   00:00:00   5-20:13:12 
761842.1             96   05:50:31   05:49:57  23-08:49:36 
# Memory usage
$ job-mem 761842
       JobID  MaxVMSize     MaxRSS     AveRSS 
------------ ---------- ---------- ---------- 
761842                                        
761842.batch    652640K      7548K      7548K 
761842.1        502984K     94908K     40215K
# Disk usage
$ job-disk 761842
       JobID  MaxDiskRead MaxDiskWrite MaxPages 
------------ ------------ ------------ -------- 
761842                                          
761842.batch           2M           1M        0 
761842.1             119M        4850M        0
# Exercise
# See disk usage for 760356
# See CPU usage for 576568
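
In the job-cpu output for job 761842, the CPUTime column matches NCPUS multiplied by Elapsed. A small sketch to check that arithmetic, assuming bash; to_seconds and to_dhms are helper names made up for this example and are not part of job-aliases.sh:

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds (illustrative helper).
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Format seconds as D-HH:MM:SS, the layout shown in the CPUTime column.
to_dhms() {
    local t=$1
    printf '%d-%02d:%02d:%02d\n' \
        $(( t / 86400 )) $(( t % 86400 / 3600 )) $(( t % 3600 / 60 )) $(( t % 60 ))
}

ncpus=96
elapsed=05:50:33
cputime=$(to_dhms $(( ncpus * $(to_seconds "$elapsed") )))
echo "$cputime"   # prints 23-08:52:48
```

Running the same calculation with 24 CPUs reproduces the 5-20:13:12 shown for 761842.batch.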

Compile and install software

Compilers / module name

  • GCC - gcc
  • Intel - intelics
  • PGI (Portland Group) - pgi

MPI

  • MPICH (MPI "Chameleon" started by Argonne National Laboratory) - mpi/mpich
  • MVAPICH (InfiniBand optimized version by Ohio State University) - mpi/mvapich and mpi/mvapich2
  • OpenMPI - mpi/openmpi
  • Intel MPI - intelics

Shell script

To compile your software

To compile on a node, try using the debug partition and limiting the architecture.

Troubleshooting

LAMMPS error - remove MPI

Appendix

Answers

Challenge 1

# bash
alias scratch='cd /scratch/$USER'
# csh (no equal sign)
alias scratch 'cd /scratch/$USER'

Challenge 2

There is a typo in the first line of the submission file. Because of the space character between bin and /bash, SLURM looks for an executable file called bin in the root directory and passes it /bash as its first argument.

$ diff -u bash-fail.slurm bash-success.slurm 
--- bash-fail.slurm	2017-03-19 22:13:31.473867000 -0400
+++ bash-success.slurm	2017-03-19 22:15:09.780362131 -0400
@@ -1,4 +1,4 @@
-#!/bin /bash
+#!/bin/bash
 # Submit a 1 minute job.
 #SBATCH --partition=phi
 #SBATCH --time=1:00
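
The same failure can be reproduced outside SLURM. A minimal sketch, assuming bash on Linux; the file names bad.sh and good.sh are made up for this example:

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Same first line as bash-fail.slurm: note the space after /bin.
printf '#!/bin /bash\necho hello\n' > "$workdir/bad.sh"
# Corrected first line, as in bash-success.slurm.
printf '#!/bin/bash\necho hello\n' > "$workdir/good.sh"
chmod +x "$workdir/bad.sh" "$workdir/good.sh"

# Run both and record the exit status of each.
if "$workdir/bad.sh"; then bad_status=0; else bad_status=$?; fi
if "$workdir/good.sh"; then good_status=0; else good_status=$?; fi
echo "bad.sh exit status:  $bad_status"
echo "good.sh exit status: $good_status"
rm -r "$workdir"
```

bad.sh fails with a non-zero exit status because /bin is a directory and cannot be run as an interpreter, while good.sh is expected to print hello.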