Debugging Jobs on HPC Cluster
Debugging jobs on HPC Cluster (in progress)
Manual Diagnosis
Finding out which nodes your job is using
The first step is to find out where your job is running (if it is running at all). You can do this with the "sjobs" command or with "squeue -u `whoami` -a"; look under the NodeList column (shown as NODELIST(REASON) in squeue output):
[xxx00000@cn65 ~]$ sjobs
JobID        Partition  QOS        JobName    User      Submit              State      Elapsed    NNodes   NCPUS      NodeList        Start
------------ ---------- ---------- ---------- --------- ------------------- ---------- ---------- -------- ---------- --------------- -------------------
23473        Westmere   normal     sbatch     hpc-xu    2015-09-07T21:06:46 RUNNING    00:00:02   1        1          cn25            2015-09-07T21:06:46
[xxx00000@cn65 ~]$
Or
[xxx00000@cn65 ~]$ squeue -u `whoami` -a
JOBID  PARTITION  QOS     NAME    USER    ST  TIME  NODES  CPUS  NODELIST(REASON)
23473  Westmere   normal  sbatch  hpc-xu  R   1:08  1      1     cn25
[xxx00000@cn65 ~]$
In the example above, the job is running only on cn25. Large parallel jobs routinely run on multiple hosts.
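For scripting, the node lookup can be reduced to a small filter. The following is a minimal sketch, not part of the cluster's own tooling: the function name running_job_nodes is hypothetical, and it assumes squeue is asked for exactly three fields (job ID, state, node list) using its standard -h (no header) and -o (output format) options.

```shell
# Print "JOBID NODELIST" for every RUNNING job read from stdin.
# Expected input: one job per line, formatted as "JOBID STATE NODELIST".
# Pending jobs have an empty node list and are skipped by the state check.
running_job_nodes() {
    awk '$2 == "RUNNING" { print $1, $3 }'
}

# Intended use on a login node (requires Slurm's squeue):
#   squeue -u "$(whoami)" -h -o "%i %T %N" | running_job_nodes
```

Multi-node jobs report a compressed node list such as cn[25-27]; "scontrol show hostnames 'cn[25-27]'" expands it to one hostname per line.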
Checking processes
Before you can check your job, you need a shell on the machine your job is running on. This is easy to do with ssh.
[xxx00000@cn65 ~]$ ssh cn25
Last login: Wed Feb 8 14:17:24 2012 from cn65
[xxx00000@cn25 ~]$
NOTE: ssh access is denied if you do not have any jobs running on that node.
The new prompt shows that any further commands will be issued on the machine cn25. To see whether a job is running, we can use the UNIX ps command.
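As an illustration of the ps step, the sketch below lists only your own processes with a few useful columns. The wrapper name list_user_procs is hypothetical; the -u and -o options and the pid, etime, stat and args column names are standard ps options on Linux.

```shell
# List the processes owned by a user (default: the current user),
# showing PID, elapsed run time, process state and full command line.
list_user_procs() {
    ps -u "${1:-$(whoami)}" -o pid,etime,stat,args
}

# Intended use, after ssh-ing to the compute node:
#   list_user_procs
```

If the job's executable does not appear in the list, the job is probably not running on that node. A state of R in the STAT column means the process is currently running or runnable.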
Checking CPU utilization
To check CPU utilization on a node, first ssh to it, then run the UNIX command "top". For more information on top, type "man top".
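top is interactive; when a single number is enough, the CPU percentages of your processes can be summed instead. This is a rough sketch under stated assumptions: the function name sum_cpu is hypothetical, and the input is expected to come from "ps -o pcpu=", which prints one %CPU value per line with no header. Note that ps reports CPU time averaged over each process's lifetime, so the figure is not the instantaneous load that top shows.

```shell
# Sum the %CPU values read from stdin (one number per line)
# and print the total with one decimal place.
sum_cpu() {
    awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Intended use, after ssh-ing to the compute node:
#   ps -u "$(whoami)" -o pcpu= | sum_cpu
```

A total close to 100 times the number of cores allocated to the job suggests the cores are being kept busy; a much lower total may point to a job that is stalled or waiting on I/O. For a one-off, non-interactive snapshot from top itself, "top -b -n 1" runs it in batch mode for a single iteration.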
Checking RAM utilization
To check total system memory, you can use:
cat /proc/meminfo
To check how much memory your own processes are using:
ps aux | more
or
ps aux | grep `whoami`
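The ps aux listing shows memory per process; to get one total for your job, the RSS column (resident memory, in KiB, the sixth column of ps aux output) can be added up. The following is a minimal sketch: the function name user_rss_mib is hypothetical, and it assumes the user name appears in full in the first column (ps may abbreviate long user names, in which case the match fails).

```shell
# Read `ps aux` output on stdin and print the total resident memory (RSS)
# of all processes owned by user $1, converted from KiB to MiB.
user_rss_mib() {
    awk -v user="$1" '$1 == user { kib += $6 } END { printf "%.1f MiB\n", kib / 1024 }'
}

# Intended use, after ssh-ing to the compute node:
#   ps aux | user_rss_mib "$(whoami)"
```

Comparing this total with the MemTotal line of /proc/meminfo gives a quick idea of how close the job is to exhausting the node's memory.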