Debugging Jobs on HPC Cluster

Revision as of 17:16, 19 February 2016
Debugging jobs on HPC Cluster (in progress)

Manual Diagnosis

Finding out which nodes your job is using

The first step is to find out where your job is running (if it is running at all). You can do this with the "sjobs" command or with "squeue -u `whoami` -a"; the nodes are listed under the NodeList column:

[xxx00000@cn65 ~]$ sjobs
       JobID  Partition        QOS    JobName      User              Submit      State    Elapsed   NNodes      NCPUS        NodeList               Start
------------ ---------- ---------- ---------- --------- ------------------- ---------- ---------- -------- ---------- --------------- -------------------
23473          Westmere     normal     sbatch    hpc-xu 2015-09-07T21:06:46    RUNNING   00:00:02        1          1            cn25 2015-09-07T21:06:46
[xxx00000@cn65 ~]$ 


[xxx00000@cn65 ~]$ squeue -u `whoami` -a
             23473  Westmere    normal   sbatch   hpc-xu  R       1:08      1      1 cn25
[xxx00000@cn65 ~]$

In the above example, the job is running only on cn25. Large parallel jobs routinely use multiple hosts.
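
When a job spans several nodes, it can help to pull the node list out in a form that is easy to loop over. The following is a minimal sketch, assuming a standard Slurm installation in which squeue accepts -h, -t and -o and "scontrol show hostnames" expands a compressed node list. The function names job_nodes and expand_nodes are illustrative and are not commands provided by the cluster.

```shell
# Print "JOBID NODELIST" for each of your running jobs, without a header line.
# job_nodes is a hypothetical helper name, not a cluster-provided command.
job_nodes() {
    squeue -u "$(whoami)" -h -t RUNNING -o "%i %N"
}

# Expand a compressed node list such as cn[25-27] into one hostname per line.
# expand_nodes is likewise a hypothetical helper name.
expand_nodes() {
    scontrol show hostnames "$1"
}
```

After defining the functions in your shell, run job_nodes to see which nodes each running job occupies, and pass a NodeList value to expand_nodes (for example, expand_nodes "cn[25-27]") to get individual hostnames you can ssh to.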

Checking processes

Before you can check your job, you need a shell on the machine your job is running on. This is easy to do with ssh.

 [xxx00000@cn65 ~]$ ssh cn25
 Last login: Wed Feb  8 14:17:24 2012 from cn65
 [xxx00000@cn25 ~]$ 

NOTE: ssh access will be denied if you do not have any jobs running on that node.

Based on the new prompt, we can see that any commands typed will now be issued on the machine cn25. To see if a job is running, we can use the UNIX ps command.
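
As an illustration of the ps step, here is a minimal sketch that lists only your own processes on the node, with the busiest first. It assumes the procps version of ps found on most Linux systems (the --sort option is not available everywhere), and my_procs is a hypothetical helper name.

```shell
# Show your own processes on the current node, sorted by CPU usage (highest first).
# my_procs is a hypothetical helper name; --sort assumes procps ps (Linux).
my_procs() {
    ps -u "$(whoami)" -o pid,pcpu,pmem,etime,args --sort=-pcpu
}
```

If the command your job script launched does not appear in the output of my_procs, the job has probably not started it yet or it has already exited.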

Checking CPU utilization

To check CPU utilization on a node, first ssh to it and then run the UNIX command "top". For more information on top, type "man top".
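
top is interactive by default, which is awkward if you want to save the result or run it over ssh in one line. The sketch below takes a single snapshot instead. It assumes the procps version of top, where -b selects batch mode, -n sets the number of iterations and -u filters by user; cpu_snapshot is a hypothetical helper name.

```shell
# Take one non-interactive snapshot of your processes with top.
# cpu_snapshot is a hypothetical helper name; -b, -n and -u assume procps top.
cpu_snapshot() {
    top -b -n 1 -u "$(whoami)"
}
```

In the %CPU column, a compute-bound process normally sits close to 100 for each core it uses. A value that stays near 0 suggests the process is waiting, for example on I/O or on another process.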

Checking RAM utilization

To check total system memory, you can use:

 cat /proc/meminfo
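
/proc/meminfo has many lines, and its values are in kB. As a rough sketch, the headline figures can be picked out and converted to GiB with awk. The helper name mem_summary is hypothetical, and the MemAvailable line is only printed on kernels that report it.

```shell
# Summarise total, free and available memory in GiB.
# mem_summary is a hypothetical helper; it reads /proc/meminfo unless
# another file is given as the first argument.
mem_summary() {
    awk '/^(MemTotal|MemFree|MemAvailable):/ { printf "%s %.1f GiB\n", $1, $2 / 1048576 }' "${1:-/proc/meminfo}"
}
```

Run mem_summary with no arguments on the compute node to see how much memory the node has and how much is still free.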

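To check how much memory your own processes are using, look at the %MEM and RSS columns of the ps output. You can page through all processes:

 ps aux | more

or show only the lines that match your username:

 ps aux | grep `whoami`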
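
To get a single figure rather than a per-process listing, the resident memory of all your processes can be added up. This is a minimal sketch, assuming a ps that accepts "-o rss=" (which prints the RSS in kB with no header); my_rss_mib is a hypothetical helper name.

```shell
# Sum the resident memory (RSS) of all your processes on this node, in MiB.
# my_rss_mib is a hypothetical helper name; "rss=" suppresses the header line.
my_rss_mib() {
    ps -u "$(whoami)" -o rss= | awk '{ sum += $1 } END { printf "%.1f MiB\n", sum / 1024 }'
}
```

Comparing this total with the MemTotal value from /proc/meminfo gives a quick indication of whether your job is close to exhausting the node's memory. Note that memory shared between processes is counted once per process, so the sum can overestimate.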