SLURM Usage Monitoring
After a job is submitted to SLURM, users may check the CPU/RAM/GPU usage of their current jobs (updated every minute) with the showjob command, as described below.
$ showjob -h
Usage: showjob [OPTIONS]
    -x    show finished job(s) in last x day
          default is to show running/pending job(s)
    -p    comma separated list of partitions to view
          default is any partitions
    -j    comma separated list of jobs IDs
          default is any job
    -s    filter jobs with specific state
              CA   CANCELLED
              CD   COMPLETED
              CF   CONFIGURING
              CG   COMPLETING
              DL   DEADLINE
              F    FAILED
              OOM  OUT_OF_MEMORY
              PD   PENDING
              R    RUNNING
              ST   STOPPED
              S    SUSPENDED
              TO   TIMEOUT
          default is any state
    -w    display jobs on any of these nodes
          default is any node
In the example below, job #20220 was using roughly the same amount of CPU hours and RAM on each of the 3 allocated nodes, which indicates fairly good parallel processing; otherwise, only one or a few of the allocated nodes would be working while the other nodes sat idle. On the other hand, only ~14% of the requested RAM was utilized, so users may take the Peak RAM usage as a reference value when requesting memory in subsequent submissions of similar jobs (see the sketch after the output below).
$ showjob
Job ID: 20220                                          Sat Jan 1 01:00:00 HKT 2022
╒═════════════════╤════════════════════════════════════════╤══════════════════╕
│ User: tmchan    │ Name: sim-2p                           │ State: RUNNING   │
│ QoS: normal     │ Partition: intel                       │ Priority: 17076  │
╞═════════════════╪════════════════════╤═══════════════════╧══════════════════╡
│ Resource        │ Requests           │ Usage                                │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ Node            │ 3                  │ GPI-1-[4,7],GPI-4-10                 │
│ CPU             │ 96                 │ 99.38%                               │
│ RAM             │ 281.25 GB          │ 14.23%                               │
│ Wall time       │ 4-00:00:00         │ 1-16:26:22                           │
│ GPU             │ N/A                │ N/A                                  │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ Per Node Usage  │ CPU hour Usage     │ RAM Usage (GB)                       │
│                 │ Up to Now          │ Now         Peak                     │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ GPI-1-4         │ 1285.983           │ 15.103 GB   18.998 GB                │
│ GPI-1-7         │ 1286.141           │ 12.429 GB   16.328 GB                │
│ GPI-4-10        │ 1285.903           │ 12.490 GB   16.385 GB                │
└─────────────────┴────────────────────┴──────────────────────────────────────┘
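For example, since the peak RAM usage was below ~19 GB per node, a subsequent submission of a similar job could request much less memory than the ~94 GB per node requested here. The job header below is only a hypothetical sketch based on the figures above; the memory and CPU layout are illustrative assumptions, not recommended values.

#!/bin/bash
# Hypothetical resubmission sketch informed by the observed usage of job 20220.
# The per-node memory request is lowered towards the observed peak (~19 GB)
# plus some headroom; adjust the figures to your own application's behaviour.
#SBATCH --job-name=sim-2p
#SBATCH --partition=intel
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=32      # 96 CPU cores in total, as in the example above
#SBATCH --mem=24G                 # per-node memory; the original request was ~94 GB per node
#SBATCH --time=4-00:00:00

# ... application launch commands go here ...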
For jobs that requested GPUs, the GPU IDs and the GPU/RAM usage are shown as below. As not all applications can scale their workload across multiple GPUs, looking at the GPU usage gives users insight into the actual GPU utilization. In this example, only GPU 0 was working at 100% while the other GPUs (GPU 1-3) were idle, which might suggest that a revision of the job resource requests or some tuning of the application parameters is warranted. It is advisable to check the usage from time to time, as resource usage may fluctuate over the course of job execution.
$ showjob
Job ID: 20221                                          Sat Jan 1 01:01:00 HKT 2022
╒═════════════════╤════════════════════════════════════════╤══════════════════╕
│ User: tmchan    │ Name: gpu-sim                          │ State: RUNNING   │
│ QoS: gpu        │ Partition: gpu                         │ Priority: 15441  │
╞═════════════════╪════════════════════╤═══════════════════╧══════════════════╡
│ Resource        │ Requests           │ Usage                                │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ Node            │ 1                  │ SPG-1-4                              │
│ CPU             │ 32                 │ 12.47%                               │
│ RAM             │ 93.75 GB           │ 6.97%                                │
│ Wall time       │ 7-00:00:00         │ 4-06:52:53                           │
├─────────────────┼────────────────────┼──────────────────────┬───────┬───────┤
│ GPU             │ 4(IDX:0-3)         │ GPU Card             │ GPU%  │ RAM   │
│                 ├────────────────────┼──────────────────────┴───────┴───────┤
│                 │ GPU-0              │ Tesla-V100-SXM2-32GB   100 %   30 GB │
│                 │ GPU-1              │ Tesla-V100-SXM2-32GB     0 %    0    │
│                 │ GPU-2              │ Tesla-V100-SXM2-32GB     0 %    0    │
│                 │ GPU-3              │ Tesla-V100-SXM2-32GB     0 %    0    │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ Per Node Usage  │ CPU hour Usage     │ RAM Usage (GB)                       │
│                 │ Up to Now          │ Now         Peak                     │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ SPG-1-4         │ 410.602            │ 6.537 GB    7.614 GB                 │
└─────────────────┴────────────────────┴──────────────────────────────────────┘
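If the application is known to drive only a single GPU, a subsequent submission could request just one GPU together with proportionally fewer CPU cores and less memory. The header below is a hypothetical sketch only; the figures are illustrative assumptions rather than values recommended by the system.

#!/bin/bash
# Hypothetical resubmission sketch for a job that only uses one GPU.
# All resource figures here are illustrative assumptions.
#SBATCH --job-name=gpu-sim
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8       # fewer cores than the original 32, matching the low CPU usage
#SBATCH --gres=gpu:1              # one GPU instead of 4, matching the observed usage
#SBATCH --mem=16G                 # peak RAM was ~7.6 GB; keep some headroom
#SBATCH --time=7-00:00:00

# ... application launch commands go here ...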
Show job usage for all running jobs
$ showjob
Show job usage for a job with job ID “12345”
$ showjob -j 12345
Show job usage for jobs in “intel” partition
$ showjob -p intel
After a job has finished (e.g. COMPLETED/TIMEOUT/FAILED/OUT_OF_MEMORY), users may check its CPU/RAM usage with the command showjob -x <day>, as described below. The job state (see the state code table in showjob -h above) tells whether the job finished normally.
Show finished jobs today
Job “1234” completed normally, as its state was “COMPLETED”. However, attention should be paid to job “1235”: it was aborted when its requested wall time of 1 day was exhausted, and its state became “TIMEOUT”.
$ showjob -x 0
Job ID: 1234                                           Mon Jan 10 16:12:00 HKT 2022
╒═════════════════╤═══════════════════════════════════╤═══════════════════════╕
│ User: tmchan    │ Name: sim                         │ State: COMPLETED      │
│ QoS: normal     │ Partition: intel                  │ Exit code: 0          │
├─────────────────┼───────────────────────────────────┴───────────────────────┤
│ Start time:     │ 2022-01-06 14:05:43                                        │
│ End time:       │ 2022-01-10 04:17:23                                        │
│ Wall time:      │ 3-14:11:40                                                 │
╞═════════════════╪════════════════════╤══════════════════════════════════════╡
│ Resource        │ Requests           │ Usage           Efficiency           │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ Node            │ 3                  │ GPI-1-2,GPI-2-14,GPI-3-17            │
│ CPU             │ 96                 │ 95.673          99.659%              │
│ RAM             │ 281 GB             │ 2.325 GB        0.827%               │
└─────────────────┴────────────────────┴──────────────────────────────────────┘

Job ID: 1235                                           Mon Jan 10 16:12:01 HKT 2022
╒═════════════════╤═══════════════════════════════════╤═══════════════════════╕
│ User: tmchan    │ Name: sim                         │ State: TIMEOUT        │
│ QoS: normal     │ Partition: intel                  │ Exit code: 0          │
├─────────────────┼───────────────────────────────────┴───────────────────────┤
│ Start time:     │ 2022-01-09 15:07:56                                        │
│ End time:       │ 2022-01-10 15:08:01                                        │
│ Wall time:      │ 1-00:00:05                                                 │
╞═════════════════╪════════════════════╤══════════════════════════════════════╡
│ Resource        │ Requests           │ Usage           Efficiency           │
├─────────────────┼────────────────────┼──────────────────────────────────────┤
│ Node            │ 1                  │ GPI-2-1                              │
│ CPU             │ 32                 │ 31.871          99.598%              │
│ RAM             │ 94 GB              │ 9.231 GB        9.820%               │
└─────────────────┴────────────────────┴──────────────────────────────────────┘
Show a finished job today with job ID 12345
$ showjob -x 0 -j 12345
Show finished job(s) today in partition ‘gpu’
$ showjob -x 0 -p gpu
Show finished job(s) with state “TIMEOUT” today
$ showjob -x 0 -s TIMEOUT
Show finished job(s) in the past 7 days
$ showjob -x 7
Show finished job(s) in the past 7 days and in partition ‘gpu’
$ showjob -x 7 -p gpu
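Show finished job(s) with state “TIMEOUT” in partition ‘gpu’ in the past 7 days (assuming the filters combine as in the examples above)
$ showjob -x 7 -p gpu -s TIMEOUT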
List detailed information for a job (for troubleshooting)
$ scontrol show job <JobID>
Checking the resource utilization of a running job
Command: ta <JOB_ID>
$ ta 216
JOBID: 216
================================ GPA-1-20 ===================================
top - 16:41:18 up 149 days, 11:54, 0 users, load average: 20.05, 19.80, 19.73
Tasks: 608 total,   2 running, 606 sleeping,   0 stopped,   0 zombie
Cpu(s): 79.0%us,  1.9%sy,  0.0%ni, 16.0%id,  3.1%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   99077612k total, 10895060k used, 88182552k free,    84436k buffers
Swap: 122878968k total,    19552k used, 122859416k free, 7575444k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+   COMMAND
29144 h0xxxxxx  20   0 97.5g  84m 6200 R 1995.2  1.7  4982:46   l502.exe
 2667 h0xxxxxx  20   0 15932 1500 1248 S    2.0  0.0  0:00.00   top
 2622 h0xxxxxx  20   0 98.8m 1284 1076 S    0.0  0.0  0:00.00   sshd
 2623 h0xxxxxx  20   0  105m  896  696 S    0.0  0.0  0:00.00   g09
 2668 h0xxxxxx  20   0  100m  836 1168 S    0.0  0.0  0:00.00   226.hpc2015
29800 h0xxxxxx  20   0  105m 1172  836 R    0.0  0.0  0:00.00   bash
29801 h0xxxxxx  20   0  100m  848  728 S    0.0  0.0  0:00.00   grep
29802 h0xxxxxx  20   0 98.6m  604  512 S    0.0  0.0  0:00.00   head

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       416G   59G  395G   1% /tmp
You can see the CPU utilization under the CPU stats. This example shows the process l502.exe running in parallel on the 20-core system with 1995.2% CPU utilization (2000% utilization would mean all 20 cores of GPA-1-20 are fully used). The output also provides information such as memory usage (10895060k, i.e. about 10.4 GB used), the runtime of the processes and the local /tmp disk usage (59 GB used).
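As a quick sanity check, dividing the reported %CPU by the number of cores gives the average per-core utilization; for example:

$ awk 'BEGIN { printf "%.1f%% per core\n", 1995.2 / 20 }'
99.8% per core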
Check Historical Usage Efficiencies
“showeff” - Show a summary of resource usage and efficiency of finished jobs
By default, job usage and efficiencies are reported for the past 7 days.
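Show usage and efficiency for the default period (the past 7 days):
$ showeff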
A date range can be specified with -s YYYY-MM-DD and -e YYYY-MM-DD. The command below shows the usage between 1 September 2021 and 1 September 2022:
$ showeff -s 2021-09-01 -e 2022-09-01
# Job Usage and Efficiency for period between 2021-09-01 and 2022-09-01
╒══════════╤═══════════╤══════════╤══════════╤════════════╤════════════╤═════════════════╕
│ Username │ Job Count │ CPU Hour │ GPU Hour │ CPU Eff(%) │ MEM Eff(%) │ Walltime Eff(%) │
├──────────┼───────────┼──────────┼──────────┼────────────┼────────────┼─────────────────┤
│ tmchan   │ 2914      │ 16715    │ 336      │ 70.054     │ 50.058     │ 64.396          │
└──────────┴───────────┴──────────┴──────────┴────────────┴────────────┴─────────────────┘
Check GPU Node Usage
Users can use the command gpu_avail to check the status of the GPU nodes:
$ gpu_avail
╒═════════╤═════╤═══════════════════╤════════════════════╕
│ Compute │ GPU │ TRES per node     │ Available          │
│ node    │ GEN │ CPU RAM(GB) GPU   │ CPU RAM(GB) GPU    │
├─────────┼─────┼───────────────────┼────────────────────┤
│ SPG-1-1 │ VLT │ 32  384     4     │ 22  224     3      │
│ SPG-1-2 │ VLT │ 32  384     4     │  0  290     0      │
│ SPG-1-3 │ VLT │ 32  384     4     │  7  180     1      │
│ SPG-1-4 │ VLT │ 32  384     4     │  0  290     0      │
│ SPG-2-1 │ VLT │ 32  384     8     │ 24  361     4      │
│ SPG-2-2 │ VLT │ 32  384     8     │  1  293     2      │
│ SPG-2-3 │ VLT │ 32  384     8     │  1  246     3      │
│ SPG-3-1 │ ADA │ 64  1000    10    │ 64  1000    10     │
│ SPG-3-2 │ ADA │ 64  1000    10    │ 64  1000    10     │
│ SPG-4-1 │ ADA │ 64  500     8     │ 59  484     4      │
│ SPG-4-2 │ ADA │ 64  500     8     │ 64  500     8      │
└─────────┴─────┴───────────────────┴────────────────────┘