Jobstats
Jobstats [1] is a free and open-source job monitoring platform designed for CPU and GPU clusters that use the Slurm workload manager. The jobstats command generates a job report:
$ jobstats job.id
================================================================================
Slurm Job Statistics
================================================================================
Job ID: job.id
User/Account: user.name/pi.group
Job Name: job.name
State: RUNNING
Nodes: 1
CPU Cores: 1
CPU Memory: 64GB (64GB per CPU-core)
GPUs: 1
QOS/Partition: gpu/l40s
Cluster: hpc2021
Start Time: Sat Jan 11, 2025 at 1:11 PM
Run Time: 01:44:12 (in progress)
Time Limit: 1-00:00:00
Overall Utilization
================================================================================
CPU utilization [|||||||||||||||||||||||||||||||||||||||||||||||99%]
CPU memory usage [| 2%]
GPU utilization [|||||||||||||||||||||||||||||||||||||||||| 84%]
GPU memory usage [|||||||| 17%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
SPG-4-4: 01:42:43/01:44:12 (efficiency=98.6%)
CPU memory usage per node - used/allocated
SPG-4-4: 1.2GB/64GB (1.2GB/64GB per core of 1)
GPU utilization per node
SPG-4-4 (GPU 7): 84.5%
GPU memory usage per node - maximum used/total
SPG-4-4 (GPU 7): 7.5GB/45GB (16.6%)
Notes
================================================================================
* This job only used 2% of the 64GB of total allocated CPU memory. For
future jobs, please allocate less memory by using a Slurm directive such
as --mem-per-cpu=2G or --mem=2G. This will reduce your queue times and
make the resources available to other users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/memory
* Have a nice day!
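The per-node CPU efficiency in the report is labelled as CPU time used divided by run time. As a worked check of the figures above, here is a minimal Python sketch that recomputes the value. The helper names to_seconds and cpu_efficiency are hypothetical and are not part of jobstats. Dividing by the number of cores is an assumption about how the ratio extends to multi-core jobs (the job above uses a single core).

```python
def to_seconds(duration: str) -> int:
    """Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds."""
    days, _, clock = duration.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, run_time: str, cores: int = 1) -> float:
    """CPU time used as a percentage of the core time available.

    Assumption: available core time = run time x number of cores.
    """
    return 100.0 * to_seconds(cpu_time) / (to_seconds(run_time) * cores)


# Values taken from the report above (node SPG-4-4, 1 core).
print(f"{cpu_efficiency('01:42:43', '01:44:12'):.1f}%")  # 98.6%

# Fraction of the 1-day time limit consumed so far.
print(f"{100.0 * to_seconds('01:44:12') / to_seconds('1-00:00:00'):.1f}%")  # 7.2%
```

The first value matches the efficiency=98.6% shown in the report. The second shows that the job had used only a small part of its time limit when the report was generated.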
Reportseff
Reportseff [2] is a Python script that displays Slurm efficiency information in tabular form*. Compared with commands like seff or jobstats, it makes it easier to query job efficiency in cases such as “all jobs from a user over a certain period of time” or “all elements of an array job”. The following are some examples of using the reportseff command:
# By default, when no arguments are supplied, reportseff searches the working directory for files matching patterns like slurm_%j.out, %x_%j, or slurm_%A_%a
$ ls
jobname_3062470.out
$ reportseff
JobID State Elapsed TimeEff CPUEff MemEff
jobname_3062470.out COMPLETED 6-17:22:59 96.1% 96.4% 89.9%
# A custom pattern can be specified with the option --slurm-format. Any format token other than %A, %a, or %j is interpreted by reportseff as the regex .*
$ ls
job-4340434_name.err
$ reportseff --slurm-format %x-%j_%z.err
JobID State Elapsed TimeEff CPUEff MemEff
job-4340434_name.err COMPLETED 00:03:44 24.9% 29.8% 21.4%
# For array jobs, the pattern must include %A_%a to work properly
$ ls
job_5150515_1_tag.log job_5150515_2_tag.log job_5150515_3_tag.log
$ reportseff --slurm-format %x_%A_%a_%z
JobID State Elapsed TimeEff CPUEff MemEff
job_5150515_1_tag.log COMPLETED 00:00:01 1.7% --- 1.2%
job_5150515_2_tag.log COMPLETED 00:00:01 1.7% --- 1.2%
job_5150515_3_tag.log COMPLETED 00:00:01 1.7% --- 1.2%
# Show my jobs on partition l40s in the past 7 days (default time window), adding GPU efficiency
$ reportseff --partition l40s --format +gpu
JobID State Elapsed TimeEff CPUEff MemEff GPUEff GPUMem
6460646 COMPLETED 1-21:33:18 91.1% 95.9% 16.6% 9.2% 24.8%
6460647 TIMEOUT 2-02:00:13 100.0% 95.2% 15.8% 10.1% 24.6%
6460648 CANCELLED 1-01:37:33 51.3% 96.7% 27.4% 5.4% 24.6%
6460649 CANCELLED 1-01:29:46 51.0% 96.7% 28.0% 4.9% 24.6%
# Show my COMPLETED jobs between 3 days ago and now, adding 2 fields (requested CPUs and requested memory)
$ reportseff --since now-3days --until now -s CD --format +reqcpus,reqmem
JobID State Elapsed TimeEff CPUEff MemEff ReqCPUS ReqMem
6960696 COMPLETED 6-18:31:23 96.7% 97.1% 89.4% 64 975G
6960697 COMPLETED 6-23:03:54 99.4% 96.7% 95.7% 64 896G
6960698 COMPLETED 6-11:50:39 92.8% 95.6% 90.8% 64 960G
6960699 COMPLETED 6-18:16:44 96.6% 97.4% 89.8% 64 636G
6960670 COMPLETED 6-17:22:59 96.1% 96.4% 89.9% 64 664G
* When using reportseff, efficiency information is shown only after a job has completed.
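The --slurm-format examples above state that %A, %a, and %j are the only tokens that carry meaning and that every other token is treated as the regex .*. The following Python sketch illustrates that idea by translating such a pattern into a regular expression and extracting the job ID from a file name. It is an illustration under that stated rule only, not reportseff's actual implementation. The function names pattern_to_regex and extract_job_id are hypothetical.

```python
import re
from typing import Optional

# Tokens that identify a job; every other %-token becomes ".*".
NAMED_TOKENS = {
    "%A": r"(?P<array_job>\d+)",
    "%a": r"(?P<array_task>\d+)",
    "%j": r"(?P<job>\d+)",
}


def pattern_to_regex(slurm_format: str) -> re.Pattern:
    """Translate a --slurm-format style pattern into a compiled regex."""
    parts = []
    # The capturing group keeps the %-tokens in the split result.
    for piece in re.split(r"(%[A-Za-z])", slurm_format):
        if piece in NAMED_TOKENS:
            parts.append(NAMED_TOKENS[piece])
        elif re.fullmatch(r"%[A-Za-z]", piece):
            parts.append(".*")  # any other token matches anything
        else:
            parts.append(re.escape(piece))  # literal text such as "_" or ".err"
    return re.compile("".join(parts))


def extract_job_id(slurm_format: str, filename: str) -> Optional[str]:
    """Return the job ID found in filename, or None if it does not match."""
    match = pattern_to_regex(slurm_format).fullmatch(filename)
    if match is None:
        return None
    groups = match.groupdict()
    if groups.get("job"):
        return groups["job"]
    if groups.get("array_job") and groups.get("array_task"):
        return f"{groups['array_job']}_{groups['array_task']}"
    return None


print(extract_job_id("%x-%j_%z.err", "job-4340434_name.err"))  # 4340434
print(extract_job_id("%x_%A_%a_%z", "job_5150515_2_tag.log"))  # 5150515_2
```

In this sketch, an array task can only be identified when both %A and %a appear in the pattern, which mirrors the requirement noted in the array job example.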
[1] https://github.com/PrincetonUniversity/jobstats
[2] https://github.com/troycomi/reportseff
