jobstats

Jobstats [1] is a free and open-source job monitoring platform for CPU and GPU clusters that use the Slurm workload manager. Running the jobstats command with a job ID generates a report for that job:

$ jobstats job.id

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: job.id
   User/Account: user.name/pi.group
       Job Name: job.name
          State: RUNNING
          Nodes: 1
      CPU Cores: 1
     CPU Memory: 64GB (64GB per CPU-core)
           GPUs: 1
  QOS/Partition: gpu/l40s
        Cluster: hpc2021
     Start Time: Sat Jan 11, 2025 at 1:11 PM
       Run Time: 01:44:12 (in progress)
     Time Limit: 1-00:00:00

                          Overall Utilization
================================================================================
  CPU utilization  [|||||||||||||||||||||||||||||||||||||||||||||||99%] <- Exact value
  CPU memory usage [|                                               2%] <- Peak value
  GPU utilization  [||||||||||||||||||||||||||||||||||||||||||     84%] <- Average value
  GPU memory usage [||||||||                                       17%] <- Peak value

                          Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      SPG-4-4: 01:42:43/01:44:12 (efficiency=98.6%)

  CPU memory usage per node - used/allocated
      SPG-4-4: 1.2GB/64GB (1.2GB/64GB per core of 1)

  GPU utilization per node
      SPG-4-4 (GPU 7): 84.5%

  GPU memory usage per node - maximum used/total
      SPG-4-4 (GPU 7): 7.5GB/45GB (16.6%)

                                 Notes
================================================================================
  * This job only used 2% of the 64GB of total allocated CPU memory. For
    future jobs, please allocate less memory by using a Slurm directive such
    as --mem-per-cpu=2G or --mem=2G. This will reduce your queue times and
    make the resources available to other users. For more info:
      https://researchcomputing.princeton.edu/support/knowledge-base/memory

  * Have a nice day!
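The memory note above can be acted on directly in a Slurm batch script. The sketch below is a hypothetical example, not taken from the jobstats documentation: the job name, module setup, and executable name are placeholders, and the memory request is sized from the ~1.2GB peak usage reported above.

```shell
#!/bin/bash
#SBATCH --job-name=myjob          # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G          # sized from the ~1.2GB peak reported by jobstats
#SBATCH --gres=gpu:1              # one GPU, matching the example job
#SBATCH --time=02:00:00           # request close to the expected run time

srun ./my_program                 # placeholder executable
```

Requesting memory close to actual peak usage (with some headroom) shortens queue times and frees the remaining memory for other jobs, as the note suggests.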

[1] https://github.com/PrincetonUniversity/jobstats
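The CPU efficiency figure in the report is simply CPU time used divided by wall-clock run time. A minimal sketch reproducing the calculation for the example job above (the helper function is illustrative, not part of jobstats):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to total seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s

cpu_time = hms_to_seconds("01:42:43")   # CPU time used on node SPG-4-4
run_time = hms_to_seconds("01:44:12")   # wall-clock run time so far
efficiency = 100 * cpu_time / run_time
print(f"efficiency={efficiency:.1f}%")  # efficiency=98.6%
```

A value near 100% means the allocated core was busy for almost the entire run, which is why the report flags memory, not CPU, as the over-allocated resource here.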