Jobstats

Jobstats [1] is a free and open-source job monitoring platform designed for CPU and GPU clusters that use the Slurm workload manager. The jobstats command generates a job report:

$ jobstats job.id

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: job.id
   User/Account: user.name/pi.group
       Job Name: job.name
          State: RUNNING
          Nodes: 1
      CPU Cores: 1
     CPU Memory: 64GB (64GB per CPU-core)
           GPUs: 1
  QOS/Partition: gpu/l40s
        Cluster: hpc2021
     Start Time: Sat Jan 11, 2025 at 1:11 PM
       Run Time: 01:44:12 (in progress)
     Time Limit: 1-00:00:00

                          Overall Utilization
================================================================================
  CPU utilization  [|||||||||||||||||||||||||||||||||||||||||||||||99%]
  CPU memory usage [|                                               2%]
  GPU utilization  [||||||||||||||||||||||||||||||||||||||||||     84%]
  GPU memory usage [||||||||                                       17%]

                          Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      SPG-4-4: 01:42:43/01:44:12 (efficiency=98.6%)

  CPU memory usage per node - used/allocated
      SPG-4-4: 1.2GB/64GB (1.2GB/64GB per core of 1)

  GPU utilization per node
      SPG-4-4 (GPU 7): 84.5%

  GPU memory usage per node - maximum used/total
      SPG-4-4 (GPU 7): 7.5GB/45GB (16.6%)

                                 Notes
================================================================================
  * This job only used 2% of the 64GB of total allocated CPU memory. For
    future jobs, please allocate less memory by using a Slurm directive such
    as --mem-per-cpu=2G or --mem=2G. This will reduce your queue times and
    make the resources available to other users. For more info:
      https://researchcomputing.princeton.edu/support/knowledge-base/memory

  * Have a nice day!
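
The memory note above maps directly onto Slurm batch directives. Below is a minimal sketch of a right-sized job script, assuming the same resources as the report above (1 CPU core and 1 GPU on the l40s partition); the job name and the program it runs are placeholders, so adapt the directives to your own cluster and workload:

#!/bin/bash
#SBATCH --job-name=job.name       # placeholder job name
#SBATCH --partition=l40s          # partition from the report above
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1              # one GPU, as in the report
#SBATCH --mem-per-cpu=2G          # right-sized: the job only used 1.2GB of the 64GB allocated
#SBATCH --time=1-00:00:00

python train.py                   # placeholder workload

With --mem-per-cpu=2G the job keeps comfortable headroom above the 1.2GB it actually used, while the remaining ~62GB stay available to other users, which also helps shorten queue times.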

Reportseff

Reportseff [2] is a Python script for tabular display of Slurm efficiency information*. Compared to commands like seff or jobstats, it makes it easier to query job efficiency in cases like “all jobs from a user over a certain period of time” or “all elements of an array job”. The following are some examples of using the reportseff command:

# When no arguments are supplied, reportseff searches the working directory for files matching patterns like slurm_%j.out, %x_%j, or slurm_%A_%a
$ ls
  jobname_3062470.out
$ reportseff
                 JobID    State         Elapsed  TimeEff   CPUEff   MemEff
   jobname_3062470.out  COMPLETED    6-17:22:59   96.1%    96.4%    89.9%

# A custom pattern can be specified with the --slurm-format option; any format token besides %A, %a, or %j is interpreted by reportseff as the .* regex
$ ls
  job-4340434_name.err
$ reportseff --slurm-format %x-%j_%z.err
                 JobID    State       Elapsed  TimeEff   CPUEff   MemEff
  job-4340434_name.err  COMPLETED    00:03:44   24.9%    29.8%    21.4%

# For array jobs, the pattern must include %A_%a to work properly
$ ls
  job_5150515_1_tag.log  job_5150515_2_tag.log  job_5150515_3_tag.log
$ reportseff --slurm-format %x_%A_%a_%z
                  JobID    State       Elapsed  TimeEff   CPUEff   MemEff
  job_5150515_1_tag.log  COMPLETED    00:00:01   1.7%      ---      1.2%
  job_5150515_2_tag.log  COMPLETED    00:00:01   1.7%      ---      1.2%
  job_5150515_3_tag.log  COMPLETED    00:00:01   1.7%      ---      1.2%

# Show my jobs on the l40s partition over the past 7 days (the default time window), adding the GPU efficiency and memory columns
$ reportseff --partition l40s --format +gpu
    JobID    State         Elapsed  TimeEff   CPUEff   MemEff   GPUEff   GPUMem
  6460646  COMPLETED    1-21:33:18   91.1%    95.9%    16.6%     9.2%    24.8%
  6460647   TIMEOUT     2-02:00:13  100.0%    95.2%    15.8%    10.1%    24.6%
  6460648  CANCELLED    1-01:37:33   51.3%    96.7%    27.4%     5.4%    24.6%
  6460649  CANCELLED    1-01:29:46   51.0%    96.7%    28.0%     4.9%    24.6%

# Show my COMPLETED jobs between 3 days ago and now, adding two fields (requested CPUs and memory)
$ reportseff --since now-3days --until now -s CD --format +reqcpus,reqmem
    JobID    State         Elapsed  TimeEff   CPUEff   MemEff   ReqCPUS   ReqMem
  6960696  COMPLETED    6-18:31:23   96.7%    97.1%    89.4%      64       975G
  6960697  COMPLETED    6-23:03:54   99.4%    96.7%    95.7%      64       896G
  6960698  COMPLETED    6-11:50:39   92.8%    95.6%    90.8%      64       960G
  6960699  COMPLETED    6-18:16:44   96.6%    97.4%    89.8%      64       636G
  6960670  COMPLETED    6-17:22:59   96.1%    96.4%    89.9%      64       664G
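
Besides matching Slurm output files, reportseff also accepts job IDs directly as arguments, and another user's jobs can be queried with the --user option (assuming the Slurm accounting configuration permits it). The commands below are a sketch only; the job IDs and user name are placeholders, and the output uses the same tabular format as the examples above:

# Query specific job IDs directly; no output files are needed
$ reportseff 6960696 6960697

# Query a given user's jobs over the past 7 days (user name is a placeholder)
$ reportseff --user user.name --since now-7days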

* reportseff only shows efficiency information after a job has completed


[1] https://github.com/PrincetonUniversity/jobstats
[2] https://github.com/troycomi/reportseff