SLURM Job Management

After a job is submitted to SLURM, users may check its status with the commands sq or showq, as described below.

Show any running/pending jobs
$ sq
JOBID PARTITION NAME   ST USER     QOS   NODES CPUS TRES_PER_NODE TIME_LIMIT TIME_LEFT NODELIST(REASON) 
123   intel     test1  R  hku_user normal 1     32   N/A          4-00:00:00 3-21:21:20 GPI-1-19 
124   gpu       para_g R  hku_user gpu    1     8   gpu:2         4-00:00:00 3-21:29:39 SPG-1-1
Show a specific job, sq -j <JobID>
$ sq -j 123456
Show jobs in a specific partition, sq -p <partition>
$ sq -p intel
Show running jobs
$ sq -t R
Show pending jobs
$ sq -t PD
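These filters can be combined. Assuming the sq wrapper forwards multiple options together (as the underlying SLURM squeue command does), the following would list only pending jobs in the intel partition:
$ sq -p intel -t PD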
Job information provided
  • JOBID: Job ID
  • PARTITION: Partition
  • NAME: Job name given
  • ST (status):
    Status  Description
    R       Running
    PD      Pending (queued)
    CD      Completed (exit code 0, without error)
    F       Failed (non-zero exit code)
    DL      Failed (job terminated on deadline)
  • NODES: Number of nodes requested
  • CPUS: Number of CPUs requested
  • TRES_PER_NODE: Resources
  • TIME_LIMIT: Requested wall time
  • TIME_LEFT: Remaining wall time
  • NODELIST: List of the nodes that the job is using
  • NODELIST(REASON): For pending jobs, shows the reason for the current job status (see the table and the example below)
    Reason                 Description
    Priority               The job is waiting for higher-priority job(s) to complete
    Dependency             The job is waiting for a job it depends on to complete
    Resources              The job is waiting for resources to become available
    InvalidQoS             The job's QoS is invalid; cancel the job and resubmit it with the correct QoS
    QOSGrpMaxJobsLimit     The maximum number of jobs for your job's QoS is already in use
    PartitionCpuLimit      All CPUs assigned to your job's specified partition are in use
    PartitionMaxJobsLimit  The maximum number of jobs for your job's specified partition has been reached
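A job's full record, including the raw reason shown above, can also be inspected with the standard SLURM scontrol command:
$ scontrol show job <JobID>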
Show a summary of active and waiting jobs
$ showq
SUMMARY OF JOBS FOR USER: <hku_user> 
ACTIVE JOBS-------------------- 
JOBID     JOBNAME   USERNAME     STATE   CORE   NODE QUEUE         REMAINING STARTTIME 
=================================================================================================== 
10721     hpl       hku_user     Running 64     2   intel           2:06:56 Mon Aug  9 17:50:21 
WAITING JOBS------------------------ 
JOBID     JOBNAME   USERNAME     STATE   CORE HOST QUEUE           WCLIMIT QUEUETIME
=================================================================================================== 
Total Jobs: 1     Active Jobs: 1     Idle Jobs: 0     Blocked Jobs: 0
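Jobs that have finished (states CD, F and DL above) no longer appear in sq or showq. Assuming SLURM accounting is enabled on the cluster, the standard sacct command can report them, for example:
$ sacct -j <JobID> --format=JobID,State,ExitCode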

Delete / cancel a job

$ scancel <JobID>

Delete / cancel all jobs for a user

$ scancel -u <Username>
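In standard SLURM, scancel can also filter by job state, so, for example, only a user's pending jobs can be cancelled while running jobs are left untouched:
$ scancel -u <Username> -t PENDING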

Update attributes of submitted jobs

Update the walltime request of a queued job (a job which is pending and has not yet started to run) to 1 hour. Once a job is running, its requested walltime can only be shortened.

$ scontrol update jobid=<JobID> TimeLimit=01:00:00
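Other attributes of a pending job can be updated the same way; for example, moving a queued job to another partition (the partition name here is illustrative, taken from the plist output below):
$ scontrol update jobid=<JobID> Partition=amd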

Check Partition/Node Usage

Users can use the command plist to check the status of partitions and nodes.

$ plist
PARTITION NODES NODES(A/I/O/T) S:C:T MEMORY   TIMELIMIT  AVAIL_FEATURES            NODELIST 
intel*    84    57/25/2/84     2:16:1 192000   4-00:00:00 CPU_MNF:INTEL,CPU_SKU:622 GPI-1-[1-20],GPI-2-[1-64] 
amd       28    16/12/0/28     2:64:1 512000   4-00:00:00 CPU_MNF:AMD,CPU_SKU:7742, GPA-2-[1-28] 
amd       28    16/12/0/28     2:32:1 256000   4-00:00:00 CPU_MNF:AMD,CPU_SKU:7542, GPA-1-[1-28] 
gpu       7     6/1/0/7        2:16:1 384000   7-00:00:00 CPU_MNF:INTEL,CPU_SKU:622 SPG-1-[1-4],SPG-2-[1-3] 
hugemem   2     1/1/0/2        2:64:1 2048000  7-00:00:00 CPU_MNF:AMD,CPU_SKU:7742, SPH-1-[1-2]

where

  • NODES(A/I/O/T) shows the count of nodes in state “allocated/idle/other/total”
  • S:C:T shows the count of sockets (S), cores per socket (C) and threads per core (T) on the nodes
  • AVAIL_FEATURES gives the node features, which can be used as a “Constraint” when requesting nodes (see the example below)
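As a sketch, a job script could request nodes by feature with the standard SLURM constraint option, using a feature name from the AVAIL_FEATURES column above:
#SBATCH --partition=amd
#SBATCH --constraint="CPU_MNF:AMD"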