SLURM Job Management
After a job is submitted to SLURM, users may check its status with the commands sq or showq, as described below.
Show any running/pending jobs
$ sq
JOBID PARTITION NAME   ST USER     QOS    NODES CPUS TRES_PER_NODE TIME_LIMIT TIME_LEFT  NODELIST(REASON)
123   intel     test1  R  hku_user normal 1     32   N/A           4-00:00:00 3-21:21:20 GPI-1-19
124   gpu       para_g R  hku_user gpu    1     8    gpu:2         4-00:00:00 3-21:29:39 SPG-1-1
Show a specific job, sq -j <JobID>
$ sq -j 123456
Show jobs in a specific partition, sq -p <partition>
$ sq -p intel
Show running jobs
$ sq -t R
Show pending jobs
$ sq -t PD
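These filters can be combined. Since sq appears to be a site wrapper around SLURM's squeue, the example below assumes it forwards the standard squeue options -u, -p and -t:

# List your own pending jobs in the intel partition
# (assumes sq passes these options through to squeue)
$ sq -u <Username> -p intel -t PD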
Job information provided
- JOBID: Job ID
- PARTITION: Partition
- NAME: Job name given
- ST (status):
Status | Description |
---|---|
R | Running |
PD | Pending (queuing) |
CD | Completed (exit code 0, without error) |
F | Failure (exit code non-zero) |
DL | Failure (job terminated on deadline) |
- NODES: Number of nodes requested
- CPUS: Number of CPUs requested
- TRES_PER_NODE: Trackable resources requested per node (e.g. gpu:2)
- TIME_LIMIT: Requested wall time
- TIME_LEFT: Remaining wall time
- NODELIST: List of the nodes that the job is using
- NODELIST(REASON): For pending jobs, shows the reason for the current job status
Reason | Description |
---|---|
Priority | The job is waiting for higher priority job(s) to complete |
Dependency | The job is waiting for a dependent job to complete |
Resources | The job is waiting for resources to become available |
InvalidQoS | The job’s QoS is invalid; cancel it and resubmit with the correct QoS |
QOSGrpMaxJobsLimit | The maximum number of jobs for your job’s QoS has been reached |
PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use |
PartitionMaxJobsLimit | The maximum number of jobs for your job’s specified partition has been reached |
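For more detail on why a specific job is pending, the stock SLURM commands below may help. These are standard SLURM tools rather than site wrappers, so their availability on the login node is an assumption:

# Full job record, including the Reason field for a pending job
$ scontrol show job <JobID>
# SLURM's estimate of when the pending job will start
$ squeue --start -j <JobID>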
Show a summary of jobs for the user
$ showq
SUMMARY OF JOBS FOR USER: <hku_user>

ACTIVE JOBS--------------------
JOBID  JOBNAME  USERNAME  STATE    CORE  NODE  QUEUE  REMAINING  STARTTIME
===================================================================================================
10721  hpl      hku_user  Running  64    2     intel  2:06:56    Mon Aug 9 17:50:21

WAITING JOBS------------------------
JOBID  JOBNAME  USERNAME  STATE    CORE  HOST  QUEUE  WCLIMIT  QUEUETIME
===================================================================================================

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
Delete / cancel a job
$ scancel <JobID>
Delete / cancel all jobs for a user
$ scancel -u <Username>
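scancel also accepts filters, so a subset of jobs can be cancelled in one command. A sketch using standard scancel options (assuming no site-specific restrictions):

# Cancel only your pending jobs, leaving running jobs untouched
$ scancel -u <Username> -t PENDING
# Cancel all your jobs in a given partition
$ scancel -u <Username> -p <partition>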
Update attributes of submitted jobs
Update the walltime request of a queuing job (a job which is pending and has not yet started to run) to 1 hour. Once a job is running, its requested walltime can only be shortened.
$ scontrol update jobid=<JobID> TimeLimit=01:00:00
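To confirm the change, or to temporarily keep a pending job from starting, the standard scontrol subcommands can be used (assuming ordinary user permissions):

# Verify that the new time limit took effect
$ scontrol show job <JobID> | grep TimeLimit
# Hold a pending job, then release it later
$ scontrol hold <JobID>
$ scontrol release <JobID>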
Check Partition/Node Usage
Users can use the command plist to check the status of partitions and nodes.
$ plist
PARTITION NODES NODES(A/I/O/T) S:C:T  MEMORY  TIMELIMIT  AVAIL_FEATURES            NODELIST
intel*    84    57/25/2/84     2:16:1 192000  4-00:00:00 CPU_MNF:INTEL,CPU_SKU:622 GPI-1-[1-20],GPI-2-[1-64]
amd       28    16/12/0/28     2:64:1 512000  4-00:00:00 CPU_MNF:AMD,CPU_SKU:7742, GPA-2-[1-28]
amd       28    16/12/0/28     2:32:1 256000  4-00:00:00 CPU_MNF:AMD,CPU_SKU:7542, GPA-1-[1-28]
gpu       7     6/1/0/7        2:16:1 384000  7-00:00:00 CPU_MNF:INTEL,CPU_SKU:622 SPG-1-[1-4],SPG-2-[1-3]
hugemem   2     1/1/0/2        2:64:1 2048000 7-00:00:00 CPU_MNF:AMD,CPU_SKU:7742, SPH-1-[1-2]
where
- NODES(A/I/O/T) shows the count of nodes in each state: “allocated/idle/other/total”
- S:C:T shows the count of sockets (S), cores per socket (C), and threads per core (T) on the nodes
- AVAIL_FEATURES lists the node features, which can be used as a job “Constraint”, as in the example below
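For example, a batch script can request nodes with a particular feature via the standard --constraint option of sbatch. This is a sketch, with the feature string taken from the AVAIL_FEATURES column above:

#!/bin/bash
#SBATCH --partition=amd
#SBATCH --constraint="CPU_MNF:AMD"   # feature string as listed by plist
#SBATCH --ntasks=64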