SLURM Job Management
After a job is submitted to SLURM, users may check its status with the commands sq or showq, as described below.
Show any running/pending jobs
$ sq
JOBID PARTITION NAME   ST USER     QOS    NODES CPUS TRES_PER_NODE TIME_LIMIT TIME_LEFT  NODELIST(REASON)
123   intel     test1  R  hku_user normal 1     32   N/A           4-00:00:00 3-21:21:20 GPI-1-19
124   gpu       para_g R  hku_user gpu    1     8    gpu:2         4-00:00:00 3-21:29:39 SPG-1-1
Show a specific job, sq -j <JobID>
$ sq -j 123456
Show jobs in a specific partition, sq -p <partition>
$ sq -p intel
Show running jobs
$ sq -t R
Show pending jobs
$ sq -t PD
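These filters can be combined. Since sq appears to be a site wrapper around SLURM's squeue, the example below assumes it forwards the standard squeue options -u, -p and -t:

# List your own pending jobs in the intel partition
# (assumes sq passes these options through to squeue)
$ sq -u <Username> -p intel -t PD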
Job information provided
- JOBID: Job ID
- PARTITION: Partition
- NAME: Job name given
- ST (status):
Status | Description |
---|---|
R | Running |
PD | Pending (queuing) |
CD | Completed (exit code 0, without error) |
F | Failure (exit code non-zero) |
DL | Failure (job terminated on deadline) |
- NODES: Number of nodes requested
- CPUS: Number of CPUs requested
- TRES_PER_NODE: Trackable resources requested per node (e.g. gpu:2)
- TIME_LIMIT: Requested wall time
- TIME_LEFT: Remaining wall time
- NODELIST: List of the nodes that the job is using
- NODELIST(REASON): For pending jobs, shows the reason for the current job status
Reason | Description |
---|---|
Priority | The job is waiting for higher priority job(s) to complete |
Dependency | The job is waiting for a dependent job to complete |
Resources | The job is waiting for resources to become available |
InvalidQoS | The job’s QoS is invalid; cancel it and resubmit with the correct QoS |
QOSGrpMaxJobsLimit | The maximum number of jobs for your job’s QoS has been reached |
PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use |
PartitionMaxJobsLimit | The maximum number of jobs for your job’s specified partition has been reached |
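For more detail on why a specific job is pending, the stock SLURM commands below may help. These are standard SLURM tools rather than site wrappers, so their availability on the login node is an assumption:

# Full job record, including the Reason field for a pending job
$ scontrol show job <JobID>
# SLURM's estimate of when the pending job will start
$ squeue --start -j <JobID>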
Show a summary of jobs for the user
$ showq
SUMMARY OF JOBS FOR USER: <hku_user>

ACTIVE JOBS--------------------
JOBID  JOBNAME  USERNAME  STATE    CORE  NODE  QUEUE  REMAINING  STARTTIME
===================================================================================================
10721  hpl      hku_user  Running  64    2     intel  2:06:56    Mon Aug 9 17:50:21

WAITING JOBS------------------------
JOBID  JOBNAME  USERNAME  STATE    CORE  HOST  QUEUE  WCLIMIT  QUEUETIME
===================================================================================================

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
Delete / cancel a job
$ scancel <JobID>
Delete / cancel all jobs for a user
$ scancel -u <Username>
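scancel also accepts filters, so a subset of jobs can be cancelled in one command. A sketch using standard scancel options (assuming no site-specific restrictions):

# Cancel only your pending jobs, leaving running jobs untouched
$ scancel -u <Username> -t PENDING
# Cancel all your jobs in a given partition
$ scancel -u <Username> -p <partition>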
Update attributes of submitted jobs
Update the walltime request of a queuing job (a job which is pending and has not yet started to run) to 1 hour. Once a job is running, its requested walltime can only be shortened.
$ scontrol update jobid=<JobID> TimeLimit=01:00:00
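To confirm the change, or to temporarily keep a pending job from starting, the standard scontrol subcommands can be used (assuming ordinary user permissions):

# Verify that the new time limit took effect
$ scontrol show job <JobID> | grep TimeLimit
# Hold a pending job, then release it later
$ scontrol hold <JobID>
$ scontrol release <JobID>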
Check Partition/Node Usage
Users can use the command plist to check the status of partitions and nodes.
$ plist
PARTITION NODES NODES(A/I/O/T) S:C:T  MEMORY  TIMELIMIT  AVAIL_FEATURES            NODELIST
intel*    84    57/25/2/84     2:16:1 192000  4-00:00:00 CPU_MNF:INTEL,CPU_SKU:622 GPI-1-[1-20],GPI-2-[1-64]
amd       28    16/12/0/28     2:64:1 512000  4-00:00:00 CPU_MNF:AMD,CPU_SKU:7742, GPA-2-[1-28]
amd       28    16/12/0/28     2:32:1 256000  4-00:00:00 CPU_MNF:AMD,CPU_SKU:7542, GPA-1-[1-28]
gpu       7     6/1/0/7        2:16:1 384000  7-00:00:00 CPU_MNF:INTEL,CPU_SKU:622 SPG-1-[1-4],SPG-2-[1-3]
hugemem   2     1/1/0/2        2:64:1 2048000 7-00:00:00 CPU_MNF:AMD,CPU_SKU:7742, SPH-1-[1-2]
where
- NODES(A/I/O/T) shows the count of nodes in each state: “allocated/idle/other/total”
- S:C:T shows the count of sockets (S), cores per socket (C), and threads per core (T) on the nodes
- AVAIL_FEATURES lists the node features, which can be used as a job “Constraint”, as in the example below
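For example, a batch script can request nodes with a particular feature via the standard --constraint option of sbatch. This is a sketch, with the feature string taken from the AVAIL_FEATURES column above:

#!/bin/bash
#SBATCH --partition=amd
#SBATCH --constraint="CPU_MNF:AMD"   # feature string as listed by plist
#SBATCH --ntasks=64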