SLURM Job Management
After a job is submitted to SLURM, users can check its status with the command sq or showq, as described below.
Show any running/pending jobs
Show specific job, sq -j <JobID>
$ sq -j 123456
Show jobs in a specific partition, sq -p <partition>
$ sq -p intel
Show running jobs
$ sq -t R
Show pending jobs
$ sq -t PD
Job information provided
- JOBID: Job ID
- PARTITION: Partition
- NAME: Job name given
- ST (status):
Status | Description |
---|---|
R | Running |
PD | Pending (queuing) |
CD | Completed (exit code 0 — without error) |
F | Failure (exit code non-zero) |
DL | Failure (job terminated on deadline) |
- NODES: Number of nodes requested
- CPUs: Number of CPUs requested
- TRES_PER_NODE: Resources
- TIME_LIMIT: Requested wall time
- TIME_LEFT: Remaining wall time
- NODELIST: List of the nodes which the job is using
- NODELIST(REASON): Shows the reason that explains the current job status
Reason | Description |
---|---|
Priority | The job is waiting for higher priority job(s) to complete |
Dependency | The job is waiting for a dependent job to complete |
Resources | The job is waiting for resources to become available |
InvalidQoS | The job’s QoS is invalid. Cancel the job and resubmit it with the correct QoS |
QOSGrpMaxJobsLimit | The maximum number of jobs for your job’s QoS is already in use |
PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use |
PartitionMaxJobsLimit | The maximum number of jobs for your job’s specified partition has been reached |
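The ST codes listed above are terse, so a script that reports on jobs may want to expand them. The following is a minimal sketch: `describe_state` is a hypothetical helper (not part of SLURM or of the site's `sq` command) that maps each code to the description given in the status table.

```shell
#!/bin/sh
# describe_state: hypothetical helper, not provided by SLURM or sq.
# Maps an ST code to the description from the status table above.
describe_state() {
    case "$1" in
        R)  echo "Running" ;;
        PD) echo "Pending (queuing)" ;;
        CD) echo "Completed (exit code 0, without error)" ;;
        F)  echo "Failure (exit code non-zero)" ;;
        DL) echo "Failure (job terminated on deadline)" ;;
        *)  echo "Unknown state: $1" ;;
    esac
}

# Possible usage with standard squeue (-h: no header, -o %t: compact state):
#   describe_state "$(squeue -h -j 123456 -o %t)"
```

Codes that are not in the table fall through to the "Unknown state" branch rather than being guessed at.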
Delete / cancel a job
$ scancel <JobID>
Delete / cancel all jobs for a user
$ scancel -u <Username>
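Because `scancel -u` removes every job belonging to the user, it can be safer to narrow the selection and to preview the command first. The sketch below assumes the standard `scancel -t` (state) option. `cancel_cmd` is a hypothetical dry-run helper that only prints the command it would run.

```shell
#!/bin/sh
# cancel_cmd: hypothetical dry-run helper; it prints the scancel command
# instead of executing it, so the selection can be reviewed first.
# $1 = user name, $2 = job state (defaults to PENDING).
cancel_cmd() {
    user=$1
    state=${2:-PENDING}
    echo "scancel -u $user -t $state"
}

# Preview, then run by hand once the command looks right:
#   cancel_cmd "$USER"            # only pending jobs
#   cancel_cmd "$USER" RUNNING    # only running jobs
```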
Update attributes of submitted jobs
Update the walltime request of a queuing job (a job that is pending and has not yet started to run) to 1 hour. Once a job is running, its requested walltime can only be shortened.
$ scontrol update jobid=<JobID> TimeLimit=01:00:00
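When the new limit is computed in a script, it has to be formatted as a SLURM time string. The sketch below converts a number of minutes into `HH:MM:SS`, or `D-HH:MM:SS` for a day or more. `minutes_to_timelimit` is a hypothetical helper name, not a SLURM command.

```shell
#!/bin/sh
# minutes_to_timelimit: hypothetical helper, not part of SLURM.
# Converts whole minutes into a TimeLimit string (HH:MM:SS or D-HH:MM:SS).
minutes_to_timelimit() {
    total=$1
    days=$((total / 1440))            # 1440 minutes per day
    hours=$(( (total % 1440) / 60 ))
    mins=$((total % 60))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
    else
        printf '%02d:%02d:00\n' "$hours" "$mins"
    fi
}

# Possible usage:
#   scontrol update jobid=<JobID> TimeLimit="$(minutes_to_timelimit 60)"
```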
Check Partition/Node Usage
Users can run the command plist to check the status of partitions and nodes. In its output:
- NODES(A/I/O/T) shows the number of nodes in each state: “allocated/idle/other/total”
- S:C:T shows the number of sockets (S), cores (C) per socket, and threads (T) per core on the nodes
- AVAIL_FEATURES gives the node features which can be used as “Constraint”
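The A/I/O/T field is easy to split in a script, for example to find partitions that currently have idle nodes. The sketch below is an illustration only: `partitions_with_idle` is a hypothetical helper, and it assumes input lines of the form `PARTITION A/I/O/T`, such as those produced by the standard `sinfo -h -o "%P %F"`. Whether plist offers an equivalent output format is not stated here.

```shell
#!/bin/sh
# partitions_with_idle: hypothetical helper, not part of SLURM or plist.
# Reads "PARTITION A/I/O/T" lines on stdin and prints each partition
# that has at least one idle node, followed by its idle-node count.
partitions_with_idle() {
    while read -r part counts; do
        idle=$(echo "$counts" | cut -d/ -f2)   # second field is "idle"
        if [ "$idle" -gt 0 ]; then
            echo "$part $idle"
        fi
    done
    return 0
}

# Possible usage (sinfo marks the default partition with a trailing "*"):
#   sinfo -h -o "%P %F" | partitions_with_idle
```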