SLURM Job script

To execute a program on the cluster system, a user has to write a batch script and submit it to the SLURM job scheduler. Samples of general SLURM scripts are located in each user’s HPC2021 home directory (~/slurm-samples), and the user guides for individual software can be referenced.

Sample job script

The example job script (script.cmd) below requests the following resources and specifies the actual programs/commands to be executed:

  1. Name the job “pilot_study” for easy reference.
  2. Request notifications to be emailed to tmchan@hku.hk when the job starts, ends or fails.
  3. Request the “amd” partition (i.e. general compute nodes with AMD CPUs).
  4. Request the “normal” QoS.
  5. Request allocation of 64 CPU cores, all from a single compute node.
  6. Request 10GB of physical RAM.
  7. Request a job execution time limit of 3 days and 10 hours (the job is terminated by SLURM once the specified time elapses, whether it has finished or not).
  8. Write the standard output and standard error to the files “pilot_study_<job_id>.out” and “pilot_study_<job_id>.err” respectively, under the folder from which the job is submitted. The paths use the replacement symbols %x (job name) and %j (job ID).
#!/bin/bash
#SBATCH --job-name=pilot_study        # 1. Job name
#SBATCH --mail-type=BEGIN,END,FAIL    # 2. Send email upon events (Options: NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=tmchan@hku.hk     #    Email address to receive notification
#SBATCH --partition=amd               # 3. Request a partition
#SBATCH --qos=normal                  # 4. Request a QoS
#SBATCH --ntasks=64                   # 5. Request total number of tasks (MPI workers)
#SBATCH --nodes=1                     #    Request number of node(s)
#SBATCH --mem=10G                     # 6. Request total amount of RAM
#SBATCH --time=3-10:00:00             # 7. Job execution duration limit day-hour:min:sec
#SBATCH --output=%x_%j.out            # 8. Standard output log as $job_name_$job_id.out
#SBATCH --error=%x_%j.err             #    Standard error log as $job_name_$job_id.err
 
# print the start time
date
command1 ...
command2 ...
command3 ...
# print the end time
date
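Once saved (here as script.cmd), the script is submitted to the scheduler with the sbatch command; a minimal sketch:

sbatch script.cmd

SLURM replies with the assigned job ID (e.g. “Submitted batch job 123456”, where 123456 is just an illustrative number); this ID is what the %j replacement symbol expands to in the output and error file names.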

SLURM Job Directives

A SLURM script includes a list of SLURM job directives at the top of the file, where each line starts with #SBATCH followed by option/value pairs that tell the job scheduler what resources the job requests.

Long Option | Short Option | Default Value | Description
--job-name | -J | file name of job script | User-defined name to identify a job
--partition | -p | intel | Partition where the job is to be executed
--time | -t | 24:00:00 | Limit on the maximum execution time (walltime) of the job, in D-HH:MM:SS. For example, -t 1- is one day and -t 6:00:00 is 6 hours
--nodes | -N | | Total number of node(s)
--ntasks | -n | 1 | Number of tasks (MPI workers)
--ntasks-per-node | | | Number of tasks per node
--cpus-per-task | -c | 1 | Number of CPUs required per task
--mem | | | Amount of memory allocated per node. Different units can be specified using the suffix [K|M|G|T]
--mem-per-cpu | | 3G | Amount of memory allocated per CPU core (for multicore jobs). Different units can be specified using the suffix [K|M|G|T]
--constraint | -C | | Request nodes with specific features. Multiple constraints may be combined with AND, OR and matching OR. For example, --constraint="CPU_MNF:AMD" or --constraint="CPU_MNF:INTEL&CPU_GEN:CLX"
--exclude | -x | | Explicitly exclude certain nodes from the resources granted to the job. For example, --exclude=SPG-2-[1-3] or --exclude=SPG-2-1,SPG-2-2,SPG-2-3

More SLURM directives are described in the official SLURM sbatch documentation (e.g. man sbatch).
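As a quick illustration of the table above, the long and short option forms are interchangeable; the following sketch (job name and values are arbitrary examples) uses short options:

#!/bin/bash
#SBATCH -J test_job       # same as --job-name=test_job
#SBATCH -p intel          # same as --partition=intel
#SBATCH -n 4              # same as --ntasks=4
#SBATCH -t 6:00:00        # same as --time=6:00:00

command1 ...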

Running Serial / Single Threaded Jobs using a CPU on a node

Serial or single-CPU-core jobs are jobs that can only make use of one CPU on a node. The SLURM batch script below requests a single CPU core on a node with the default amount of RAM (i.e. 3GB) for 30 minutes in the default partition (i.e. “intel”).

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
command1...
Running Multi-threaded Jobs using multiple CPU cores on a node

For jobs that can leverage multiple CPU cores on a node by creating multiple threads within a process (e.g. OpenMP), the SLURM batch script below may be used. It requests allocation of one task with 8 CPU cores on a single node and 6GB RAM per core (6GB x 8 = 48GB RAM in total on the node) for 1 hour in the default partition (i.e. “intel”) and default QoS (i.e. “normal”).

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=6G
#SBATCH --time=01:00:00

# For jobs supporting OpenMP, assign the value of the requested CPU cores to the OMP_NUM_THREADS variable
# that would be automatically passed to your command supporting OpenMP
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
command1 ...

# For jobs not supporting OpenMP, supply the value of the requested CPU cores as command-line argument to the command
command2 -t ${SLURM_CPUS_PER_TASK} ...
Running MPI jobs using multiple nodes

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to allow programs to run using CPUs on multiple nodes, with the CPUs on different nodes communicating over the network. The MPI standard defines the syntax and semantics of library routines useful to a wide range of users writing portable message-passing programs in C, C++ and Fortran. Intel MPI and Open MPI are available on the HPC2021 system, and SLURM jobs may make use of either MPI implementation.

❗Requesting multiple nodes and/or loading an MPI module does not necessarily make your code faster; your code must be MPI-aware to use MPI. Even though running non-MPI code with mpirun may appear to succeed, every core assigned to your job will most likely run the exact same computation, duplicating each other’s work and wasting resources.

The version of the MPI commands you run must match the version of the MPI library used to compile your code, or your job is likely to fail. Likewise, the MPI daemons started on all the nodes of your job must match. For example, an MPI program compiled with the Intel MPI compilers should be executed with the Intel MPI runtime rather than the Open MPI runtime.
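For illustration, a hedged sketch of compiling and running a C MPI program entirely within the Intel MPI environment (the source file program_mpi.c is a placeholder, and the exact compiler wrapper name may differ between Intel MPI releases):

module load impi/2021.4
mpiicc -o program_mpi program_mpi.c    # compile with the Intel MPI C compiler wrapper
mpirun ./program_mpi                   # run with the matching Intel MPI runtime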

The SLURM batch script below requests allocation of 64 tasks (MPI processes), each using a single core, spread over two nodes with 3GB RAM per core, for 1 hour in the default partition (i.e. “intel”) and default QoS (i.e. “normal”).

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3G
#SBATCH --time=01:00:00

cd ${SLURM_SUBMIT_DIR}
# Load the environment for Intel MPI
module load impi/2021.4

# run the program supporting MPI with the "mpirun" command
# The -n option is not required since mpirun automatically determines the number of processes from the SLURM settings
mpirun ./program_mpi

This example makes use of all the cores on two 32-core nodes in the “intel” partition. If the same number of tasks (i.e. 64) is requested from the “amd” partition, you should set “--nodes=1” so that all 64 cores are allocated from a single AMD (64-core or 128-core) node, as sketched below. Otherwise, SLURM may assign the 64 CPUs across 2 compute nodes, which would induce unnecessary inter-node communication overhead.
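For reference, a sketch of the equivalent 64-task request on the “amd” partition (assuming the same Intel MPI module and program as above):

#!/bin/bash
#SBATCH --partition=amd
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3G
#SBATCH --time=01:00:00

cd ${SLURM_SUBMIT_DIR}
module load impi/2021.4
mpirun ./program_mpi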

 

Running hybrid OpenMP/MPI jobs using multiple nodes

For jobs that support both OpenMP and MPI, a SLURM batch script may specify the number of MPI tasks to run and the number of CPU cores each task should use. The SLURM batch script below requests allocation of 2 nodes and 64 CPU cores in total for 1 hour in the default partition (i.e. “intel”) and default QoS (i.e. “normal”). Each compute node runs 2 MPI tasks, each MPI task uses 16 CPU cores, and each core uses 3GB RAM. This makes use of all the cores on two 32-core nodes in the “intel” partition.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=3G
#SBATCH --time=01:00:00

cd ${SLURM_SUBMIT_DIR}
# Load the environment for Intel MPI
module load impi/2021.4

# assign the value of the requested CPU cores per task to the OMP_NUM_THREADS variable
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# run the program supporting MPI with the "mpirun" command.
# The -n option is not required since mpirun automatically determines the number of processes from the SLURM settings
mpirun ./program_mpi-omp

Sample MPI and hybrid MPI/OpenMP codes and the corresponding SLURM scripts are available in the user home directory at ~/slurm-samples/demo-MPI/.

 

Running jobs using GPU

The SLURM batch script below requests 8 CPU cores and 2 GPU cards from one compute node in the “gpu” partition using the “gpu” QoS.
❗Your code must be GPU-aware to benefit from nodes with GPUs; otherwise, a partition without GPUs should be used.
❗The L40/L40S has only 1/5 of the double-precision performance of the V100, but 3x its single/half-precision performance.
❗You are advised to add a SLURM constraint to specify which type of GPU you want to use (a Volta GPU [V100] will be auto-selected if no constraint is provided).

GPU | SLURM constraint
V100 | "GPU_GEN:VLT"
L40/L40S | "GPU_GEN:ADA"
Both | "GPU_BRD:TESLA"
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:2
#SBATCH --constraint="GPU_GEN:VLT"

# Load the environment module for Nvidia CUDA
module load cuda
gpu_program gpu=y ...
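To confirm inside the job that the requested GPUs are visible, the nvidia-smi utility (provided by the GPU driver on the compute node) can be run before the main program; a minimal sketch:

# Optional check: list the GPU devices allocated to this job
nvidia-smi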
Running jobs requiring a large amount of RAM

The SLURM batch script below requests 64 CPU cores (by default, one CPU core per task) on a single node with an AMD EPYC 7742 CPU in the “amd” partition and a total of 300GB RAM. This would make use of 64 cores on a 128-core AMD node. If “--nodes=1” is not specified, the 64 cores may be assigned from separate compute nodes, which can result in a performance drop due to (MPI) inter-node communication overhead, or in unused CPUs if the program does not support multi-node parallelization.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --partition=amd
#SBATCH --mem=300G
#SBATCH --constraint="CPU_SKU:7742"
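
# Commands for the memory-intensive program go here (placeholder), e.g.
command1 ...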

When the requested resources are not available and/or the limits of the QoS are exceeded, a submitted job is put in the pending state. Once a pending job becomes eligible to execute on the cluster, SLURM allocates the requested resources to the job for the duration of the requested wall time, and any commands placed after the last SLURM directive (#SBATCH) in the script file are executed.
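The state of a submitted job (e.g. pending or running) can be checked with the squeue command; a minimal sketch:

squeue -u $USER

Jobs shown with state PD are pending for resources, while R indicates they are running.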