SLURM Job script
To execute a program on the cluster system, a user has to write a batch script and submit it to the SLURM job scheduler. Samples of general SLURM scripts are located in each user’s hpc2021 home directory ~/slurm-samples, and the user guides for individual software can be referenced.
Sample job script
The example job script (script.cmd) below requests the following resources and specifies the actual programs/commands to be executed.
- Name the job as “pilot_study” for easy reference.
- Request for notifications to be emailed to email@example.com when job starts, ends or fails.
- Request for the “amd” partition (i.e. general compute nodes with AMD CPUs).
- Request for “normal” QoS.
- Request for allocation of 64 CPU cores, all from a single compute node.
- Request for 10GB physical RAM.
- Request for a job execution time of 3 days and 10 hours (the job will be terminated by SLURM after the specified amount of time, whether it has finished or not).
- Write the standard output and standard error to the files “pilot_study_2021.out” and “pilot_study_2021.err” respectively, under the folder where the job is submitted. The path supports the use of replacement symbols.
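Putting the requests above together, script.cmd might look like the sketch below. The module-free layout and the program name at the end are placeholders for illustration, not part of the original example.

```shell
#!/bin/bash
#SBATCH --job-name=pilot_study            # job name for easy reference
#SBATCH --mail-user=email@example.com     # where notifications are emailed
#SBATCH --mail-type=BEGIN,END,FAIL        # notify when the job starts, ends or fails
#SBATCH --partition=amd                   # general compute nodes with AMD CPUs
#SBATCH --qos=normal                      # "normal" QoS
#SBATCH --nodes=1                         # all cores from a single compute node
#SBATCH --ntasks=64                       # 64 CPU cores in total
#SBATCH --mem=10G                         # 10GB physical RAM
#SBATCH --time=3-10:00:00                 # 3 days and 10 hours of walltime
#SBATCH --output=pilot_study_2021.out     # standard output file
#SBATCH --error=pilot_study_2021.err      # standard error file

# Commands below run after the resources are allocated
./my_program                              # placeholder program
```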
SLURM Job Directives
A SLURM script includes a list of SLURM job directives at the top of the file, where each line starts with
#SBATCH followed by option name to value pairs that tell the job scheduler the resources the job requests.
|Long Option|Short Option|Default value|Description|
|---|---|---|---|
|--job-name|-J|file name of job script|User-defined name to identify a job|
|--partition|-p|intel|Partition where the job is to be executed|
|--time|-t|24:00:00|Limit on the maximum execution time (walltime) for the job (D-HH:MM:SS). For example, -t 1- is one day, -t 6:00:00 is 6 hours|
|--nodes|-N| |Total number of node(s)|
|--ntasks|-n|1|Number of tasks (MPI workers)|
|--ntasks-per-node| | |Number of tasks per node|
|--cpus-per-task|-c|1|Number of CPUs required per task|
|--mem| | |Amount of memory allocated per node. Different units can be specified using the suffix K, M, G or T|
|--mem-per-cpu| |3G|Amount of memory allocated per CPU core (for multicore jobs). Different units can be specified using the suffix K, M, G or T|
|--constraint|-C| |Nodes with requested features. Multiple constraints may be specified with AND, OR, Matching OR|
|--exclude|-x| |Explicitly exclude certain nodes from the resources granted to the job|
More SLURM directives are available in the official SLURM sbatch documentation.
Running Serial / Single Threaded Jobs using a CPU on a node
Serial or single CPU core jobs are those that can only make use of one CPU on a node. The SLURM batch script below requests a single CPU on a node with the default amount of RAM (i.e. 3GB) for 30 minutes in the default partition (i.e. “intel”).
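A minimal sketch of such a script, assuming the job name and program path are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=serial_job       # placeholder job name
#SBATCH --partition=intel           # default partition
#SBATCH --ntasks=1                  # a single task using one CPU core
#SBATCH --time=0:30:00              # 30 minutes of walltime

# The default 3GB RAM per core applies since no memory option is given
./my_serial_program                 # placeholder serial program
```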
Running Multi-threaded Jobs using multiple CPU cores on a node
For those jobs that can leverage multiple CPU cores on a node by creating multiple threads within a process (e.g. OpenMP), the SLURM batch script below may be used. It requests allocation of one task with 8 CPU cores on a single node and 6GB RAM per core (a total of 6GB x 8 = 48GB RAM on the node) for 1 hour in the default partition (i.e. “intel”) and default QoS (i.e. “normal”).
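A sketch of such an OpenMP job script; the job name and program are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=omp_job          # placeholder job name
#SBATCH --partition=intel           # default partition
#SBATCH --qos=normal                # default QoS
#SBATCH --nodes=1                   # all cores on a single node
#SBATCH --ntasks=1                  # one task (one process)
#SBATCH --cpus-per-task=8           # 8 CPU cores for that task
#SBATCH --mem-per-cpu=6G            # 6GB RAM per core (48GB in total)
#SBATCH --time=1:00:00              # 1 hour of walltime

# Spawn one OpenMP thread per allocated core
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program                 # placeholder multi-threaded program
```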
Running MPI jobs using multiple nodes
Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to allow for execution of programs using CPUs on multiple nodes where CPUs across nodes communicate over the network. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. Intel MPI and OpenMPI are available in HPC2021 system and SLURM jobs may make use of either MPI implementations.
❗Requesting multiple nodes and/or loading MPI modules will not necessarily make your code faster; your code must be MPI-aware to use MPI. Even though running a non-MPI code with mpirun might possibly succeed, every core assigned to your job will most likely run the exact same computation, duplicating each other's work and wasting resources.
❗The version of the MPI commands you run must match the version of the MPI library used to compile your code, or your job is likely to fail. The version of the MPI daemons started on all the nodes of your job must also match. For example, an MPI program compiled with the Intel MPI compilers should be executed using the Intel MPI runtime instead of the Open MPI runtime.
The SLURM batch script below requests allocation of 64 tasks (MPI processes), each using a single core, from two nodes with 3GB RAM per core for 1 hour in the default partition (i.e. “intel”) and default QoS (i.e. “normal”).
This example makes use of all the cores on two 32-core nodes in the “intel” partition. If the same number of tasks (i.e. 64) is requested from the “amd” partition, you should set “--nodes=1” so that all 64 cores are allocated from a single AMD (64-core or 128-core) node. Otherwise, SLURM will assign 64 CPUs from 2 compute nodes, which would induce unnecessary inter-node communication overhead.
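A sketch of such an MPI job script; the module name and program are placeholders (check the modules actually available on the system, and match the runtime to the library the code was compiled with):

```shell
#!/bin/bash
#SBATCH --job-name=mpi_job          # placeholder job name
#SBATCH --partition=intel           # default partition
#SBATCH --qos=normal                # default QoS
#SBATCH --nodes=2                   # two 32-core nodes
#SBATCH --ntasks=64                 # 64 MPI processes, one core each
#SBATCH --mem-per-cpu=3G            # 3GB RAM per core
#SBATCH --time=1:00:00              # 1 hour of walltime

# Load an MPI implementation matching the one used to compile the code
module load impi                    # placeholder module name
mpirun ./my_mpi_program             # placeholder MPI program
```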
Running hybrid OpenMP/MPI jobs using multiple nodes
For those jobs that support both OpenMP and MPI, a SLURM batch script may specify the number of MPI tasks to run and the number of CPU cores that each task should use. The SLURM batch script below requests allocation of 2 nodes and 64 CPU cores in total for 1 hour in the default partition (i.e. “intel”) and default QoS (i.e. “normal”). Each compute node runs 2 MPI tasks, where each MPI task uses 16 CPU cores and each core uses 3GB RAM. This makes use of all the cores on two 32-core nodes in the “intel” partition.
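A sketch of such a hybrid job script; the module name and program are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=hybrid_job       # placeholder job name
#SBATCH --partition=intel           # default partition
#SBATCH --qos=normal                # default QoS
#SBATCH --nodes=2                   # two 32-core nodes
#SBATCH --ntasks-per-node=2         # 2 MPI tasks per node (4 in total)
#SBATCH --cpus-per-task=16          # 16 CPU cores (OpenMP threads) per MPI task
#SBATCH --mem-per-cpu=3G            # 3GB RAM per core
#SBATCH --time=1:00:00              # 1 hour of walltime

# One OpenMP thread per core allocated to each MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load impi                    # placeholder module name
mpirun ./my_hybrid_program          # placeholder hybrid MPI/OpenMP program
```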
Sample MPI & hybrid MPI/OpenMP codes and the corresponding SLURM scripts are available in the user home directory ~/slurm-samples/demo-MPI/.
Running jobs using GPU
The SLURM batch script below requests 8 CPU cores and 2 GPU cards from one compute node in the “gpu” partition using the “gpu” QoS.
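A sketch of such a GPU job script. The --gres=gpu:2 syntax is the standard SLURM way to request two GPU cards; the walltime, job name and program are placeholders not stated in the text:

```shell
#!/bin/bash
#SBATCH --job-name=gpu_job          # placeholder job name
#SBATCH --partition=gpu             # "gpu" partition
#SBATCH --qos=gpu                   # "gpu" QoS
#SBATCH --ntasks=1                  # one task on one compute node
#SBATCH --cpus-per-task=8           # 8 CPU cores
#SBATCH --gres=gpu:2                # 2 GPU cards
#SBATCH --time=1:00:00              # placeholder walltime

./my_gpu_program                    # placeholder GPU-aware program
```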
❗Your code must be GPU-aware to benefit from nodes with GPUs; otherwise, a partition without GPUs should be used.
Running jobs requiring a large amount of RAM
The SLURM batch script below requests 64 CPU cores (by default, one CPU core per task) on a single node with an AMD EPYC 7742 CPU in the “amd” partition and a total of 300GB RAM. This makes use of 64 cores on a 128-core AMD node. If “--nodes=1” is not defined, the 64 cores may be assigned from separate compute nodes, which may result in a performance drop due to (MPI) inter-node communication overhead, or in unused CPUs if the program does not support multi-node parallelization.
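A sketch of such a large-memory job script; the job name, walltime and program are placeholders, and selecting the EPYC 7742 nodes specifically may require a site-specific --constraint feature name not given in the text:

```shell
#!/bin/bash
#SBATCH --job-name=bigmem_job       # placeholder job name
#SBATCH --partition=amd             # partition with AMD EPYC nodes
#SBATCH --nodes=1                   # keep all cores on a single node
#SBATCH --ntasks=64                 # 64 tasks, one CPU core each
#SBATCH --mem=300G                  # 300GB RAM in total on the node
#SBATCH --time=1:00:00              # placeholder walltime
# A --constraint flag may be needed to pick EPYC 7742 nodes specifically;
# the feature name is site-specific (check the cluster documentation).

./my_program                        # placeholder program
```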
When the requested resources are not available and/or the limits in the QoS are exceeded, a submitted job is put in the pending state. Once a pending job becomes eligible to execute in the cluster, SLURM allocates the requested resources to the job for the duration of the requested walltime, and any commands after the last SLURM directive (#SBATCH) in the script file are executed.