SLURM Partitions & QoS – Research Computing, HKU ITS

Partitions

A partition is a logical group of compute nodes that share the same hardware features. The tables below list the available partitions and their properties on the AI-Research and HPC2021 systems.

For AI-Research System

Partition	Default / Max Job duration	# of nodes	cores per node	RAM(GB) per node	RAM(GB) per core	Features
debug	4 Days	1	256	1024	4	EPYC7742

For HPC2021 System

Partition	Default / Max Job duration	# of nodes	cores per node	RAM(GB) per node	RAM(GB) per core	Features
intel (default)	1 Day / 1 Week	84	32	192	6	GOLD6226R
amd	1 Day / 1 Week	28	64	256	4	EPYC7542
		28	128	512	4	EPYC7742
		10	192	768	4	EPYC9654
gpu	1 Day / 1 Week	4	32	384	12	4x V100
		3	32	384	12	8x V100
l40s	1 Day / 1 Week	5	64	512	8	8x L40S
		15	48	512	10.6	4x L40S
hugemem	1 Day / 1 Week	2	128	2048	16	EPYC7742 + 2TB RAM
condo_amd/c_mehpc3*	1 Day / 1 Week	8	128	512	4	EPYC7742
condo_amd/c_foss_amd*	1 Day / 1 Week	8	192	768	4	EPYC9654
condo_gpu/c_foss_gpu*	1 Day / 1 Week	2	64	1024	16	10x L40
		1	64	1024	16	8x L40

* Partitions starting with “c_” are reserved for the owners of the machines. Normal users should use either “condo_amd” or “condo_gpu”. Jobs in the “condo_” partitions will be re-queued once an owner’s job has been queued for 1 hour.
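As a rough illustration of how the partition table might be used, the batch script below requests one 128-core EPYC7742 node in the “amd” partition. It assumes that the names in the Features column are exposed as Slurm node features that can be selected with --constraint; the job name and the echo payload are placeholders, not part of the original documentation.

```shell
#!/bin/bash
#SBATCH --job-name=amd-example     # placeholder job name
#SBATCH --partition=amd            # partition name from the HPC2021 table
#SBATCH --constraint=EPYC7742      # assumption: Features column = Slurm node features
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128      # one task per core on a 128-core node
#SBATCH --time=1-00:00:00          # 1 day; the table lists 1 week as the maximum

# Slurm exports these variables inside a job. The fallbacks after ":-"
# only apply when the script is run by hand, outside Slurm.
partition="${SLURM_JOB_PARTITION:-amd}"
nodelist="${SLURM_JOB_NODELIST:-localhost}"

echo "partition=${partition} nodes=${nodelist}"
```

Such a script would typically be submitted with sbatch followed by the script file name. Without --constraint, Slurm is free to place the job on any node type in the partition.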

Quality of Service (QoS)

Each QoS carries a set of limits that apply to a job, determining which partitions and how many resources the job may request. The tables below list the available QoS on the AI-Research and HPC2021 systems, with their supported partitions and resource limits.

For AI-Research System

QoS	Supported Partition(s)	Max Job Duration	Max Resources per job
debug (default)	debug	4 days

For HPC2021 System

QoS	Supported Partition(s)	Max Job Duration	Max Resources per job
debug	intel, amd, gpu, l40s, condo_amd	30min	2 nodes, 2 GPUs
normal (default)	intel, amd, condo_amd	1 Week	1024 cores
long	intel, amd, condo_amd	2 Weeks	1 node
^ special	intel, amd	1 Day	2048 cores
^ gpu	gpu, l40s	1 Week	1 node, 4 GPUs
^ hugemem	hugemem	1 Week	1 node, 2TB RAM

^ Requires special approval
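To show how a QoS from the table pairs with a partition, here is a minimal sketch of a short GPU test job under the “debug” QoS, which the table lists as valid for the “gpu” partition with a 30-minute limit. It assumes GPUs are configured as the Slurm generic resource named "gpu"; the job name and the payload are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-debug       # placeholder job name
#SBATCH --partition=gpu
#SBATCH --qos=debug                # debug QoS: at most 30 minutes, 2 nodes, 2 GPUs
#SBATCH --nodes=1
#SBATCH --gres=gpu:1               # assumption: GPUs are the generic resource "gpu"
#SBATCH --time=00:30:00            # the debug QoS maximum from the table

# With GPU generic resources, Slurm normally sets CUDA_VISIBLE_DEVICES
# for the job; the fallback keeps the script runnable outside Slurm.
qos_limit_minutes=30
gpus="${CUDA_VISIBLE_DEVICES:-none}"

echo "debug run, limit ${qos_limit_minutes} min, visible GPUs: ${gpus}"
```

Requesting more time or more GPUs than the chosen QoS allows would be expected to leave the job pending or rejected, depending on how the cluster is configured.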

Users are advised to specify a QoS that suits the job’s requirements.

For parallel jobs that use computing resources across multiple nodes (e.g. via MPI), the “normal” QoS is the appropriate choice, as it allows a job to request up to 1024 CPU cores.
For serial or multi-threaded (OpenMP) jobs that can run only on a single node and are expected to run for a long time, the “long” QoS is preferable, as it allows a single-node job to run for up to two weeks.
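The two cases above can be sketched as sbatch command lines. The snippet below only prints the commands (a dry run) rather than submitting anything; the script names mpi_job.sh and omp_job.sh are hypothetical, and the node counts and durations are example values chosen to stay within the limits in the QoS table.

```shell
#!/bin/bash
# Placeholder script names; replace with real job scripts.
mpi_script="mpi_job.sh"
omp_script="omp_job.sh"

# Multi-node MPI job under the "normal" QoS:
# 4 nodes x 32 cores = 128 cores, within the 1024-core limit.
mpi_cmd="sbatch --partition=intel --qos=normal --nodes=4 --ntasks-per-node=32 --time=3-00:00:00 ${mpi_script}"

# Single-node OpenMP job under the "long" QoS:
# 1 task with 64 threads on one node, for up to 14 days.
omp_cmd="sbatch --partition=amd --qos=long --nodes=1 --ntasks=1 --cpus-per-task=64 --time=14-00:00:00 ${omp_script}"

# Dry run: print the commands instead of submitting them.
echo "$mpi_cmd"
echo "$omp_cmd"
```

Options given on the sbatch command line take precedence over #SBATCH directives inside the script. For the OpenMP case, the job script would typically set OMP_NUM_THREADS from SLURM_CPUS_PER_TASK so that the thread count matches the allocation.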