Partitions

A partition is a set of compute nodes grouped logically by their hardware features. The tables below list the available partitions and their properties on the AI-Research and HPC2021 systems.

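For AI-Research System

| Partition | Default / Max job duration | Nodes | Cores per node | RAM per node (GB) | RAM per core (GB) | Features |
|-----------|----------------------------|-------|----------------|-------------------|-------------------|----------|
| debug     | 4 Days                     | 1     | 256            | 1024              | 4                 | EPYC7742 |

For HPC2021 System

| Partition             | Default / Max job duration | Nodes | Cores per node | RAM per node (GB) | RAM per core (GB) | Features           |
|-----------------------|----------------------------|-------|----------------|-------------------|-------------------|--------------------|
| intel (default)       | 1 Day / 1 Week             | 84    | 32             | 192               | 6                 | GOLD6626R          |
| amd                   | 1 Day / 1 Week             | 28    | 64             | 256               | 4                 | EPYC7542           |
| amd                   | 1 Day / 1 Week             | 28    | 128            | 512               | 4                 | EPYC7742           |
| amd                   | 1 Day / 1 Week             | 10    | 192            | 768               | 4                 | EPYC9654           |
| gpu                   | 1 Day / 1 Week             | 4     | 32             | 384               | 12                | 4x V100            |
| gpu                   | 1 Day / 1 Week             | 3     | 32             | 384               | 12                | 8x V100            |
| gpu                   | 1 Day / 1 Week             | 2     | 64             | 512               | 8                 | 8x L40S            |
| hugemem               | 1 Day / 1 Week             | 2     | 128            | 2048              | 16                | EPYC7742 + 2TB RAM |
| condo_amd/c_mehpc3*   | 1 Day / 1 Week             | 8     | 128            | 512               | 4                 | EPYC7742           |
| condo_amd/c_foss_amd* | 1 Day / 1 Week             | 8     | 192            | 768               | 4                 | EPYC9654           |
| condo_gpu/c_foss_gpu* | 1 Day / 1 Week             | 2     | 64             | 1024              | 16                | 10x L40            |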

* Partitions starting with “c_” are reserved for the owners of the machines. Regular users should use either “condo_amd” or “condo_gpu”.
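
The RAM-per-core column is simply RAM per node divided by cores per node, and it is often the deciding factor when picking a node type. The following is a minimal sketch, not an official tool: it transcribes the non-condo HPC2021 rows into Python and filters them by a memory-per-core requirement. The names `NodeType`, `HPC2021_NODE_TYPES`, `node_types_with_ram_per_core` and `total_cores` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    partition: str
    nodes: int
    cores_per_node: int
    ram_gb_per_node: int
    features: str

    @property
    def ram_gb_per_core(self) -> float:
        # Derived value: matches the "RAM per core" column of the table.
        return self.ram_gb_per_node / self.cores_per_node


# HPC2021 rows transcribed from the partition table (condo rows omitted).
HPC2021_NODE_TYPES = [
    NodeType("intel", 84, 32, 192, "GOLD6626R"),
    NodeType("amd", 28, 64, 256, "EPYC7542"),
    NodeType("amd", 28, 128, 512, "EPYC7742"),
    NodeType("amd", 10, 192, 768, "EPYC9654"),
    NodeType("gpu", 4, 32, 384, "4x V100"),
    NodeType("gpu", 3, 32, 384, "8x V100"),
    NodeType("gpu", 2, 64, 512, "8x L40S"),
    NodeType("hugemem", 2, 128, 2048, "EPYC7742 + 2TB RAM"),
]


def node_types_with_ram_per_core(min_gb: float) -> list[NodeType]:
    """Return the node types offering at least `min_gb` GB of RAM per core."""
    return [n for n in HPC2021_NODE_TYPES if n.ram_gb_per_core >= min_gb]


def total_cores(partition: str) -> int:
    """Sum cores over every node type listed for one partition."""
    return sum(
        n.nodes * n.cores_per_node
        for n in HPC2021_NODE_TYPES
        if n.partition == partition
    )


if __name__ == "__main__":
    for node_type in node_types_with_ram_per_core(12):
        print(node_type.partition, node_type.features, node_type.ram_gb_per_core)
    print("amd cores in total:", total_cores("amd"))
```

A job that needs 12 GB or more per core would, under this reading of the table, fit only the V100 GPU nodes and the hugemem nodes among the non-condo partitions.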


Quality of Service (QoS)

Each QoS carries a set of limits that apply to a job, determining which partitions and how many resources the job may request. The tables below list the available QoS on the AI-Research and HPC2021 systems, together with their allowed partitions and resource limits.

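For AI-Research System

| QoS              | Supported partition(s) | Max job duration | Max resources per job |
|------------------|------------------------|------------------|-----------------------|
| debug (default)  | debug                  | 4 Days           |                       |

For HPC2021 System

| QoS              | Supported partition(s) | Max job duration | Max resources per job |
|------------------|------------------------|------------------|-----------------------|
| debug            | intel, amd, gpu        | 30 minutes       | 2 nodes, 2 GPUs       |
| normal (default) | intel, amd             | 1 Week           | 1024 cores            |
| long             | intel, amd             | 2 Weeks          | 1 node                |
| ^ special        | intel, amd             | 1 Day            | 2048 cores            |
| ^ gpu            | gpu                    | 1 Week           | 1 node, 4 GPUs        |
| ^ hugemem        | hugemem                | 1 Week           | 1 node, 2TB RAM       |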

^ Requires special approval.
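
To make the HPC2021 limits concrete, the sketch below encodes them as a Python dictionary and checks a planned request against the chosen QoS before submission. It is an illustration only: `QOS_LIMITS` and `check_request` are hypothetical names, the scheduler itself remains the authority on what is accepted, and the 2TB RAM cap of the “hugemem” QoS is not modelled.

```python
from datetime import timedelta

# Limits transcribed from the HPC2021 QoS table. None means the table lists
# no limit of that kind. The 2TB RAM cap of "hugemem" is not modelled here.
QOS_LIMITS = {
    "debug": {"partitions": {"intel", "amd", "gpu"}, "max_time": timedelta(minutes=30),
              "max_nodes": 2, "max_cores": None, "max_gpus": 2},
    "normal": {"partitions": {"intel", "amd"}, "max_time": timedelta(weeks=1),
               "max_nodes": None, "max_cores": 1024, "max_gpus": None},
    "long": {"partitions": {"intel", "amd"}, "max_time": timedelta(weeks=2),
             "max_nodes": 1, "max_cores": None, "max_gpus": None},
    "special": {"partitions": {"intel", "amd"}, "max_time": timedelta(days=1),
                "max_nodes": None, "max_cores": 2048, "max_gpus": None},
    "gpu": {"partitions": {"gpu"}, "max_time": timedelta(weeks=1),
            "max_nodes": 1, "max_cores": None, "max_gpus": 4},
    "hugemem": {"partitions": {"hugemem"}, "max_time": timedelta(weeks=1),
                "max_nodes": 1, "max_cores": None, "max_gpus": None},
}


def check_request(qos, partition, duration, nodes=1, cores=None, gpus=0):
    """Return a list of human-readable violations (empty if the request fits)."""
    limits = QOS_LIMITS[qos]
    problems = []
    if partition not in limits["partitions"]:
        problems.append(f"QoS '{qos}' does not support partition '{partition}'")
    if duration > limits["max_time"]:
        problems.append(f"duration {duration} exceeds {limits['max_time']}")
    for kind, requested in (("nodes", nodes), ("cores", cores), ("gpus", gpus)):
        cap = limits[f"max_{kind}"]
        # Skip the check when the table has no cap or the caller gave no value.
        if cap is not None and requested is not None and requested > cap:
            problems.append(f"{requested} {kind} exceeds the limit of {cap}")
    return problems


if __name__ == "__main__":
    # A 10-day job on two nodes breaks the one-node limit of "long".
    print(check_request("long", "intel", timedelta(days=10), nodes=2))
```

Returning a list of violations rather than raising on the first one lets a user see every problem with a request at once.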

Users are advised to specify a QoS that suits the job’s requirements.

  • For parallel jobs that use computing resources across multiple nodes (e.g. via MPI), the “normal” QoS is suitable, as it allows a job to request up to 1024 CPU cores.
  • For serial or multi-threaded (OpenMP) jobs that can only run on a single node and are expected to run for a long time, the “long” QoS is preferable, as it allows a job to occupy one node for a longer duration (up to two weeks).
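
The guidance above can be sketched as a small helper that picks between the “normal” and “long” QoS and prints matching batch-script directives. Several things here are assumptions rather than statements from this page: that the scheduler is Slurm (whose `sbatch` accepts `--partition`, `--qos`, `--nodes` and `--time`), that a single-node job of one week or less should simply stay on “normal”, and the helper names `suggest_qos` and `sbatch_header`, which are hypothetical. The 1024-core cap of “normal” is not checked.

```python
def suggest_qos(nodes: int, hours: int) -> str:
    """Pick "normal" or "long" from the node count and run time in hours."""
    if nodes < 1 or hours < 1:
        raise ValueError("nodes and hours must both be at least 1")
    if hours <= 7 * 24:
        # Fits within the one-week limit of "normal" (assumed preferred).
        return "normal"
    if nodes == 1 and hours <= 14 * 24:
        # Single-node job needing more than a week: "long" allows two weeks.
        return "long"
    raise ValueError("neither 'normal' nor 'long' covers this request")


def sbatch_header(partition: str, nodes: int, hours: int) -> str:
    """Render the top of a batch script, assuming Slurm's sbatch options."""
    qos = suggest_qos(nodes, hours)
    # Slurm accepts time limits written as days-hours:minutes:seconds.
    time_limit = f"{hours // 24}-{hours % 24:02d}:00:00"
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --qos={qos}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ])


if __name__ == "__main__":
    # A 10-day OpenMP job on one amd node ends up on the "long" QoS.
    print(sbatch_header("amd", nodes=1, hours=240))
```

A multi-node job that needs more than one week fits neither QoS in this sketch, so the helper raises an error instead of guessing.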