Partitions
A partition is a set of compute nodes grouped logically by their hardware features. The tables below show the available partitions and their properties / features in the HPC2021 and AI-Research systems respectively.
For AI-Research System
Partition | Default / Max Job Duration | # of Nodes | Cores per Node | RAM (GB) per Node | RAM (GB) per Core | Features |
---|---|---|---|---|---|---|
debug | 4 Days | 1 | 256 | 1024 | 4 | EPYC7742 |
For HPC2021 System
Partition | Default / Max Job Duration | # of Nodes | Cores per Node | RAM (GB) per Node | RAM (GB) per Core | Features |
---|---|---|---|---|---|---|
intel (default) | 1 Day / 1 Week | 84 | 32 | 192 | 6 | GOLD6626R |
amd | 1 Day / 1 Week | 28 | 64 | 256 | 4 | EPYC7542 |
 | | 28 | 128 | 512 | 4 | EPYC7742 |
 | | 10 | 192 | 768 | 4 | EPYC9654 |
gpu | 1 Day / 1 Week | 4 | 32 | 384 | 12 | 4x V100 |
 | | 3 | 32 | 384 | 12 | 8x V100 |
l40s | 1 Day / 1 Week | 5 | 64 | 512 | 8 | 8x L40S |
 | | 15 | 48 | 512 | 10.6 | 4x L40S |
hugemem | 1 Day / 1 Week | 2 | 128 | 2048 | 16 | EPYC7742 + 2TB RAM |
condo_amd / c_mehpc3* | 1 Day / 1 Week | 8 | 128 | 512 | 4 | EPYC7742 |
condo_amd / c_foss_amd* | 1 Day / 1 Week | 8 | 192 | 768 | 4 | EPYC9654 |
condo_gpu / c_foss_gpu* | 1 Day / 1 Week | 2 | 64 | 1024 | 16 | 10x L40 |
* Partitions starting with "c_" are reserved for the owners of the machines. Normal users should use either "condo_amd" or "condo_gpu". Jobs in the "condo_" partitions will be re-queued if the owners' jobs have been queued for 1 hour.
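A partition is selected with the `--partition` option in a Slurm batch script. The sketch below targets the amd partition; the job name and resource figures are illustrative assumptions, not recommended values:

```shell
#!/bin/bash
#SBATCH --job-name=amd-test        # illustrative job name (an assumption)
#SBATCH --partition=amd            # target the amd partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64       # one task per core on a 64-core EPYC7542 node
#SBATCH --time=1-00:00:00          # 1 day (the default); the maximum is 1 week

# Outside Slurm the #SBATCH lines are plain comments, so the body also runs under bash.
msg="Running in partition ${SLURM_JOB_PARTITION:-amd}"
echo "$msg"
```

Submit the script with `sbatch`; omitting `--partition` sends the job to the default partition (intel on HPC2021).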
Quality of Service (QoS)
Each QoS is assigned a set of limits applied to the job, dictating which partitions the job may use and how many resources it may request. The tables below show the available QoS in the AI-Research and HPC2021 systems and their allowed partitions / resource limits.
For AI-Research System
QoS | Supported Partition(s) | Max Job Duration | Max Resources per Job |
---|---|---|---|
debug (default) | debug | 4 Days | |
For HPC2021 System
QoS | Supported Partition(s) | Max Job Duration | Max Resources per Job |
---|---|---|---|
debug | intel, amd, gpu, l40s, condo_amd | 30min | 2 nodes, 2 GPUs |
normal (default) | intel, amd, condo_amd | 1 Week | 1024 cores |
long | intel, amd, condo_amd | 2 Weeks | 1 node |
^ special | intel, amd | 1 Day | 2048 cores |
^ gpu | gpu, l40s | 1 Week | 1 node, 4 GPUs |
^ hugemem | hugemem | 1 Week | 1 node, 2TB RAM |
^ Requires special approval
Users are advised to specify a QoS suited to the job's requirements.
- For jobs that support parallel computing across multiple nodes (e.g. via MPI), the "normal" QoS is desirable, as the job may request a large number of CPU cores (up to 1024).
- For serial or multi-threaded (e.g. OpenMP) jobs that can only run on a single node and are expected to take a longer time, the "long" QoS is preferable, as the job may request a node for an extended duration (up to two weeks).
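As a sketch of the second case, a single-node OpenMP job could request the "long" QoS as follows (the 32-core figure matches an intel node from the partition table; the job body is illustrative):

```shell
#!/bin/bash
#SBATCH --partition=intel          # intel is the default partition
#SBATCH --qos=long                 # "long" QoS: single node, up to 2 weeks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32         # all 32 cores of an intel node
#SBATCH --time=14-00:00:00         # two weeks, the maximum under "long"

# Match the OpenMP thread count to the allocated cores
# (falls back to 32 when run outside Slurm for testing).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
echo "OpenMP threads: $OMP_NUM_THREADS"
```

The QoS is set with `--qos`; jobs omitting it run under the default QoS ("normal" on HPC2021, "debug" on AI-Research).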