Partitions
A partition is a set of compute nodes grouped logically by their hardware features. The tables below show the available partitions and their properties / features in the AI-Research and HPC2021 systems respectively.
For AI-Research System
| Partition | Default / Max Job duration | # of nodes | cores per node | RAM(GB) per node | RAM(GB) per core | Features |
|---|---|---|---|---|---|---|
| debug | 4 Days | 1 | 256 | 1024 | 4 | 8x A100 (40GB) |
For HPC2021 System
| Partition | Default / Max Job duration | # of nodes | cores per node | RAM(GB) per node | RAM(GB) per core | Features |
|---|---|---|---|---|---|---|
| intel (default) | 1 Day / 1 Week | 84 | 32 | 192 | 6 | GOLD6626R |
| amd | 1 Day / 1 Week | 28 | 64 | 256 | 4 | EPYC7542 |
| | | 28 | 128 | 512 | 4 | EPYC7742 |
| | | 10 | 192 | 768 | 4 | EPYC9654 |
| gpu | 1 Day / 1 Week | 4 | 32 | 384 | 12 | 4x V100 |
| | | 3 | 32 | 384 | 12 | 8x V100 |
| l40s | 1 Day / 1 Week | 5 | 64 | 512 | 8 | 8x L40S |
| | | 15 | 48 | 512 | 10.6 | 4x L40S |
| hugemem | 1 Day / 1 Week | 2 | 128 | 2048 | 16 | EPYC7742 + 2TB RAM |
| condo_amd/c_mehpc3* | 1 Day / 1 Week | 8 | 128 | 512 | 4 | EPYC7742 |
| condo_amd/c_foss_amd* | 1 Day / 1 Week | 8 | 192 | 768 | 4 | EPYC9654 |
| condo_gpu/c_foss_gpu* | 1 Day / 1 Week | 2 | 64 | 1024 | 16 | 10x L40 |
| | | 1 | 64 | 1024 | 16 | 8x L40 |
* Partitions whose names start with “c_” are reserved for the owners of the machines; normal users should use either “condo_amd” or “condo_gpu”. Jobs in the “condo_” partitions will be re-queued if the owners’ jobs have been waiting in the queue for 1 hour.
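As a sketch of how a partition is selected, a Slurm batch script might contain directives like the following. Only the `--partition` value and the node sizes come from the table above; the job name, resource counts, and program name are illustrative assumptions:

```shell
#!/bin/bash
# Illustrative Slurm batch script requesting the "amd" partition.
#SBATCH --job-name=my_job        # hypothetical job name
#SBATCH --partition=amd          # partition name from the table above
#SBATCH --nodes=1                # a single AMD node
#SBATCH --ntasks-per-node=64     # up to 64 cores on an EPYC7542 node
#SBATCH --time=1-00:00:00        # 1 day (the default; max is 1 week)

./my_program                     # hypothetical executable
```

Submitting with `sbatch script.sh` then places the job in the chosen partition's queue.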
Quality of Service (QoS)
Each QoS is assigned a set of limits applied to the job, dictating which partitions a job may use and how many resources it may request. The tables below show the available QoS in the AI-Research and HPC2021 systems and their allowed partitions / resource limits.
For AI-Research System
| QoS | Supported Partition(s) | Max Job Duration | Max Resources per job |
|---|---|---|---|
| normal (default) | debug | 4 days | 40 cores, 400GB RAM, 2 GPUs |
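On the AI-Research system, GPUs are requested with the `--gres` directive. A minimal sketch, staying within the 40-core / 400GB / 2-GPU per-job limit above (the job script and exact directive values are illustrative assumptions):

```shell
#!/bin/bash
# Illustrative batch script for the AI-Research "debug" partition.
#SBATCH --partition=debug        # the AI-Research partition
#SBATCH --qos=normal             # default QoS
#SBATCH --gres=gpu:2             # up to 2 A100 GPUs per job
#SBATCH --cpus-per-task=40       # per-job core limit
#SBATCH --mem=400G               # per-job RAM limit
#SBATCH --time=4-00:00:00        # max 4 days

python train.py                  # hypothetical training script
```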
For HPC2021 System
| QoS | Supported Partition(s) | Max Job Duration | Max Resources per job |
|---|---|---|---|
| debug | intel, amd, gpu, l40s, condo_amd | 30min | 2 nodes, 2 GPUs |
| normal (default) | intel, amd, condo_amd | 1 Week | 1024 cores |
| long | intel, amd, condo_amd | 2 Weeks | 1 node |
| ^ special | intel, amd | 1 Day | 2048 cores |
| ^ gpu | gpu, l40s | 1 Week | 1 node, 4 GPUs |
| ^ hugemem | hugemem | 1 Week | 1 node, 2TB RAM |
^ Requires special approval
Users are advised to specify a suitable QoS according to the job’s requirements.
- For jobs that support parallel computing across multiple nodes (e.g. via MPI), the “normal” QoS is preferable, as the job may request a large number of CPU cores.
- For serial or multi-threaded (OpenMP) jobs that can only run on a single node and are expected to run for a long time, the “long” QoS is preferable, as the job may hold a node for a longer duration (up to two weeks).
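The two cases above might be expressed as follows; the node and core counts are illustrative, only the `--qos` values and their limits come from the QoS table:

```shell
# Case 1: multi-node MPI job under the default "normal" QoS
# (may request up to 1024 cores in total).
#SBATCH --qos=normal
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64

# Case 2: single-node OpenMP job under the "long" QoS
# (limited to 1 node, but may run for up to two weeks).
#SBATCH --qos=long
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=14-00:00:00
```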
