Login to HPC2021 system
HPC2021, a high performance computing (HPC) cluster, is a collection of computers (compute nodes) with fast multi-core CPUs, an ample amount of system memory (RAM) and an array of redundant disk storage, pooled together via a high speed network. The compute nodes execute compute tasks behind the scenes and are not directly accessible to users; instead, we provide a number of login nodes for users to log in and interact with the system for tasks such as job submission and file transfer. All nodes in HPC2021, including login nodes and compute nodes, have access to high performance data storage shared over the network, so that jobs running across nodes may access the same data.
To interact with the HPC system via a terminal (a window where the user types commands), run a secure shell (SSH) client on a local device (e.g. a PC) to connect remotely to any of the login nodes below; all network traffic is encrypted to allow for secure communication. For the sake of data and system security, terminal and file access to the login nodes is only allowed within the HKU campus network, and a prior connection to HKUVPN2FA is needed for off-campus access. A connection example follows the table.
Login node | Role | Usage |
---|---|---|
hpc2021.hku.hk | Head Node | Reserved for file editing, compilation and job submission/management |
hpc2021-io1.hku.hk, hpc2021-io2.hku.hk | IO Node | Reserved for file transfer, file management and data analysis/visualization |
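As a quick illustration, the sketch below connects to the head node and copies a file through an IO node. The username and file name are placeholders only; replace them with your own account name and data.

```bash
# Connect to the head node from a terminal on your local machine
# (replace "username" with your HPC2021 account name; off-campus
# access requires an active HKUVPN2FA connection first)
ssh username@hpc2021.hku.hk

# Transfer files through an IO node, e.g. with scp
scp ./mydata.tar.gz username@hpc2021-io1.hku.hk:/home/username/
```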
Data Storage
Nodes in HPC2021 have access to various kinds of storage designed for different purposes and workloads, summarized below. Users may choose a storage folder appropriate for a particular use case.
Storage Type | $HOME | $WORK | PI Group Share (software) | PI Group Share (lustre) | $TMP_DIR |
---|---|---|---|---|---|
Path | /home/$USER | /scr/u/$USER | /group/$PI_GROUP | /lustre1/g/$PI_GROUP | /tmp |
Usage | Long term, small size | Short term, small size | Long term, software shared between members of a research group | Moderate term, high performance, shared between members of a research group | Short term (for the duration a job is being executed), high performance |
Availability | Available across all nodes in HPC2021 | Available across all nodes in HPC2021 | Available across all nodes in HPC2021 | Available across all nodes in HPC2021 | Available on the attached node only |
Capacity | 100GB per user | 500GB per user | 100GB per PI group | | |
Performance | Moderate. Not appropriate for workloads requiring high throughput or small file operations | Moderate | Moderate | High throughput for large files and IO | High performance, especially for operations on small files |
Clean-up Policy | No scheduled clean-up | No scheduled clean-up | No scheduled clean-up | No scheduled clean-up | Cleaned up immediately upon job termination |
Backup | Daily (offsite, accessible by administrator only) | Nil (be reminded to move important data to $HOME) | Nil (be reminded to move important data to $HOME) | Nil (be reminded to move important data to $HOME) | Nil (be reminded to move important data to $HOME) |
Snapshots | Daily, kept up to 7 days (onsite, user-accessible rollback) | Nil | Weekly, kept up to 4 weeks (onsite, user-accessible rollback) | Nil | Nil |
Good for | Lightly accessed files, such as scripts, programs and raw data for an individual user | Small to medium files, scripts, temporary files | Sharing software among members of a PI group | Large intermediate output files from jobs (especially parallel IO) and sharing of large files among members of a PI group | Large amounts of small temporary files accessed during job execution |
Remarks | Network File System (NFS) on Pure Storage, with scheduled daily snapshots allowing users to roll back files/folders to a previous state | | | PIs can request more storage via ITS form CF162 | Files under /tmp are local and NOT shared over the network, hence not accessible by other nodes in the cluster |
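A minimal sketch of how the paths in the table map onto actual directories is shown below. $USER is a standard shell variable; $PI_GROUP is only the placeholder notation used in the table, so substitute your PI group's name when running the commands.

```bash
# Inspect the storage locations from any login node
echo $HOME              # long-term personal storage, i.e. /home/$USER
ls /scr/u/$USER         # short-term personal work area ($WORK)

# Group shares use your PI group's name in place of $PI_GROUP
ls /group/$PI_GROUP     # shared software for the group
ls /lustre1/g/$PI_GROUP # shared high-performance Lustre storage
```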
Best Practices for Using the Lustre File System (/lustre1)
While the Lustre File System (i.e. /lustre1) provides high bandwidth and low latency for demanding workloads involving large files and large IO operations, operations involving a massive number of small files or repetitive file attribute access cause a drastic drop in performance on such a distributed file system.
Below are some recommendations on best practices for using the Lustre File System that help users get the most out of the storage; a short illustration follows the list.
- Avoid using “-l” option in “ls” command inside /lustre1
- Avoid having a large number of files in a single directory inside /lustre1
- Avoid accessing small files inside /lustre1
- Keep your source code and executables under /home instead of /lustre1
- Avoid repetitive “stat” operations against files inside /lustre1
- Avoid repetitive open/close operations against files inside /lustre1
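The sketch below illustrates two of the recommendations above. Bundling small files into an archive is a common technique rather than an HPC2021-specific requirement, and $PI_GROUP is again a placeholder for your PI group's name.

```bash
# List directory contents without the "-l" flag, which would trigger
# an expensive per-file attribute (stat) lookup on Lustre
ls /lustre1/g/$PI_GROUP/results

# Bundle many small files into a single archive before placing them
# on /lustre1, and extract them to a local area (e.g. $TMP_DIR inside
# a job) when they need to be read
tar -czf results.tar.gz ./many_small_files/
cp results.tar.gz /lustre1/g/$PI_GROUP/
```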
Checking disk quota and usage
The command `diskquota` shows your disk quota and usage.
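For example, run the command on a login node as shown below; the `du` commands are generic alternatives for estimating the size of a particular folder and are not specific to HPC2021.

```bash
# Show disk quota and current usage
diskquota

# Generic alternative: estimate how much space a folder occupies
du -sh $HOME
du -sh /scr/u/$USER
```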
In the unfortunate case that files are accidentally deleted or modified, users may roll them back from the daily snapshots. More details can be found here.
Centrally installed software
Common software applications and utilities are centrally installed on the system and made accessible to users via Environment Modules. Environment Modules allows users to dynamically configure the shell environment required for the execution of a specific application, and additionally caters for dependencies and conflicts between the environments required by different applications or application versions.
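A minimal sketch of the standard module commands is shown below; "gcc" is only an illustrative module name, so check `module avail` for the actual names and versions installed on HPC2021.

```bash
# List the centrally installed software made available as modules
module avail

# Load a module to configure the shell environment for an application
module load gcc      # illustrative module name

# Show what is currently loaded, and unload when no longer needed
module list
module unload gcc
```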
Detailed usage and list of available software are available here.
Job Scheduler
Upon login to a login node, users should not run their analysis applications there directly, as the login nodes are not designed for, nor powerful enough to handle, heavy workloads. Instead, users run their analyses on the compute nodes, the actual workhorses of an HPC system, via a job scheduler. The scheduler we use to manage the submission, scheduling and management of jobs is called SLURM. On a login node, the user writes a batch script and submits it to the queue manager, which schedules it for execution on the compute nodes. The submitted job then queues up until the requested system resources are allocated. The queue manager schedules a job to run on a queue (or partition, in SLURM terms) according to a predetermined site policy designed to balance competing user needs and to maximize efficient use of cluster resources. A minimal batch script sketch is shown below.
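The following is only a sketch of a SLURM batch script: the partition, module and program names are placeholders, so consult the documentation linked below for the partitions and policies that actually apply on HPC2021.

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis   # job name shown in the queue
#SBATCH --partition=normal       # illustrative partition name; replace with one available on HPC2021
#SBATCH --ntasks=1               # number of tasks (processes)
#SBATCH --cpus-per-task=4        # CPU cores per task
#SBATCH --time=01:00:00          # wall-clock time limit (hh:mm:ss)
#SBATCH --output=%x_%j.out       # output file (%x = job name, %j = job ID)

# Load required software via Environment Modules, then run the analysis
module load gcc                  # illustrative module name
./my_analysis_program            # illustrative executable
```

The script can then be submitted with `sbatch myjob.sh`, monitored with `squeue -u $USER`, and cancelled with `scancel <jobid>`.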
Detailed usage is available here.