System Access
- Logging in to the system
- Transferring Data Files
- Changing Your Password
- Editing the Program
- Environment Modules
Resource Management System
Useful Commands
System Overview
The HPC2015 system is a heterogeneous High Performance Computing (HPC) Linux cluster comprising various kinds of computing resources: compute nodes with fast multicore processors for general compute-intensive work, and special-purpose compute nodes with large memory, GPU and MIC capabilities for data-intensive and accelerated computing. It is designed to support both compute- and data-intensive research. This heterogeneous environment offers researchers diverse and emerging computing technologies with which to explore new solution approaches, new research opportunities and relationships among distinct research areas.
This user guide, and use of the system, assumes familiarity with the Linux/UNIX software environment. For an introduction to Linux/UNIX, please study the UNIX user’s guide on the ITS web page.
System Access
To ensure a secure login session, users must connect to the HPC2015 system with a secure shell (SSH) program through the HKU campus network. If you are outside the University network, you should first connect to the HKU campus network via HKUVPN with 2FA. SSH is not bundled with MS Windows, so you may need to download an SSH client such as PuTTY. Please visit SSH with Putty for more details (Host Name: hpc2015.hku.hk / hpc2015-file.hku.hk).
Logging in to the system
To log in to the HPC2015 system, use one of the two frontend nodes:
hpc2015.hku.hk, which is reserved for program modification, compilation and job queue submission/manipulation;
hpc2015-file.hku.hk, which is reserved for file transfer, file management and data analysis/visualization.
If you connect to the HPC2015 from a UNIX or Linux system with SSH:
ssh <username>@hpc2015.hku.hk or ssh -l <username> hpc2015.hku.hk
When you log on to a login node, you should be in your home directory ($HOME), which is also accessible from the compute nodes. Do not use the frontend nodes for computationally intensive processes. These nodes are meant for compilation, program editing, simple data analysis and file management. All computationally intensive jobs should be submitted and run through the job scheduling system.
Transferring Data Files
Data transfer must be done using the secure commands scp/sftp or an SCP client such as WinSCP/FileZilla. Your local machine must be connected to the HKU campus network beforehand. Please visit SSH and Secure File Transfer for the procedure on how to make an SSH/SFTP connection. Be reminded that only hpc2015-file.hku.hk supports file transfer.
If you already have some file(s) or folder(s) on another Linux system and would like to copy them to HPC2015, you may use the scp command:
scp $SRC_HOST:$SRC_FILE hpc2015-file.hku.hk:$DST_FILE
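For example (the user name, file names and paths below are illustrative only; substitute your own), you could push data from your local Linux machine to HPC2015 like this:
# Copy a single file to your HPC2015 home directory
scp myresults.tar.gz h0xxxxxx@hpc2015-file.hku.hk:~/
# Copy a whole folder recursively
scp -r ./project_data h0xxxxxx@hpc2015-file.hku.hk:~/project_data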
Changing Your Password
You can reset the account password by changing your HKU Portal PIN. Whenever your HKU Portal PIN is changed, the HPC2015 account password will be updated correspondingly.
Editing the Program
You can use the command vi, emacs, nano or pico to edit programs. Please refer to the UNIX user’s guide for details.
Important notice for Microsoft Windows users:
Any files you transfer from Windows to the Linux HPC system may be incompatible because Windows and UNIX use different control characters to mark the end of a line (EOL). To convert a file from DOS (Windows) format to UNIX format, use the command dos2unix <filename>.
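For example (the file name is illustrative), you can check whether a transferred file is in DOS format and then convert it in place:
$ file myjob.cmd        # a DOS-format file is reported as having CRLF line terminators
$ dos2unix myjob.cmd    # convert the file to UNIX line endings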
Environment Modules
Applications, software, compilers, tools, communication libraries and math libraries on the cluster system are updated regularly. HPC2015 uses Environment Modules to dynamically set up the environment for different applications. Module commands set, change or delete the environment variables needed by a particular application; for example, the ‘module load’ command sets PATH, LD_LIBRARY_PATH and other environment variables. Users can thus easily choose among different versions of applications or libraries.
Useful Module commands
Command | Description |
---|---|
module list / ml | List currently loaded module(s) |
module avail / ml avail | Show what modules are available for loading |
module keyword [word1] [word2] … | Show available modules matching the search criteria |
module whatis [module_name] / module help [module_name] | Show description of a particular module |
module load [module_name] / module load [module_name]/[version] / module load [mod A] [mod B] … | Configure your environment according to the modulefile(s) |
module unload [module_name] / module unload [mod A] [mod B] … | Roll back configuration performed by the modulefile(s) |
module swap [module A] [module B] | Unload modulefile A and load modulefile B |
module purge | Unload all modules currently loaded |
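As a quick illustration, a typical interactive session might look like the following (the module names here are hypothetical; run module avail to see what is actually installed on HPC2015):
$ module avail                  # list the modules installed on the system
$ module load openmpi           # load a hypothetical MPI module
$ module list                   # confirm what is currently loaded
$ module swap openmpi mvapich2  # switch to another (hypothetical) MPI stack
$ module purge                  # start again with a clean environment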
Using modules
You must unload some modules before loading others (e.g. different MPI libraries). Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command.
If there is a set of modules that you use regularly and want to have loaded at login, you can add the module commands to your shell configuration file (.bashrc for bash users, .cshrc for C shell users), as sketched below.
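For example, a bash user might append lines like these to ~/.bashrc (the module names are hypothetical; use the names shown by module avail):
# Load frequently used modules at every login
module load intel
module load openmpi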
Sometimes there may be a caching error while listing/loading modules. You can delete the cache file and then run the module command again.
$ module avail
/usr/bin/lua: /home/[user]/.lmod.d/.cache/moduleT.x86_64_Linux.lua:function expression ...
$ rm -f /home/[user]/.lmod.d/.cache/*.lua
Resource Management System
Torque Resource Manager is a queue management system for managing and monitoring the computational workload of the cluster system. Users write a batch script and submit it to the queue manager. Submitted jobs then queue up until the requested system resources are allocated. The queue manager schedules your job to run on the queue that you designate, according to a predetermined site policy meant to balance competing user needs and to maximize efficient use of the cluster resources.
Job Queues
The HPC system is set up to support large computation jobs. The following maximum number of nodes and maximum processing time are allowed per batch job:
(A) Default queues available to all users
Queue Name | Maximum no. of nodes | Maximum processing time (wall clock time) | Resources per node |
---|---|---|---|
debug | 2 | 30 minutes | 20 cores, 96GB RAM |
parallel | 4 | 24 hours | 20 cores, 96GB RAM |
fourday | 6 | 96 hours | 20 cores, 96GB RAM |
(B) Queues that require special approval*
Queue Name | Maximum no. of nodes | Maximum processing time (wall clock time) | Resources per node |
---|---|---|---|
gaussian | 1 | 336 hours | 20 cores, 96GB RAM |
special | 24 | 336 hours | 20 cores, 96GB RAM |
gpu | 2 | 336 hours | 20 cores, 96GB RAM, two K20X GPUs |
mic | 2 | 336 hours | 20 cores, 96GB RAM, two Xeon Phi 7120P |
hugemem | 1 | 336 hours | 40 cores, 512GB RAM |
* Users whose programs/applications show good efficiency and scalability may request more computing resources per job for their intensive parallel computation. Users whose programs/applications can exploit the special computing resources (i.e. GPU, MIC, huge memory) may request to use those resources. Please fill in form CF162f to apply for additional computing resources for using the research computing (HPC/AI/HTC) facilities.
Furthermore, job scheduling is configured so that higher priority is given to parallel jobs requiring a larger number of processors.
PBS Job command file
To execute your program on the cluster system, you need to write a batch script and submit it to the queue manager. Samples of general PBS scripts are available in your HPC2015 home directory (~/pbs-samples/); please also refer to the individual software user guides.
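As a minimal sketch (the module name, executable and core count are hypothetical, and the exact mpirun invocation depends on the MPI library you load; the samples in ~/pbs-samples/ show the recommended forms), an MPI job script for the parallel queue might look like:
#!/bin/bash
#PBS -N myjob                     # job name (illustrative)
#PBS -q parallel                  # destination queue
#PBS -l nodes=2:ppn=20            # 2 nodes x 20 cores each
#PBS -l walltime=24:00:00         # within the 24-hour limit of the parallel queue
cd $PBS_O_WORKDIR                 # run from the directory where qsub was issued
module load openmpi               # hypothetical module name
mpirun -np 40 -machinefile $PBS_NODEFILE ./my_program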
To utilize the GPU/MIC resources in the special compute nodes, additional generic resources (GRES) have to be defined, as shown in the following examples.
1. Four CPU cores and two GPU cards
#PBS -q gpu
#PBS -l nodes=1:ppn=4
#PBS -W x=GRES:gpu@2
2. Four CPU cores and one MIC card
#PBS -q mic
#PBS -l nodes=1:ppn=4
#PBS -W x=GRES:mic@1
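Putting the GPU example together, a complete job script might look like the sketch below (the job name, module name, executable and walltime are illustrative; adapt them to your application and the queue limits):
#!/bin/bash
#PBS -N gpu_test                  # job name (illustrative)
#PBS -q gpu
#PBS -l nodes=1:ppn=4
#PBS -W x=GRES:gpu@2
#PBS -l walltime=24:00:00         # adjust within the gpu queue limit
cd $PBS_O_WORKDIR
module load cuda                  # hypothetical module name
./my_gpu_program                  # hypothetical executable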
Useful Commands
Submitting a Job
To submit a job, use the qsub command:
$ qsub pbs-mpi.cmd
226.hpc2015.hku.hk
Upon successful submission of a job, PBS returns a job identifier of the form JobID.hpc2015.hku.hk where JobID is an integer number assigned by PBS to that job. You’ll need the job identifier for any actions involving the job, such as checking job status or deleting the job.
While the job is being executed, PBS spools the program output. At the end of the job, the files JobName.oxxxx and JobName.exxxx (where xxxx is the numeric job identifier) are copied to the working directory; they contain the standard output and standard error that were not explicitly redirected in the job command file.
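If you want more control over these files, standard PBS directives can be added to the job script, for example (the job name here is illustrative):
#PBS -N mysimulation    # sets JobName, so the files become mysimulation.oxxxx / mysimulation.exxxx
#PBS -j oe              # optionally merge standard error into the standard output file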
Manipulating a Job
There are several commands for monitoring and manipulating jobs:
- List the status of all your jobs
$ qa
hpc2015.hku.hk:
                                                              Req'd    Req'd      Elap
Job ID         Username Queue    Jobname    SessID NDS  TSK  Memory   Time     S Time
-------------- -------- -------- ---------- ------ ---- ----- ------- --------- - ---------
216.hpc2015-mg h0xxxxxx gaussian test        26530    1    20      -- 336:00:00 R 03:33:30
226.hpc2015-mg h0xxxxxx fourday  MIrSi50g     6859    4    80      --  96:00:00 R 01:20:03
Job information provided:
Username : Job owner
NDS : Number of nodes requested
TSK : Number of processors requested
Req’d Memory : Requested amount of memory
Req’d Time : Requested amount of wallclock time
Elap Time : Elapsed time in the current job state
S : Job state (E-Exit; R-Running; Q-Queuing)
- List running node(s) of a job
Command : qstat -n <JOB_ID> or qa -n <JOB_ID>

$ qstat -n 226
hpc2015.hku.hk:
                                                              Req'd    Req'd      Elap
Job ID         Username Queue    Jobname    SessID NDS  TSK  Memory   Time     S Time
-------------- -------- -------- ---------- ------ ---- ----- ------- --------- - ---------
216.hpc2015-mg h0xxxxxx gaussian test        26530    1    20      -- 336:00:00 R 03:34:30
   GP-4-20/0+GP-4-20/1+GP-4-20/2+GP-4-20/3+GP-4-20/4+GP-4-20/5+GP-4-20/6+GP-4-20/7
   +GP-4-20/8+GP-4-20/9+GP-4-20/10+GP-4-20/11+GP-4-20/12+GP-4-20/13+GP-4-20/14
   +GP-4-20/15+GP-4-20/16+GP-4-20/17+GP-4-20/18+GP-4-20/19
- Checking the resource utilization of a running job
Command : ta <JOB_ID>

$ ta 216
JOBID: 216
================================ GP-4-20 ===================================
top - 16:41:18 up 149 days, 11:54,  0 users,  load average: 20.05, 19.80, 19.73
Tasks: 608 total,   2 running, 606 sleeping,   0 stopped,   0 zombie
Cpu(s): 79.0%us,  1.9%sy,  0.0%ni, 16.0%id,  3.1%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  99077612k total, 10895060k used, 88182552k free,    84436k buffers
Swap: 122878968k total,    19552k used, 122859416k free,  7575444k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+ COMMAND
29144 h0xxxxxx  20   0 97.5g  84m 6200 R 1995.2  1.7  4982:46 l502.exe
 2667 h0xxxxxx  20   0 15932 1500 1248 S    2.0  0.0   0:00.00 top
 2622 h0xxxxxx  20   0 98.8m 1284 1076 S    0.0  0.0   0:00.00 sshd
 2623 h0xxxxxx  20   0  105m  896  696 S    0.0  0.0   0:00.00 g09
 2668 h0xxxxxx  20   0  100m  836 1168 S    0.0  0.0   0:00.00 226.hpc2015
29800 h0xxxxxx  20   0  105m 1172  836 R    0.0  0.0   0:00.00 bash
29801 h0xxxxxx  20   0  100m  848  728 S    0.0  0.0   0:00.00 grep
29802 h0xxxxxx  20   0 98.6m  604  512 S    0.0  0.0   0:00.00 head

Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        1.5T   59G  1.4T   5% /tmp
The CPU utilization is shown in the %CPU column. This example shows the process l502.exe running in parallel on the 20-core node with 1995.2% CPU utilization (2000% utilization means all 20 cores of GP-4-20 are fully used). The output also provides information such as memory usage (10895060k, i.e. roughly 10GB used), the runtime of the processes and the local /tmp disk usage (59GB used).
- List all nodes
$ pa
GP-1-1
GP-1-2
GP-1-3
......
GP-2-7
     jobs = 0/226, 1/226, 2/226, 3/226, 4/226, 5/226, 6/226, 7/226, 8/226, 9/226, 10/226, 11/226, 12/226, 13/226, 14/226, 15/226, 16/226, 17/226, 18/226, 19/226
GP-2-8
     jobs = 0/226, 1/226, 2/226, 3/226, 4/226, 5/226, 6/226, 7/226, 8/226, 9/226, 10/226, 11/226, 12/226, 13/226, 14/226, 15/226, 16/226, 17/226, 18/226, 19/226
......
GP-4-20
     jobs = 0/216, 1/216, 2/216, 3/216, 4/216, 5/216, 6/216, 7/216, 8/216, 9/216, 10/216, 11/216, 12/216, 13/216, 14/216, 15/216, 16/216, 17/216, 18/216, 19/216
......
- Delete a job
Command : qdel <JOB_ID>

$ qdel 226
Checking Queue Usage
You can use the command ‘queue‘ or ‘q‘ to check the status of the queues that you are authorized to use.
$ queue
+---------------+-----------+--------+------------+---------+---------+
|               |   Cores   | Total  | Available  | Running | Queuing |
|  Queue Name   | Per Node  | Nodes  | Full Nodes |  Jobs   |  Jobs   |
+---------------+-----------+--------+------------+---------+---------+
  debug              20          2          2          0         0
  parallel           20         24          0         19        10
  fourday            20         24          0         15         5
  gaussian           20         32          2         30         2
  hugemem            40          3          1          4         1
Checking Disk Quota
To check your disk usage, you can use the ‘diskquota‘ command:
$ diskquota
Disk quotas for user h0xxxxxx at Thu May 5 15:52:32 HKT 2019:
+----------------------+------------------------------+---------------------------------+
|                      |         Block limits         |           File limits           |
|      Filesystem      |  used   quota   limit  grace |  files   quota   limit   grace  |
+----------------------+------------------------------+---------------------------------+
 /home/h0xxxxxx          5665M  20480M  21504M      0    146k       0       0       0
 /data/h0xxxxxx         37.32G    100G    105G      -   25564       0       0       -