General

System overview

The Gridpoint system consists of 144 compute nodes (1,408 cores in total).

  • 128 quad-core nodes. Each node contains two 64-bit quad-core Intel Nehalem processors at 2.53GHz and 32GB of RAM.
  • 16 six-core nodes. Each node contains two 64-bit six-core Intel Westmere processors at 2.66GHz and 48GB of RAM.

The Gridpoint system uses 64-bit Scientific Linux 5.3 as its operating system. This user guide, and use of the system, assumes familiarity with the Linux/UNIX software environment. To get an understanding of Linux/UNIX, please study the UNIX user's guide on the ITS web page.

The Gridpoint system uses the Torque Resource Manager to distribute the computational workload across the processors. Torque, similar to OpenPBS, is a batch job scheduling application that provides facilities for building, submitting and processing batch jobs on the system.

Jobs are submitted to the system by creating a PBS job command file that specifies certain attributes of the job, such as how long the job is expected to run and, in the case of parallel programs, how many processors are needed, and so forth. PBS then schedules when the job is to start running on the cluster (based in part on those attributes), runs and monitors the job at the scheduled time, and returns any output to the user once the job completes.
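As a preview, the most basic attributes are given as #PBS directives near the top of the job command file, for example (the job name myjob is only a placeholder; a complete job command file is shown in the Resource Management System section below):

#PBS -N myjob                 ### job name
#PBS -q serial                ### queue name
#PBS -l walltime=04:00:00     ### maximum wall clock time (4 hours)
#PBS -l nodes=1:ppn=1         ### 1 node with 1 core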

Logging into the system

You can log in to the Gridpoint system through SSH from within the HKU campus network. If you are outside the University network, you should first connect to the HKU campus network via HKUVPN with 2FA. SSH is not bundled with MS Windows, so you may need to download an SSH client such as PuTTY. Please visit SSH with PuTTY for more details (Host Name: gridpoint.hku.hk).

If you connect to Gridpoint from a UNIX or Linux system with SSH:

ssh <username>@gridpoint.hku.hk   or    ssh -l <username> gridpoint.hku.hk

After you log in, the system places you on the master node, which acts as the control console for interactive work such as source code editing, compilation, program testing and submitting jobs through the Torque Resource Manager. When you log on to the master node you should be in your home directory (either /home1/$LOGNAME or /home2/$LOGNAME), which is also accessible by the batch nodes.

Editing the program

You can use the command vi, emacs or pico to edit programs. Please refer to the UNIX user's guide for details.

Important notice for Microsoft Windows users: do not use a standard Microsoft Windows editor such as Notepad to edit files that will be used on Linux or other UNIX systems. The two systems use different control-character sequences to mark the end of line (EOL). If you are using the system from a Microsoft Windows desktop machine, please SSH to the master node and edit the program directly using pico/nano.
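If a file has already been edited with a Windows editor, and assuming the dos2unix utility is installed on the master node (you can check with "which dos2unix"), the Windows line endings can be converted, for example:

dos2unix myprogram.c     # myprogram.c is a placeholder file name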

Configuring your account

Every user account is pre-configured with the necessary environment. You can use all software on the system without any modification to system files such as .rhosts, .bashrc or .bash_profile.

You can copy the most up-to-date system files from the directory /etc/skel in case your copies are deleted by accident.
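For example, assuming your login shell is bash, an accidentally deleted .bashrc can be restored with:

cp /etc/skel/.bashrc $HOME/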

When a job is submitted to the cluster through Torque, a new login to your account is initiated and any initialization commands in your startup files (.bashrc, .bash_profile, etc.) are executed. Because the job runs in batch mode, do not put interactive commands (such as tset or stty) or commands that generate output in these startup files. If these precautions are not taken, error messages will be written to the batch job's error file and your program may fail to run.
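A minimal sketch of how to keep such commands out of batch logins, assuming the bash shell, is to wrap them in an interactivity test in your .bashrc:

# Run terminal-related commands only in interactive logins,
# so that batch logins started by Torque skip them.
case $- in
    *i*)
        stty erase '^?'      # example interactive command; adjust as needed
        ;;
    *)
        ;;                   # batch login: generate no output
esac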

Transferring data files

Similar to logging in to the system, you can use SCP to connect to gridpoint.hku.hk from within the HKU campus network. SCP is not bundled with MS Windows, so you may need to download an SCP client such as WinSCP. You may visit SSH and Secure File Transfer for more details.

If you already have files or folders located on other HPC systems (e.g. hpc2015) and would like to copy them to Gridpoint, you may use the SCP command:

scp $SRC_HOST:$SRC_FILE $DST_HOST:$DST_FILE
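For example, while logged in on Gridpoint, a whole folder could be pulled from hpc2015 like this (the username, the host name hpc2015.hku.hk and the paths are placeholders; adjust them to your own account):

scp -r h0xxxxxx@hpc2015.hku.hk:~/project/data ~/project/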


Resource Management System

Torque Resource Manager

The Torque Resource Manager is the replacement for the OpenPBS resource management system and handles the management and monitoring of the computational workload on the Gridpoint system. Users submit “jobs” to the resource management system, where they are queued until the requested system resources are allocated and the jobs are started. Torque selects which jobs to run, when to run them and which nodes to run them on, according to a predetermined site policy meant to balance competing user needs and to maximize efficient use of the cluster resources.

To use Torque, you have to create a batch job command file which you submit to the Torque server to run on the system. A batch job file is simply a shell script containing the set of commands you want to run on the batch nodes. It also contains directives that specify the characteristics (attributes) and resource requirements (e.g. number of nodes and maximum runtime) of your job. Once you create your PBS job file, you can reuse it or modify it for subsequent runs.

Since the system is set up to support large computation jobs, the following maximum number of CPUs and maximum processing time are allowed for each batch job:

(A) Queues available to all users

Queue Name    Max. no. of core(s) per job     Processing time per job (wall clock time)   Memory per node   Network Connection
serial        1   (i.e. 1 node with 1 core)   max. 168 hours                              32GB              Gb Ethernet
parallel      16  (i.e. 2 nodes of 8 cores)   max. 24 hours                               32GB              Gb Ethernet
fourday       64  (i.e. 8 nodes of 8 cores)   min. 10 hours, max. 96 hours                32GB              Gb Ethernet
oneday-6c     24  (i.e. 2 nodes of 12 cores)  max. 24 hours                               48GB              Gb Ethernet
ib-parallel   16  (i.e. 2 nodes of 8 cores)   max. 24 hours                               32GB              InfiniBand

(B) Queues that require special application*

Queue Name    Max. no. of core(s) per job      Processing time per job (wall clock time)   Memory per node   Network Connection
twoweek       16  (i.e. 2 nodes of 8 cores)    min. 96 hours, max. 336 hours               32GB              Gb Ethernet
ib-oneweek    112 (i.e. 14 nodes of 8 cores)   max. 168 hours                              32GB              InfiniBand
ib-special    128 (i.e. 16 nodes of 8 cores)   min. 168 hours, max. 2160 hours             16GB              InfiniBand

* Starting from Jan 2016, the processing time of the parallel queue is extended to 24 hours. Users can apply to use the special queues (twoweek, ib-oneweek, etc.) once they are familiar with the cluster system environment (UNIX file system commands, PBS batch system, etc.), by submitting form CF162e (for staff) or CF162f (for students).

* Users whose programs/applications show good efficiency and scalability may request more computational resources per job for intensive parallel computation.

Furthermore, job scheduling is set up in such a fashion that higher priority is given to parallel jobs requiring a larger number of processors.

In order to provide a fair-share environment for all users, the system is set so that each user can place no more than 10 jobs in the job queue and have no more than 20 jobs running at the same time.

PBS Job command file

To submit a job to run in the MPI environment, a PBS job command file must be created. The job command file is a shell script that contains PBS directives, which are preceded by #PBS.

The following is an example of a PBS command file to run a parallel job requiring 2 nodes with 8 cores per node. You should only need to change items such as the job name, queue name, wall time, node/core request and the name of the executable. This file is also located on the system as /etc/skel/pbs-mpi.cmd.

#!/bin/sh
### Job name
#PBS -N test-mpi

### Declare job non-rerunable
#PBS -r n

### Queue name (parallel, fourday)
#PBS -q parallel

### Wall time required. This example is 4 hours 30 min
#PBS -l walltime=04:30:00

### Number of nodes
### The following means 1 node and 1 core.
### Clearly, this is for serial job
###PBS -l nodes=1:ppn=1

### The following means 2 nodes required. Processor Per Node=8,
### i.e., total 16 CPUs needed to be allocated.
### ppn (Processor Per Node) can be any integer from 1 to 8.
#PBS -l nodes=2:ppn=8

### The following commands are executed on the first allocated node.
### Please don't modify them.
echo $PBS_JOBID : `wc -l < $PBS_NODEFILE` CPUs allocated: `cat $PBS_NODEFILE`
cd $PBS_O_WORKDIR
### Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
### Strip the server suffix from the job ID (e.g. 216.gridpoint.hku.hk -> 216)
JID=`echo ${PBS_JOBID} | sed "s/.gridpoint.hku.hk//" `
MACHFILE=/tmp/machine.j${JID}
### Build a machine file listing each allocated node with 8 slots
cat $PBS_NODEFILE | /usr/bin/uniq | /bin/awk '{print $1":8"}' > $MACHFILE
echo ===========================================================
echo "Job Start Time is `date "+%Y/%m/%d -- %H:%M:%S"`"


### Run the parallel MPI executable "a.out"
time mpirun -np ${NPROCS} -machinefile ${MACHFILE} ./a.out > ${PBS_JOBNAME}.${JID}

echo "Job Finish Time is `date "+%Y/%m/%d -- %H:%M:%S"`"
rm -f ${MACHFILE}

After the PBS directives in the command file, the shell executes a change-directory command to $PBS_O_WORKDIR, a PBS variable holding the directory from which the PBS job was submitted and nominally where the program executable is located. Other shell commands can be executed as well. In the mpirun line, the executable itself is invoked.

If you are running an MPI program, the command “mpirun -np ${NPROCS} -machinefile ${MACHFILE} ./programfile” should be used. It is necessary to tell MPI how many processes to start and where the machine file is located.

The redirection to ${PBS_JOBNAME}.${JID} sends the standard output of the program to a text file named JobName.JobID. You can inspect this file from time to time to check the progress of the program.
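For example, assuming the job name test-mpi and job ID 216 used elsewhere in this guide, you can follow the output as it is written:

tail -f test-mpi.216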

Submitting a Job

To submit the job, use the qsub command:

[h0xxxxxx@gridpoint test]$ qsub pbs-mpi.cmd
216.gridpoint.hku.hk

Upon successful submission of a job, PBS returns a job identifier of the form JobID.gridpoint.hku.hk where JobID is an integer number assigned by PBS to that job. You’ll need the job identifier for any actions involving the job, such as checking job status or deleting the job.

While the job is executing, the program output is stored in the file JobName.xxxx, where xxxx is the job identifier. At the end of the job, the files JobName.oxxxx and JobName.exxxx are also copied to the working directory, containing the standard output and standard error that were not explicitly redirected in the job command file.

Manipulating a Job

There are several commands for manipulating jobs:

  1. List the status of all your jobs
    [h0xxxxxx@gridpoint test]$ qa
    gridpoint.hku.hk:
                                                         Req'd      Elap 
    Job ID     Username Queue    Jobname    SessID NDS   Time     S Time
    ---------- -------- -------- ---------- ------ ----- -------- - --------
    216.gridpo h0xxxxxx parallel test       26530      2 04:30:00 R 03:33:30
    226.gridpo h0xxxxxx oneday-6 MIrSi50g   6859       1 24:00:00 R 00:20:03

    Job information provided

    Username : Job owner
    NDS : Number of nodes requested
    Req’d Time : Requested amount of wallclock time
    Elap Time : Elapsed time in the current job state
    S : Job state (E-Exit; R-Running; Q-Queuing)
  2. List all nodes
    [h0xxxxxx@gridpoint test]$ pa
    ma00
    ma01
    ma02
    ...
    md14
         jobs = 0/216, 1/216, 2/216, 3/216, 4/216, 5/216, 6/216, 7/216
    md15
         jobs = 0/216, 1/216, 2/216, 3/216, 4/216, 5/216, 6/216, 7/216
    ....
    mj02
         jobs = 0/226, 1/226, 2/226, 3/226, 4/226, 5/226, 6/226, 7/226, 8/226, 9/226, 10/226, 11/226 
  3. List running node(s) of a job
    Command : qstat -n <JOB_ID>    or    qa -n <JOB_ID>

    [h0xxxxxx@gridpoint test]$ qstat -n 216
    gridpoint.hku.hk:
                                                         Req'd      Elap 
    Job ID     Username Queue    Jobname    SessID NDS   Time     S Time
    ---------- -------- -------- ---------- ------ ----- -------- - --------
    216.gridpo h0xxxxxx parallel C2Irfc      26530     2 04:30:00 R 03:34:30
       md14/7+md14/6+md14/5+md14/4+md14/3+md14/2+md14/1+md14/0+md15/7+md15/6+md15/5
       +md15/4+md15/3+md15/2+md15/1+md15/0
  4. Checking the node utilization for a job
    Command : ta <JOB_ID>

    [h0xxxxxx@gridpoint test]$ ta 226
    JOBID: 226
    =============================mj02===============================
    top - 10:17:18 up 149 days, 11:54,  0 users,  load average: 12.00, 12.01, 11.94
    Tasks: 128 total,   2 running, 126 sleeping,   0 stopped,   0 zombie
    Cpu(s): 65.0%us,  3.7%sy,  0.0%ni, 40.0%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
    Mem:  49432464k total,  7914568k used, 41517896k free,   472972k buffers
    Swap: 50339668k total,    63292k used, 50276376k free,  5146572k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
    28276 h0xxxxxx  25   0 6607m 266m 6200 R 1197.7  0.8 482:46.60 l502.exe          
    28214 h0xxxxxx  18   0 65932 1500 1248 S  0.0  0.0   0:00.00 bash               
    28263 h0xxxxxx  22   0 63840 1284 1076 S  0.0  0.0   0:00.00 226.gridpoint    
    28272 h0xxxxxx  18   0 90040  896  696 S  0.0  0.0   0:00.00 g09                               
    28800 h0xxxxxx  15   0 12604 1172  836 R  0.0  0.0   0:00.00 top                
    
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sda5             181G  188M  171G   1% /tmp
    

    You can see the CPU utilization in the %CPU column. This example shows the process l502.exe running in parallel on the 12-core node mj02 with 1197.7% CPU utilization (1200% utilization means all 12 cores of mj02 are fully used). It also provides information such as memory usage and the runtime of the processes.

  5. Delete a job
    Command : qdel <JOB_ID>

    [h0xxxxxx@gridpoint test]$ qdel 226