The tool used to manage the submission, scheduling and management of jobs on HPC2021 and AI-Research is SLURM. On a login node, a user writes a batch script and submits it to the queue manager, which schedules it for execution on the compute nodes. The submitted job then queues until the requested system resources are allocated. The queue manager schedules each job to run in a queue (called a partition in SLURM) according to a predetermined site policy designed to balance competing user needs and to maximize efficient use of cluster resources.
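For illustration, a minimal SLURM batch script might look like the sketch below. The partition name and resource amounts are placeholders, not the actual HPC2021 or AI-Research settings; consult the site documentation for the real partitions and limits.

```shell
#!/bin/bash
#SBATCH --job-name=my_job          # name shown in the queue
#SBATCH --partition=normal         # partition (queue); "normal" is a placeholder
#SBATCH --ntasks=1                 # number of tasks (processes)
#SBATCH --cpus-per-task=4          # CPU cores per task
#SBATCH --time=01:00:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=my_job_%j.out     # output file; %j expands to the job ID

# Commands below run on the allocated compute node
echo "Job ${SLURM_JOB_ID:-<unset>} running on $(hostname)"
```

Submitting the script with `sbatch my_job.slurm` returns a job ID; the job then waits in the partition until the requested resources are allocated.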

Each job’s position in the queue is determined by the fairshare algorithm, which depends on a number of factors (e.g. job size, time requirement, job queuing time, resource usage in the previous month). The HPC system is set up to support large computational jobs. Maximum CPU counts and processing time limits are summarized in the tables below. Please note that the limits are subject to change without notice.
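Once a job is submitted, the standard SLURM client tools can be used on a login node to inspect queue state, fairshare standing and partition limits. The snippet below is a sketch; the partition name passed to `scontrol` is a placeholder.

```shell
# List your pending and running jobs with their current state
squeue -u "$USER"

# Show your fairshare usage, which influences queue priority
sshare -u "$USER"

# Summarize per-partition limits: partition, time limit, nodes, CPUs per node
sinfo -o "%P %l %D %c"

# Detailed settings for one partition ("normal" is a placeholder name)
scontrol show partition normal
```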

A cheat sheet for the SLURM job scheduler is available, covering:

    1. Partition & QoS
    2. Job Scripts
    3. Job Array
    4. Job Dependencies
    5. Job Submission
    6. Job Management
    7. Usage Monitoring