You can also see the latest version here.

1 Job submission guide

As a rule, submit each job with a full allocation (whole nodes and all of their CPU cores), so that it runs efficiently and you do not have to queue repeatedly for the same computation.

  • As many resources as possible

    For CPU and HIMEM nodes, set the CPU core count to 40; for GPU nodes, set it to 20; for amd nodes, set it to 94.

    # For CPU/himem (-n is the short form of --ntasks, so one of them is enough)
    -n 40
    
    # For GPU
    -n 20
    
    # For amd
    -n 94
    
  • Urgent jobs

    For an urgent job, you can share a node with other users instead of waiting for a whole node to become free.

    In this case, request only as many CPU cores as the job needs, so that it fits alongside other users' jobs on a shared node and starts sooner.
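The per-node core counts above can be kept in one place with a small shell helper. This is only a sketch: the function name `full_node_cores` and the node-type keys are hypothetical, and the numbers are simply the ones quoted in this guide.

```shell
# Hypothetical helper: map a node type to the core count of a whole node.
# The values (40 / 20 / 94) are the ones quoted in this guide.
full_node_cores() {
    case "$1" in
        cpu|himem) echo 40 ;;
        gpu)       echo 20 ;;
        amd)       echo 94 ;;
        *)         echo "unknown node type: $1" >&2; return 1 ;;
    esac
}

# Example on the cluster (not run here): request a whole amd node.
# srun -p amd -N 1 -n "$(full_node_cores amd)" --exclusive --pty bash
```

For an urgent job, you would pass a smaller number to `-n` instead of the value printed by the helper.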

2 Resource overview

Please refer to the Hardware overview for the latest updates.

The maximum number of cores per node is listed below:

# CPU/HIMEM nodes
MAX_CPU_CORE=40

# GPU nodes
MAX_CPU_CORE=20

# amd nodes
MAX_CPU_CORE=94

3 Submission guide

3.1 Check the available nodes

Check the available nodes with the following command:

sinfo

You can then submit your jobs to the following kinds of nodes:

  • Idle nodes (no one is currently using the node)
  • Mixed nodes (you share the node's cores with other users)

3.1.1 Get idle nodes

Find the partitions that have idle nodes:

sinfo -a | awk 'NR==1 || /idle/'

# Output:
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
jiy            up   infinite      2   idle hhnode-ib-[32-33]
cpu            up   infinite     12   idle hhnode-ib-[95-98,100,237-238,240-242,251-252]
himem          up   infinite      1   idle hhnode-ib-103
gpu3090        up   infinite      9   idle hhnode-ib-[186-187,190-196]
isd            up   infinite      1   idle hhnode-ib-233
dbm            up   infinite      1   idle hhnode-ib-234
amd            up   infinite      4   idle hhnode-ib-[253-256]
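The awk filter above matches the word anywhere on the line, so it would also keep a row whose partition or node name happens to contain "idle". A slightly stricter variant compares only the STATE column. This is a sketch: `filter_by_state` is a hypothetical name, and it assumes the default column layout shown above, where STATE is the fifth field. States that carry a suffix (for example `idle*`) would not match the exact comparison.

```shell
# Hypothetical helper: keep the header line plus the rows whose STATE column equals $1.
# Reads sinfo-style text on stdin; assumes STATE is the 5th whitespace-separated field.
filter_by_state() {
    awk -v state="$1" 'NR == 1 || $5 == state'
}

# Usage on the cluster (not run here):
# sinfo -a | filter_by_state idle
# sinfo -a | filter_by_state mix
```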

3.1.2 Get mixed nodes

Find the partitions that have mixed nodes:

sinfo -a | awk 'NR==1 || /mix/'

# Output:
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu            up   infinite     15    mix hhnode-ib-[16,36,46-47,56,69,86,99,201,231,239,243,245-246,249]
cpu-share      up   infinite      9    mix hhnode-ib-[16,36,46-47,56,69,86,201,231]
gpu            up   infinite      4    mix hhnode-ib-[106-107,140,142]
gpu-share      up   infinite      4    mix hhnode-ib-[106-107,140,142]
x-gpu          up   infinite      2    mix hhnode-ib-[153,159]
x-gpu-share    up   infinite      2    mix hhnode-ib-[153,159]
gpu3090        up   infinite      1    mix hhnode-ib-189
math           up   infinite      2    mix hhnode-ib-[235-236]

3.2 Submit jobs on idle nodes (Whole node)

You can use srun, sbatch, or salloc to request all the resources of a whole node.

3.2.1 For srun on CPU/HIMEM nodes:

# srun does not read "#SRUN" lines from a script, so pass the options on the command line.
# -p: partition (use himem-share for HIMEM nodes)
# -N 1 -n 40 --exclusive: one whole node with 40 cores
# -J: Slurm job name; -t: maximum runtime of 24 hours
# --mail-user: update your email address
# --pty bash: open an interactive shell on the node
srun -p cpu-share -N 1 -n 40 --exclusive \
     -J testJobName -t 24:00:00 \
     --mail-user=user_name@ust.hk --mail-type=begin,end \
     --pty bash

3.2.2 For srun on GPU nodes:

# srun does not read "#SRUN" lines from a script, so pass the options on the command line.
# -p: partition
# -N 1 -n 20 --exclusive: one whole node with 20 cores
# -J: Slurm job name; -t: maximum runtime of 24 hours
# --mail-user: update your email address
# --pty bash: open an interactive shell on the node
srun -p gpu-share -N 1 -n 20 --exclusive \
     -J testJobName -t 24:00:00 \
     --mail-user=user_name@ust.hk --mail-type=begin,end \
     --pty bash

3.2.3 For srun on amd nodes:

# srun does not read "#SRUN" lines from a script, so pass the options on the command line.
# -p: partition
# -N 1 -n 94 --exclusive: one whole node with 94 cores
# -J: Slurm job name; -t: maximum runtime of 24 hours
# --mail-user: update your email address
# --pty bash: open an interactive shell on the node
srun -p amd -N 1 -n 94 --exclusive \
     -J testJobName -t 24:00:00 \
     --mail-user=user_name@ust.hk --mail-type=begin,end \
     --pty bash

3.2.4 For sbatch on CPU/HIMEM nodes:

# -N 1 -n 40
sbatch sbatchCommand.sh

cat sbatchCommand.sh

#!/bin/bash

#SBATCH -p cpu-share #Use himem-share for HIMEM nodes
#SBATCH -N 1
#SBATCH -n 40
#SBATCH --exclusive
#SBATCH --gres-flags=enforce-binding
#SBATCH -J testJobName #Slurm job name
#SBATCH -t 24:00:00 #Maximum runtime of 24 hours
#SBATCH --mail-user=user_name@ust.hk #Update your email address
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

3.2.5 For sbatch on GPU nodes:

# -N 1 -n 20
sbatch sbatchCommand.sh

cat sbatchCommand.sh

#!/bin/bash

#SBATCH -p gpu-share
#SBATCH -N 1
#SBATCH -n 20
#SBATCH --exclusive
#SBATCH --gres-flags=enforce-binding
#SBATCH -J testJobName #Slurm job name
#SBATCH -t 24:00:00 #Maximum runtime of 24 hours
#SBATCH --mail-user=user_name@ust.hk #Update your email address
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

3.2.6 For sbatch on amd nodes:

# -N 1 -n 94
sbatch sbatchCommand.sh

cat sbatchCommand.sh

#!/bin/bash

#SBATCH -p amd
#SBATCH -N 1
#SBATCH -n 94
#SBATCH --exclusive
#SBATCH --gres-flags=enforce-binding
#SBATCH -J testJobName #Slurm job name
#SBATCH -t 24:00:00 #Maximum runtime of 24 hours
#SBATCH --mail-user=user_name@ust.hk #Update your email address
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
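The sbatch scripts in this section differ only in the partition and the core count, and none of them shows the command that the job actually runs. As a rough sketch, a script of this shape could be generated by a small function. The name `make_sbatch_script` is hypothetical, only options already used in this guide appear in the header, and `srun hostname` is a placeholder payload rather than anything this guide prescribes.

```shell
# Hypothetical helper: print a whole-node sbatch script.
# $1 = partition, $2 = core count, $3 = job name (defaults to testJobName).
make_sbatch_script() {
    cat <<EOF
#!/bin/bash
#SBATCH -p ${1}
#SBATCH -N 1
#SBATCH -n ${2}
#SBATCH --exclusive
#SBATCH -J ${3:-testJobName}
#SBATCH -t 24:00:00

# Placeholder payload: replace this line with your own command.
srun hostname
EOF
}

# Usage on the cluster (not run here):
# make_sbatch_script amd 94 myJob > sbatchCommand.sh
# sbatch sbatchCommand.sh
```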

3.3 Submit jobs on mixed nodes (Fast debugging)

First, check the partitions and the number of available cores:

# check mixed nodes 
sinfo -a | awk 'NR==1 || /mix/'

# Output:
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu            up   infinite     15    mix hhnode-ib-[16,36,46-47,56,69,86,99,201,231,239,243,245-246,249]
cpu-share      up   infinite      9    mix hhnode-ib-[16,36,46-47,56,69,86,201,231]
gpu            up   infinite      4    mix hhnode-ib-[106-107,140,142]
gpu-share      up   infinite      4    mix hhnode-ib-[106-107,140,142]
x-gpu          up   infinite      2    mix hhnode-ib-[153,159]
x-gpu-share    up   infinite      2    mix hhnode-ib-[153,159]
gpu3090        up   infinite      1    mix hhnode-ib-189
math           up   infinite      2    mix hhnode-ib-[235-236]

# check how many cores are already in use on hhnode-ib-86
squeue -o "%.18i %.9P %.5D %.5C %.8j %.8u %.6g %.2t %.10M %R" | awk 'NR==1 || /hhnode-ib-86/'

# Output:
JOBID PARTITION NODES  CPUS     NAME     USER  GROUP ST       TIME NODELIST(REASON)
1030063 cpu-share     1     8     CuPt hlwongac keztlu  R    2:11:28 hhnode-ib-86

Second, submit the job, specifying the number of cores:

# hhnode-ib-86 has 40 cores and 8 are in use, so at most 40 - 8 = 32 cores are available
# for example, use srun to submit the job
srun -p cpu-share -N 1 -n 32 -J testMixedNode -w hhnode-ib-86 srunCommand.sh
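The subtraction above (40 - 8 = 32) can be scripted when several jobs share the node. The following is a sketch only: `free_cores` is a hypothetical name, and it assumes the squeue output format used in this section, where CPUS is the fourth field and the node name is the last field. It counts only jobs whose node list is exactly the given node, so jobs that span several nodes are not handled.

```shell
# Hypothetical helper: print the number of free cores on one node.
# Reads squeue-style text on stdin; $1 = node name, $2 = total cores of that node.
# Sums the CPUS column (4th field) over rows whose last field equals the node name.
free_cores() {
    awk -v node="$1" -v total="$2" '
        NR > 1 && $NF == node { used += $4 }
        END { print total - used }
    '
}

# Usage on the cluster (not run here):
# squeue -o "%.18i %.9P %.5D %.5C %.8j %.8u %.6g %.2t %.10M %R" | free_cores hhnode-ib-86 40
```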

4 View/cancel the job

First, get your job ID with squeue, then cancel the job with scancel:

# use squeue to get the job ID
squeue -u yourUserName

# Output:
1025035 gpu-share     1     1      jobName1   yourUserName R    3:03:31 hhnode-ib-145
1025034 gpu-share     1     1      jobName2   yourUserName R    9:57:32 hhnode-ib-145


# cancel the job
scancel 1025035 1025034
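When there are many jobs, copying the IDs by hand is error-prone. As a sketch, the first column of the squeue output can be extracted and passed on to scancel. The function name `job_ids` is hypothetical, and it assumes plain numeric job IDs, so array-style IDs such as 123_4 would be skipped.

```shell
# Hypothetical helper: print the job IDs (first column) found in squeue-style text on stdin.
# Only rows whose first field is purely numeric are kept, which also skips the header line.
job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Usage on the cluster (not run here); review the list before cancelling:
# squeue -u yourUserName | job_ids
# squeue -u yourUserName | job_ids | xargs scancel
```

If the intent is to cancel every job you own, `scancel -u yourUserName` does so directly without listing the IDs first.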