ARCUS-HTC Reference Guide

Contents

Description of Arcus-HTC

Accessing the system

Differences between Arcus-HTC and Arcus-B

Arcus-HTC Resources

Using GPUs

Installed Applications

Using Containers

Submission Script Examples

 

What is Arcus-HTC?

The ARC High Throughput Cluster (Arcus-HTC) has been created by repurposing the original Arcus cluster (Arcus-A). This allows us to better support users with lower core-count workloads: the scheduler automatically packs users' jobs onto shared compute nodes, increasing job throughput.

Single-core jobs, and jobs using fewer cores than a whole node, account for a substantial proportion of the jobs submitted to both Arcus systems. The Arcus-HTC configuration is aimed at improving the overall throughput of these jobs (through more effective node utilisation via sharing) and at allowing more efficient usage of compute credits, as such jobs no longer have to hold a node on an exclusive basis unless this is required.

How do I access Arcus-HTC?

Using your ARC user account you can connect using the following commands (substitute your own ARC username for user1234):

ssh -X user1234@oscgate.arc.ox.ac.uk

Once successfully connected to oscgate type:

ssh -X arcus-htc
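If you connect regularly, it may be convenient to let your local SSH client make the hop through oscgate automatically. A minimal sketch of a ~/.ssh/config entry on your own machine (assuming OpenSSH 7.3 or later; substitute your own ARC username) could look like:

Host arcus-htc
    HostName arcus-htc
    User user1234
    ProxyJump user1234@oscgate.arc.ox.ac.uk
    ForwardX11 yes

With this in place, ssh arcus-htc from your local machine connects via oscgate in a single step.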

Differences between Arcus-A/B and Arcus-HTC

Partitions

The Arcus-HTC system uses the SLURM resource manager (the same as that used by Arcus-B), which will ease the transition of jobs between the two systems. The old Arcus-A system used the Torque resource manager, which had a different syntax for submission scripts.

The Arcus-HTC has a number of different "partitions" which are used to manage workloads for specific types of machine. This is like the general purpose "compute" and special "gpu" partitions on Arcus-B. On the HTC system the following partitions are available:

Partition name | Description | Wall Time Limit
htc (Default)  | CPUs | 5 Days
htc            | ARC GPU Resources - K40, K80, V100 | 5 Days
htc            | Co-investment GPU Resources - P100, V100 | 1 Day
htc-nova       | Novel architectures e.g. Intel Phi | 12 hours
htc-devel      | Development partition (testing/debugging jobs) | 10 minutes
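Jobs are directed to a partition with the --partition option in the submission script; for example, a short test run could be sent to the development partition:

#SBATCH --partition=htc-devel
#SBATCH --time=00:10:00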

Note: There is not a specific partition for GPU nodes on Arcus-HTC; GPU resources are specified using the gres resource specifier in the submission script - see the 'Using GPUs' section below.

Exclusive vs. Non-exclusive access

The HTC cluster allows non-exclusive use of nodes. Arcus-B, by contrast, is configured in node-exclusive mode: if you submit a single-core job you are allocated exclusive use of a node with 16 cores, effectively wasting 15 cores' worth of compute time. This is not an efficient use of the system, nor of your ARC credits, the consumption of which is based on this exclusive use.

Non-exclusive use means that if your job requests fewer cores than constitute a whole node, you will be allocated only that number and other users can be allocated the remaining cores. In the most extreme case, this may result in 16 different users each using a single core of a 16-core node. Non-exclusive use also means you will only consume ARC credit for the cores you are actually allocated.
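As a minimal sketch, a job that needs only two cores would request just those cores, leaving the remainder of the node available to other users (and consuming credit for two cores only):

#SBATCH --partition=htc
#SBATCH --ntasks-per-node=2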

Resource availability

Because on Arcus-HTC your job may share a node with other users, it becomes more important to request the resources that your job will actually require.

Specifying Memory Requirements

By default, each core assigned to a job is allocated an amount of RAM equal to the total memory in the node divided by the number of cores. For a 64GB machine with 16 cores this means a single-core job will be allocated 4GB of memory. If your job requires more memory, for example 12GB, you will need to request this in your submission script, e.g.

#SBATCH --mem=12288
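A plain --mem value is interpreted in megabytes, so 12288 corresponds to 12GB; a unit suffix may also be used, e.g.:

#SBATCH --mem=12G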

Please note that SLURM will terminate your job automatically if you exceed your memory limit; this is to protect other users' resource allocations.

Feature Constraints

If your job has specific requirements that can only be fulfilled by some nodes - for example, specific CPU versions or clock frequencies - these can be specified as constraints on job submissions. The syntax is:

#SBATCH --constraint=feature1&feature2

#SBATCH --constraint=feature1|feature2

#SBATCH --constraint=[feature1|feature2]

where:

'&' is a logical AND: '--constraint=feature1&feature2' means the node must have both feature1 and feature2.
'|' is a logical OR: '--constraint=feature1|feature2' means the node must have either feature1 or feature2.
'[feature1|feature2]' means that all tasks of the job must be given the same feature, but it can be either feature1 or feature2. Constraints are normally evaluated per task; the bracketed form makes all tasks in the job use the same version of the requested node feature.

Available node features are:

cpu_gen: CPU family
cpu_sku: CPU model
cpu_frq: CPU frequency
cpu_mem: total memory
knl: Knight's Landing (signals to SLURM that the request is for a KNL node and enables additional feature setting, see 'KNL features' below).

The full breakdown of currently available nodes and their resources is:

cpu_gen | cpu_sku | cpu_frq | cpu_mem | gpu_gen | gpu_sku | gpu_mem | gpu_cc | GRES gpu type | partition | no. of nodes
Haswell | E5-2640v3 | 2.60GHz | 64GB | Kepler | K40 | 12GB | 3.5 | k40m | htc | 9
Haswell | E5-2640v3 | 2.60GHz | 64GB | Kepler | K80 |  | 3.7 | k80 | htc | 5
Broadwell | E5-2640v4 | 2.4GHz | 128GB | Pascal | P100 | 24GB | 5.0 | p100 | htc | 1
SandyBridge | E5-1650 | 3.20GHz | 64GB | Pascal, Maxwell | P4, M40 | 24GB, 8GB | 5.0, 6.1 | p4, m40 | htc-nova | 1
SandyBridge | E5-2650 | 2.00GHz | 64GB | - | - | - | - | - | htc | 84
SandyBridge | E5-2650 | 2.00GHz | 128GB | - | - | - | - | - | htc | 4
Skylake | Gold_5120 | 2.20GHz | 384GB | Pascal | P100 | 16GB | 6.0 | p100 | htc | 5
Skylake | Gold_5120 | 2.20GHz | 384GB | Volta | V100 | 16GB | 7.0 | v100 | htc | 2
Knight's Landing (KNL) | 7290 | 1.50GHz | 192GB | - | - | - | - | - | htc-nova | 3

For example, to request a node that is either SandyBridge or Haswell, the request would be:

#SBATCH --constraint='cpu_gen:SandyBridge|cpu_gen:Haswell'

To request a SandyBridge node with a K40 GPU, the request would be (see also the GPU section below):

#SBATCH --gres=gpu:1
#SBATCH --constraint='cpu_gen:SandyBridge&gpu_sku:K40'

Note that asking for 'cpu_mem:128GB' will not allocate 128GB of memory to your job (nor allow it to use that much memory); it will simply place the job on a node with 128GB of total memory. For explicitly requesting job memory, see 'Specifying Memory Requirements' above.

KNL nodes

On the Knight's Landing nodes there are additional available constraints: 'knl', which simply signals to SLURM that the request is KNL related, and a range of features related to the memory and NUMA configuration - cache, hybrid, flat, a2a, snc2, snc4, hemi, quad. On the KNL nodes, requesting these will make SLURM attempt to fulfil the request even if no node is currently in the requested configuration; as part of the job, it will reconfigure the node and reboot it.
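For example, a sketch of a request for a KNL node in flat memory mode with the quadrant NUMA configuration (combining features with '&' as described above) would be:

#SBATCH --partition=htc-nova
#SBATCH --constraint='knl&flat&quad'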

Using GPUs

The Arcus-HTC cluster contains various GPU nodes; these are available by requesting appropriate GPU resources. There is no SLURM "gpu" partition on Arcus-HTC, so you do not need to specify a partition value.

The most basic way you can access a GPU is by requesting a GPU device using the --gres option in your submission script:

#SBATCH --gres=gpu:1

The above will request a single GPU device (of any type) - this is the same as the method previously used on Arcus-B.

You may also request a specific type of GPU device, for example:

#SBATCH --gres=gpu:k40m:1

to request one K40 device, or:

#SBATCH --gres=gpu:k80:2

to request two K80 devices. Available devices are K40m, K80, M40, P4, P100 and V100.

Alternatively you can request a specific GPU SKU, GPU generation, or GPU compute capability:

#SBATCH --gres=gpu:1 --constraint='gpu_sku:K40'

#SBATCH --gres=gpu:1 --constraint='gpu_gen:Kepler'

#SBATCH --gres=gpu:1 --constraint='gpu_cc:3.7'

#SBATCH --gres=gpu:1 --constraint='gpu_mem:32GB'

#SBATCH --gres=gpu:1 --constraint='nvlink:2.0'

Configured GPU related constraints are:

gpu_gen: GPU generation (Maxwell, Kepler, Pascal, Volta)
gpu_sku: GPU model (K40, K80, P100, V100)
gpu_cc: CUDA compute capability
gpu_mem: GPU memory
nvlink: device has NVLink - the constraint exists both in a simple form (-C nvlink) and in a version-specific form (-C 'nvlink:2.0')

For details on available options/combinations see the table of available GPUs.

Please note that for the P100 and V100 co-investment GPU nodes the maximum job run time is 1 day. For ARC GPU nodes the maximum partition run time (5 days) applies - see the table of available GPUs for more information.
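As an illustrative sketch, a job header targeting one V100 GPU and staying within the 1 day limit might include:

#SBATCH --time=24:00:00
#SBATCH --gres=gpu:v100:1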

Installed Applications

The centrally installed applications available for use on Arcus-HTC can be listed using the module command:

module avail

For more information on using the module command, see here.
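For example, to load the test application used in the submission script examples below:

module load testapp/1.0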

Containers

The Arcus-HTC system allows the use of Singularity containers. These can either be provided by the user or be centrally managed. To load the Singularity environment you can use the following command:

module load singularity

See the Singularity Documentation for more information on how to use the container system.
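As a minimal sketch (the image name my_container.sif and the program my_program are illustrative), a program inside a container can then be run with:

singularity exec my_container.sif ./my_program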

Please note: Docker containers are not natively supported, but these may be converted into Singularity containers and run on the HTC system.
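For example (a sketch; the exact behaviour depends on the installed Singularity version), a public Docker image can be converted into a Singularity image with:

singularity build my_container.sif docker://ubuntu:18.04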

Submission Script Examples

The scripts below all utilise a simple example application which is available on the HTC system. The aim is to demonstrate various ways of utilising the HTC service efficiently.

Single Core

This is the most basic use of the system where you have an application which only uses a single CPU core.

#!/bin/bash

#SBATCH --time=00:10:00
#SBATCH --job-name=single_core
#SBATCH --ntasks-per-node=1
#SBATCH --partition=htc

module purge
module load testapp/1.0

# Calculate number of primes from 2 to 1000000

prime 2 1000000
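The script can then be submitted and monitored as usual (assuming it has been saved as single_core.sh):

sbatch single_core.sh
squeue -u $USER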

Single Core (Job Array)

The HTC system is also particularly suited to so-called job arrays. A job array is essentially a single cluster job with multiple tasks, often used either to run multiple (identical) copies of the same program or to run a single-threaded program multiple times with varying input parameters.

An example submission script could look like:

#!/bin/bash

#SBATCH --time=00:10:00
#SBATCH --array=1-4
#SBATCH --job-name=array_job
#SBATCH --ntasks-per-node=1
#SBATCH --output=array_job%A_%a.out
#SBATCH --error=array_job%A_%a.err

### Note: in the above lines "%A" is replaced automatically by the job ID and "%a" by the array index

PARAMFILE="./prime_params"

task_parameter=$(sed -n ${SLURM_ARRAY_TASK_ID}p ${PARAMFILE})

module purge
module load testapp/1.0

prime ${task_parameter}

To make the above work you also need a parameter file; in this example it should be named "prime_params" and contain the following parameters for the "prime" application:

2 250000
250001 500000
500001 750000
750001 1000000

For further information on array jobs, see the detailed SLURM information here.

Multi-Core (Packed)

Utilising exactly the same single-core code as in the above example, we can now request four cores and run the code on each core, each working on a different range of values:

#!/bin/bash

#SBATCH --time=00:10:00
#SBATCH --job-name=packed
#SBATCH --ntasks-per-node=4
#SBATCH --partition=htc

module purge
module load testapp/1.0

# Calculate number of primes from 2 to 1000000
# Multiple single core processes (backgrounded)

prime 2 250000 &
prime 250001 500000 &
prime 500001 750000 &
prime 750001 1000000 &

# The following line is needed to wait for the backgrounded processes to complete.
wait

Multi-Core (OpenMP)

The code has now been recompiled to support OpenMP multi-threading:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --job-name=openmp
#SBATCH --ntasks-per-node=4
#SBATCH --partition=htc

module purge
module load testapp/1.0

# Set number of OpenMP threads to same as allocated cores

export OMP_NUM_THREADS=$SLURM_NTASKS_PER_NODE

# Calculate number of primes from 2 to 1000000
prime_omp 2 1000000