Using the new ARC/HTC Environment

Contents

 

Introduction

Operating System

Cluster Description

Scheduler

Application Software & Modules

Storage

 

Introduction

This page provides information to help users transition from the ARCUS-B/HTC clusters to the new ARC/HTC systems. There are a significant number of changes affecting how you interact with the new systems - submission scripts from ARCUS-B/HTC will not work on the new ARC/HTC without modification - so it is important to read this information in full, especially the known issues section, which will be updated frequently.
 
 

Operating System

The new ARC/HTC systems both use CentOS 8.1 as their main operating system. This is an upgrade from CentOS 6.6 on ARCUS-B and CentOS 7.7 on ARCUS-HTC. See the Application Software & Modules section for details on the software environment.

Note that Singularity is installed as part of the operating system; there is no longer a requirement to load it as a separate module.
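For example, the singularity command should be available in your path without loading any modules:

singularity --version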

 

Cluster Description

ARC operates two compute clusters: arc, our parallel workloads system and the replacement for ARCUS-B, and htc, our high-throughput cluster and the replacement for ARCUS-HTC. The main differences between them are:

arc (replacement for ARCUS-B)

Description: Our largest compute cluster. Optimised for large parallel jobs spanning multiple nodes. Scheduler prefers large jobs. Offers a low-latency interconnect (Mellanox HDR 100).

Login node: arc-login

Compute nodes: CPU: 48-core Cascade Lake (Intel Xeon Platinum 8268 CPU @ 2.90GHz); Memory: 392GB

Minimum job size: 1 core

Notes: Non-blocking island size is 2212 cores.

htc (replacement for ARCUS-HTC)

Description: Optimised for single-core jobs and for SMP jobs up to one node in size. Scheduler prefers small jobs. Also caters for jobs requiring resources other than CPU cores (e.g. GPUs).

Login node: htc-login

Compute nodes: CPUs: a mix of Broadwell, Haswell, and Cascade Lake; GPUs: P100, V100, A100, RTX; Novel architectures: KNL

Minimum job size: 1 core

Notes: Jobs will only be scheduled onto a GPU node if they request a GPU resource.
 

 

Node Types

Previously, on ARCUS-B/HTC, login nodes were used for the preparation and submission of batch jobs and also for pre/post-processing of application data. On ARC/HTC, different node types are used for these workflows:

 

Login nodes

These are only to be used for accessing the cluster and submitting jobs. They are not designed for building software or running computational work: they do not have the same CPU architecture as the cluster nodes, and they do not run the same operating system as the cluster nodes. Please use the interactive nodes for any software builds (see below).

We have an explicit policy that user processes may use a maximum of 1 hour of CPU time on login nodes.

 

Interactive nodes

These nodes should be used for pre/post processing of data and for building software to be used on the ARC clusters. See the section on interactive jobs for more information.

 

Compute nodes

Typical arc compute nodes have 48 cores and 375GB of memory per node available to jobs. This is an increase over what was available on ARCUS-B.

Access

The clusters can be accessed by SSH connection to the login nodes (arc-login or htc-login) from the Oxford University network (including VPN).
Access from outside the University network is via the ARC SSH gateway server: gateway.arc.ox.ac.uk.
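As an illustration - assuming the login nodes resolve as arc-login.arc.ox.ac.uk and htc-login.arc.ox.ac.uk (check your account details if unsure) - typical connections look like:

# From within the University network (or VPN):
ssh <username>@arc-login.arc.ox.ac.uk

# From outside the University network, hopping via the gateway (-J needs OpenSSH 7.3+):
ssh -J <username>@gateway.arc.ox.ac.uk <username>@arc-login.arc.ox.ac.uk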

 

Scheduler

The ARC/HTC system uses SLURM as its resource manager (or scheduler). This is the same system used on ARCUS-B and ARCUS-HTC, so users will be familiar with its commands and submission script syntax.

As a reminder, to do work on ARC's clusters you will need to submit a job to the job scheduler; the login nodes are for preparing and submitting scheduler jobs and should not be used for performing computational work. If you need to run interactive computational work, such as pre/post-processing data or building your own code, this must be performed on the interactive nodes.

Unlike on ARCUS-B, nodes on ARC/HTC are not allocated exclusively to jobs; jobs are allocated the requested number of cores and may share nodes with other jobs. The default number of cores allocated is 1 (as is the default number of nodes), and the default amount of memory per CPU is 8000MB. You will not be able to use resources you have not requested in your job submission; this includes memory and CPU cores.
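For example, if a single-core job needs more memory than the 8000MB default, you could request it explicitly (the 16G figure is purely illustrative):

#SBATCH --mem-per-cpu=16G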

Thus if you need more than 1 CPU core, you will need to explicitly ask for them. At its simplest this can be specified by requesting a specific number of tasks, e.g.:

#SBATCH --ntasks-per-node=8

to request 8 tasks.

For MPI job submissions this would normally be changed to asking for a number of nodes and specifying the number of tasks per node, e.g.:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48

to request two nodes with 48 tasks each.

For a hybrid MPI/OpenMP job, where each MPI task spawns multiple CPU threads, the job specification also needs to state how many CPUs per task are required. For example, to request 2 nodes with two MPI tasks per node, each starting 24 compute threads, you would request:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24

The default number of CPUs per task is 1.
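For a hybrid job such as the one above, you would typically also tell the OpenMP runtime how many threads each task may start. A common pattern (generic SLURM practice rather than anything ARC-specific; my_hybrid_app is a placeholder) is:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./my_hybrid_app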

It is possible to request exclusive access to a node by adding "--exclusive" to your sbatch command or the following line to your submit script:

#SBATCH --exclusive

However, we strongly advise being specific about required resources rather than using exclusive node access; homogeneity of resources (or CPU features) cannot be assumed. This is especially true on the htc system, or when submitting to multiple clusters (see the Job Scheduling section).

This is a very short overview; SLURM offers various ways to specify resource requirements; please see 'man sbatch' for details.

 

Partitions

Both clusters have the following time-based scheduling partitions available:

  • short (default run time 1hr, maximum run time 12hrs)
  • medium (default run time 12hrs, maximum run time 48hrs)
  • long (default run time 24hrs, no run time limit)
  • devel (maximum run time 10 minutes - for batch job testing only) 
  • interactive (default run time 1hr, maximum run time 4hrs, can oversubscribe, for pre/post-processing and building software)

Jobs in the short and medium partitions are scheduled with higher priority than those in the long partition; however, they will not be able to run for longer than the time allowed on those partitions.

On the previous clusters (ARCUS-B, ARCUS-HTC), users who wanted to submit long-running jobs needed to submit them to the scheduler with an acceptable time limit and then, once a job had started running, request that its wall time be extended. On the new ARC/HTC clusters this is no longer required; users can submit jobs with long time limits to the long partition. Note: the default time limit on the long partition is 1 day; users must specify a time limit if a longer runtime is required.
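For example, to ask for a five-day wall time on the long partition (the duration is purely illustrative):

#SBATCH --partition=long
#SBATCH --time=5-00:00:00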

We will no longer extend jobs.

The htc cluster has an additional partition available named legacy. This partition contains a number of nodes which have CentOS 7.7 installed in order to maintain compatibility with some legacy commercial applications. Access to the legacy partition is restricted to users with a requirement to use legacy software; it will be enabled by the ARC team for specific users once it has been demonstrated that using a more recent version of the application is not possible.
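If you have been granted access, a job targeting these nodes might be submitted as follows (assuming the legacy partition is selected like any other partition):

#SBATCH --clusters=htc
#SBATCH --partition=legacy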

Job Scheduling

By default, jobs will be scheduled based upon the login node you are using: if you are logged into arc-login, jobs you submit will be queued on the arc cluster; if you are logged into htc-login, jobs will be queued on the htc cluster.

However, both clusters are accessible from either login node; the target cluster can be specified by passing the --clusters=arc or --clusters=htc SLURM option. Additionally, squeue can report the status of jobs on either cluster or both (using the option --clusters=all).
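For example, to list your own jobs across both clusters (replace <username> with your ARC username):

squeue --clusters=all -u <username>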

It is possible for jobs to target either cluster, or both clusters, using the --clusters option in job scripts, for example:

#SBATCH --clusters=arc
or
#SBATCH --clusters=htc
or
#SBATCH --clusters=all

If submitted with --clusters=all, a job will simply run on the first available resource, regardless of which cluster that is on.

Submission Scripts

As an example, to request two compute nodes running 48 processes per node (using MPI), with one CPU per task (the default), 2GB of memory per CPU, and a two-hour wall time, the following submission script would be used:

#!/bin/bash 
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=48
#SBATCH --mem-per-cpu=2G
#SBATCH --time=02:00:00 
#SBATCH --job-name=myjob 
#SBATCH --partition=short 

module load mpitest/1.0

mpirun mpihello

 

To request a single core for 10 minutes, with one task on the node (and one CPU per task), requiring 8GB memory, a typical submission script would be:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --job-name=single_core
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=8G
#SBATCH --partition=short

module purge
module load testapp/1.0

#Calculate number of primes from 2 to 10000
prime 2 10000


Interactive Jobs
An interactive job logs you in to a compute node and gives you a shell. This allows you to interact with the node in real time, much as you would with a desktop PC or the login nodes. We now expect users to use interactive jobs for pre/post-processing and software build activities - and there are nodes dedicated to these tasks.

To start an interactive session, you need to use the srun command, e.g.

srun -p interactive --pty /bin/bash

or for a session that allows graphical interfaces (via X forwarding):

srun -p interactive --x11 --pty /bin/bash

This allocates one core on an interactive node and logs you in, giving you a shell on that node. Multiple cores, memory, or other resources can be requested in the same way as for sbatch.
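For instance, to ask for four cores and 16GB of memory for an interactive session (values chosen purely for illustration):

srun -p interactive --cpus-per-task=4 --mem=16G --pty /bin/bash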

Exiting the shell ends the job. The job will also be aborted once it exceeds its time limit.

GPU Resources
GPUs are only available on compute nodes which are part of the htc cluster. These resources are requested using the gres SLURM directive in your submission script.

 

The most basic way you can access a GPU is by requesting a GPU device using the gres option in your submission script:

#SBATCH --gres=gpu:1

The above will request a single GPU device (of any type) - this is the same as the method previously used on ARCUS-B/HTC. Note that - as with CPUs and memory - you will only be able to see the number of GPUs you requested.
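From inside a running job you can check which devices were actually allocated, for example with:

nvidia-smi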
 

You may also request a specific type of GPU device, for example:

#SBATCH --gres=gpu:v100:1

To request one V100 device, or:

#SBATCH --gres=gpu:rtx8000:2

To request two RTX8000 devices. Available devices are P100, V100, RTX (Titan RTX), RTX8000, and A100.

Alternatively you can request a GPU (--gres=gpu:1) and specify the type via a constraint on the GPU SKU, GPU generation, or GPU compute capability:

#SBATCH --gres=gpu:1 --constraint='gpu_sku:V100'

#SBATCH --gres=gpu:1 --constraint='gpu_gen:Pascal'

#SBATCH --gres=gpu:1 --constraint='gpu_cc:3.7'

#SBATCH --gres=gpu:1 --constraint='gpu_mem:32GB'

#SBATCH --gres=gpu:1 --constraint='nvlink:2.0'

Configured GPU related constraints are:

gpu_gen: GPU generation (Pascal, Volta, Turing, Ampere)
gpu_sku: GPU model (P100, V100, RTX, RTX8000, A100)
gpu_cc: CUDA compute capability
gpu_mem: GPU memory
nvlink: device has NVLink - the constraint exists both as a simple flag (-C nvlink) and with a version (-C 'nvlink:2.0')

 

For details on available options/combinations see the table of available GPUs.

Please note that co-investment GPU nodes are limited to the short partition, i.e. the maximum job run time is 12 hours. No such restrictions apply to ARC-owned GPUs. See the table of available GPUs for more information.

 

Application Software & Modules

The ARC/HTC software environment comprises a mixture of commercial applications, software built using the EasyBuild framework and software built using our own local build recipes. As with ARCUS-B/HTC we use the environment modules system (via the module command) to load applications into the environment on ARC/HTC.

The application module names have changed on ARC/HTC. You will therefore need to look up the new module name so that you can include it in your submission script. The best way to search for an application is by using the module spider command. For example, to search for the GROMACS application:

 

module spider gromacs

------------------------------------------------------------------------------------------------------------------------------
  GROMACS:
------------------------------------------------------------------------------------------------------------------------------
    Description:
      GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for
      systems with hundreds to millions of particles. This is a CPU only build, containing both MPI and threadMPI builds.

     Versions:
        GROMACS/2020-fosscuda-2019b
        GROMACS/2020.4-foss-2020a-PLUMED-2.6.2
        GROMACS/2020.4-foss-2020a

------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "GROMACS" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider GROMACS/2020.4-foss-2020a
------------------------------------------------------------------------------------------------------------------------------

 

The module spider command gives you a list of available GROMACS packages. Please note, module spider is NOT case-sensitive for searching, so:

module spider GROMACS
module spider gromacs
module spider Gromacs

... are all equivalent. However, when loading the module using module load, you must use the correct case, e.g.

 

module load GROMACS/2020.4-foss-2020a

You will find more detailed advice on how to run some of the more popular applications under the applications & software section of our support pages.

You can also build your own software in your home or data directories using one of the compilers provided (which are also available through the environment modules system). Typically, the compiler toolchains, including maths libraries and MPI, can be loaded using the modules named foss (e.g. foss/2020a) for free open-source software (i.e. GCC) or intel (e.g. intel/2020a) for the Intel compiler suite. For example, loading foss/2020a will include the following modules:

module load foss/2020a
module list

Currently Loaded Modules:
  1) GCCcore/9.3.0                 4) GCC/9.3.0                      7) libxml2/2.9.10-GCCcore-9.3.0     10) OpenMPI/4.0.3-GCC-9.3.0   13) FFTW/3.3.8-gompi-2020a
  2) zlib/1.2.11-GCCcore-9.3.0     5) numactl/2.0.13-GCCcore-9.3.0   8) libpciaccess/0.16-GCCcore-9.3.0  11) OpenBLAS/0.3.9-GCC-9.3.0  14) ScaLAPACK/2.1.0-gompi-2020a
  3) binutils/2.34-GCCcore-9.3.0   6) XZ/5.2.5-GCCcore-9.3.0         9) hwloc/2.2.0-GCCcore-9.3.0        12) gompi/2020a               15) foss/2020a
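With a toolchain loaded, building your own MPI code might look like this (my_mpi_app.c is a placeholder for your own source file):

mpicc -O2 -o my_mpi_app my_mpi_app.c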

 

Some commercial applications only support CentOS 8.1 at their latest release level, so only the newest versions of these applications are loaded in the main software repository on ARC/HTC. To give some backwards compatibility we have a small number of CentOS 7.7 nodes which can use older releases of the affected applications (also see Partitions section).

Storage

Users have a $HOME area with a 15GB quota; this is the directory you are in when you log in. Users also have a $DATA area, which shares a 5TB quota with your project colleagues, and per-job $SCRATCH and $TMPDIR areas for temporary data/work files. $TMPDIR is local to a compute node; $SCRATCH is on a shared file system and is available to all nodes in a job if a job spans multiple nodes.
 
Both $SCRATCH and $TMPDIR are not persistent; they will be automatically removed on job exit. It is important that your job copies all files into your $DATA area before it exits; we will not be able to recover your data if it is left on $SCRATCH or $TMPDIR when a job finishes.
 
As a rule, we recommend that you use your $DATA area to store your data, but utilise the per-job $SCRATCH or $TMPDIR area, especially for intermediate or temporary files. Generally, you would copy all required input data at the start of your job and then copy results back to your $DATA area.
 
A simple example of how to do this would be:
 
#!/bin/bash

cd $SCRATCH || exit 1

rsync -av $DATA/myproject/input ./
rsync -av $DATA/myproject/bin ./ 

module load foss/2020b

mpirun ./bin/my_software

rsync -av --exclude=input --exclude=bin ./ $DATA/myproject/

This example copies the directories '$DATA/myproject/input' and '$DATA/myproject/bin' into $SCRATCH (which will then contain the directories 'input' and 'bin'); runs './bin/my_software'; and copies all files in the $SCRATCH directory - excluding the directories 'input' and 'bin' - back to $DATA/myproject/ once the mpirun finishes.

 
For more details on where you can store files, please see our Storage page.