ARC Service Level Agreements

Background

Advanced Research Computing (ARC) specialises in parallel high performance computing. Researchers should be aware that ARC systems are intended foremost for running applications which are parallel (or could be made parallel) and/or CPU intensive and/or use large amounts of memory.

The provision of high performance computing facilities at Oxford, through the ARC, is supported by a hybrid funding model: indirect charges fund the basic provision of the service, and large users of the facility contribute additional direct costs from their grants. The aim is to provide a basic service for all users, with a higher level of service for the large users who are paying directly (e.g. faster turnaround time and priority application support).

As a result of this funding model, the ARC must have a clearly stated and controlled usage policy that results in well-defined and guaranteed service level agreements (SLAs). This will be achieved through a resource allocation software stack and the implementation of a detailed resource allocation policy.

Service Levels

The ARC offers a range of service levels with different features. Each is associated with a quality of service (QoS) definition in the job scheduler, which enforces different job priorities, maximum job sizes and maximum run times.

Groups may wish to run some jobs (for projects with funding) as SL1 and others (with no funding) as SL3. This is easily achieved by opening a separate account for each SL.

Service Level 1 – Priority Usage

Service level 1 (SL1) operates with the highest quality of service, QoS1, and is designed for groups which require sustained amounts of computer time and are able to pay for that time through research grant income.

Credits are purchased in advance and remain available for use until exhausted. When SL1 users exhaust their credits they move down to SL3, unless more credits are purchased. Projects whose credit is running low will be notified so that they can top up their accounts in time.
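The credit rule above can be sketched in a few lines of Python. This is only an illustration of the stated policy, not the ARC's accounting software: the `Project` class, the function names and the `LOW_CREDIT_THRESHOLD` value are all hypothetical, and the policy does not say at what balance a project counts as "low".

```python
from dataclasses import dataclass

# Hypothetical value: the policy does not define when credit counts as "low".
LOW_CREDIT_THRESHOLD = 100.0


@dataclass
class Project:
    name: str
    credits: float  # remaining prepaid credits (unit is an assumption)


def effective_service_level(project: Project) -> str:
    """SL1 while prepaid credits remain; otherwise the project drops to SL3."""
    return "SL1" if project.credits > 0 else "SL3"


def needs_low_credit_notice(project: Project,
                            threshold: float = LOW_CREDIT_THRESHOLD) -> bool:
    """True when some credit is left but the balance is at or below the threshold."""
    return 0 < project.credits <= threshold
```

For example, under these assumptions a project with 50 credits left would still run as SL1 but would be due a top-up notice, while a project with no credits would run as SL3.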

Service Level 2 – Teaching & Development

Service level 2 (SL2) operates with the medium quality of service, QoS2, and is designed to offer access for specific, defined activities.

The teaching service level applies to courses that the ARC is running for users. A number of nodes will be manually allocated for members of the teaching group.

Development work comprises small, short jobs required for testing codes.

Service Levels 3 and 4 – Standard Usage

Service levels 3 and 4 (SL3, SL4) operate with the lowest qualities of service, QoS3 and QoS4 respectively. SL3 is designed for users who currently do not have funding to pay for their direct usage. SL3 jobs will be scheduled around QoS1 and QoS2 jobs, and only jobs with smaller core counts can be run. Users relying on SL3 to run a job should expect longer wait times.

SL4 provides for users with no funding or departmental support and offers a minimal service. Users relying on SL4 will wait the longest for jobs to start.

QoS descriptions

QoS 1 – highest quality of service

QoS 1 (priority QoS) is associated with projects that fund ARC access time directly from grants or equivalent. QoS 1 is characterised as follows:

  • jobs have the highest priority and will move through the queue fastest;
  • jobs have a default maximum job run time of 120 hours. This can be extended on request;
  • users will also have priority access to the ARC Scientific Computing advisor.

QoS 2 – teaching & development

QoS 2 supports code development efforts and teaching loads. It is associated with a dedicated SLURM partition. QoS 2 targets

  • jobs requesting <= 10 minutes of wall time.

QoS 3 – Standard quality of service

This represents the QoS in which the vast majority of ARC users are found. It is characterised by

  • jobs have a lower priority and jobs will be scheduled around priority QoS jobs;
  • jobs have a maximum job run time of 120 hours;
  • user access will be managed through a fairshare policy.

QoS 4 – Basic 'free' quality of service

The lowest level of service is characterised by

  • jobs have a lower priority and jobs will be scheduled around QoS 1, 2 & 3 jobs in the queue;
  • jobs have a maximum job run time of 24 hours;
  • users may run at most 1 job at a time;
  • user access will be managed through a fairshare policy.
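The run-time and concurrency limits listed in the QoS descriptions can be collected into a small lookup table. The Python sketch below is a hypothetical pre-submission check, not the ARC's scheduler configuration (the scheduler enforces the real limits itself); `QosLimits`, `QOS_LIMITS` and `check_request` are illustrative names, and only limits stated in this document are included.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class QosLimits:
    max_walltime: timedelta
    max_concurrent_jobs: Optional[int] = None  # None: no limit stated in this policy


# Values taken from the QoS descriptions in this document.
QOS_LIMITS = {
    "QoS1": QosLimits(timedelta(hours=120)),  # default; can be extended on request
    "QoS2": QosLimits(timedelta(minutes=10)),
    "QoS3": QosLimits(timedelta(hours=120)),
    "QoS4": QosLimits(timedelta(hours=24), max_concurrent_jobs=1),
}


def check_request(qos: str, walltime: timedelta, running_jobs: int = 0) -> List[str]:
    """Return a list of limit violations for a job request (empty if none)."""
    limits = QOS_LIMITS[qos]
    problems = []
    if walltime > limits.max_walltime:
        problems.append(
            f"wall time {walltime} exceeds the {qos} limit of {limits.max_walltime}"
        )
    if (limits.max_concurrent_jobs is not None
            and running_jobs >= limits.max_concurrent_jobs):
        problems.append(
            f"{qos} allows at most {limits.max_concurrent_jobs} concurrent job(s)"
        )
    return problems
```

A call such as `check_request("QoS4", timedelta(hours=48), running_jobs=1)` would report both the run-time and the concurrency violation, whereas a 12-hour QoS 4 request with no job already running would return an empty list.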

Availability and planned maintenance

General availability

Every reasonable effort will be made to keep ARC resources available and operational 24 hours per day and 7 days per week.

Please note, however, that although the support personnel will do their best to keep the facility running at all times, we cannot guarantee prompt resolution of problems outside UK office hours, including weekends and public holidays. Nevertheless, please notify support@arc.ox.ac.uk of issues whenever they arise.

Exceptional maintenance and unplanned disruptions

Despite best efforts, it may become necessary to reduce or withdraw service at short notice and/or outside the planned maintenance time slot. This may happen, for example, for environmental reasons such as an air conditioning or power failure, or in an emergency where immediate shutdown is required to save equipment or data.

We expect such situations to be rare. When they occur, service will be restored as rapidly as possible. Please see our system status page for the latest updates.