
Arcus GPU

Using the Slurm job scheduler


This guide is for the arcus-gpu system, which runs the job scheduler Slurm.  Slurm is an advanced open-source scheduler used on many of the most powerful computers in the world.

Introduction

To run an application on the arcus-gpu system, the user submits a job to the Slurm scheduler from the login node.  A job contains both the details of the processing to carry out (name and version of the application, input and output, etc.) and directives for the computer resources needed (number of CPUs, GPUs, etc.).

Jobs are run as batch jobs, i.e. in an unattended manner.  Typically, a user logs in on the arcus-gpu login node, prepares a job, sends it to the execution queue and logs out.

Jobs are managed by Slurm, which is in charge of

  • allocating the computer resources requested for the job,
  • running the job and
  • reporting the outcome of the execution back to the user.

Running a job involves, at the minimum, the following steps

  • preparing a submission script and
  • submitting the job to execution.

This guide describes basic job submission and monitoring for Slurm, with particular emphasis on using the GPU resources efficiently.  The topics in the guide are:

  • Commands
  • Preparing a submission script
  • Slurm partitions
  • Slurm job submission directives
  • Submitting jobs with the command sbatch
  • Submitting interactive jobs
  • Monitoring jobs with the command squeue
  • Deleting jobs with the command scancel
  • Environment variables


Commands

The table below gives a short description of the most used Slurm commands.

command  description
sacct    report job accounting information about active or completed jobs
salloc   allocate resources for a job in real time (typically used to allocate resources and spawn a shell, in which the srun command is used to launch parallel tasks)
sbatch   submit a job script for later execution (the script typically contains one or more srun commands to launch parallel tasks)
scancel  cancel a pending or running job
sinfo    report the state of partitions and nodes managed by Slurm (it has a variety of filtering, sorting, and formatting options)
squeue   report the state of jobs (it has a variety of filtering, sorting, and formatting options); by default, it lists the running jobs in priority order followed by the pending jobs in priority order
srun     submit a job for execution in real time
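
As a quick illustration of a typical workflow (submit.sh is a placeholder name for a submission script, and 1234 a placeholder job ID):

sbatch submit.sh    # submit the job; Slurm prints the job ID
squeue -u $USER     # monitor the state of your jobs in the queue
scancel 1234        # cancel the job, if needed
sacct -j 1234       # report accounting information for the job

Each of these commands is described in more detail in the sections below.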


Preparing a submission script

A submission script is a shell script that

  • describes the processing to carry out (e.g. the application, its input and output, etc.) and
  • requests computer resources (number of cpus, number of GPUs, amount of memory, etc.) to use for processing.

The simplest case is that of a job that requires a single node and a single GPU, with the following requirements:

  • the job uses 1 node,
  • the application is a single process,
  • the application is accelerated by a GPU device,
  • the job will run for no more than 100 hours,
  • the job is given the name "test123" and
  • the user should be emailed when the job starts and stops or aborts.

Example 1: job running on a single node

Supposing the application is called appGPU and takes no command line arguments, the following submission script runs the application in a single job:

#!/bin/bash

# set the partition where the job will run
#SBATCH --partition=k20

# set the number of nodes
#SBATCH --nodes=1

# set the number of GPU cards to use per node
#SBATCH --gres=gpu:1

# set max wallclock time
#SBATCH --time=100:00:00

# set name of job
#SBATCH --job-name=test123

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=john.brown@gmail.com

# run the application
appGPU

The script starts with #!/bin/bash (also called a shebang), which makes the submission script a Linux bash script.

The script continues with a series of lines starting with #, which represent bash script comments.  For Slurm, the lines starting with #SBATCH are directives that request job scheduling resources.  (Note: it is very important that you put all the directives at the top of the script, before any other commands; any #SBATCH directive coming after a bash script command is ignored!)

The first directive states the partition where the job is going to run.  A partition is a collection of nodes, and can also be seen as a queue (more details about partitions in the Slurm partitions section below).  Note: jobs must specify a partition; Slurm is configured without any DEFAULT partition, thus forcing users to specify one.

The resource request #SBATCH --nodes=n determines how many compute nodes are allocated to the job; only 1 node is allocated for this job.  A note of caution applies to threaded single-process applications (e.g. Matlab): these cannot run on more than a single compute node, so allocating more (e.g. #SBATCH --nodes=2) leaves the first node busy and the rest idle.

An easy way to give a job exclusive access to a single compute node (allowing the application to use all available cores, physical memory and GPU cards on the node) is to add #SBATCH --exclusive.  Alternatively,

  • the number of cores per node can be specified with #SBATCH --ntasks-per-node=t, where t is the number of cores required per node, and
  • the number of GPU cards can be specified with #SBATCH --gres=gpu:g, which allocates g cards per node and sets up the (local) environment variable CUDA_VISIBLE_DEVICES that points to the allocated devices -- possible values of the variable on a 2-GPU-card node are "0", "1" and "0,1" (see the short example after this list).
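
As an illustration, a resource-request block combining these directives (the values are placeholders to adjust to your needs) might read:

# request 4 cores and both GPU cards on a single k20 node
#SBATCH --partition=k20
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:2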

The maximum walltime is specified by #SBATCH --time=T, where T has format h:m:s.  Normally, a job is expected to finish before the specified maximum walltime.  Once the walltime reaches the maximum, the job is terminated regardless of whether the job processes are still running or not.

The name of the job can be specified too with #SBATCH --job-name="name".

Lastly, an email notification is sent if an address is specified with #SBATCH --mail-user=<email_address>.  The notification options can be set with #SBATCH --mail-type=<type>, where <type> may be BEGIN, END, FAIL, REQUEUE or ALL (for any change of job state).

The final part of the script is normal Linux bash scripting and describes the set of operations to carry out as part of the job.  The job starts in the same folder from which it was submitted (unless an alternative path is specified), and with the same environment variables (modules, etc.) that the user had at the time of submission.  In this example, this final part only involves invoking the appGPU application executable.

Example 2: job running on multiple nodes

As a second example, suppose we want to run an MPI application called appGPU with the following requirements:

  • the run uses 2 nodes,
  • each node runs 2 processes and each process is accelerated by one GPU device,
  • the job will not run for more than 100 hours,
  • the job is given the name "test123" and
  • the user should be emailed when the job starts and stops or aborts.

Supposing no input needs to be specified, the following submission script runs the application in a single job:

#!/bin/bash

# set the partition where the job will run
#SBATCH --partition=k20

# set the number of nodes
#SBATCH --nodes=2

# set the number of tasks (processes) per node
#SBATCH --ntasks-per-node=2

# set the number of GPU cards per node
#SBATCH --gres=gpu:2

# set max wallclock time
#SBATCH --time=100:00:00

# set name of job
#SBATCH --job-name=test123

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=john.brown@gmail.com

# the job starts in the directory it was submitted from,
# using the environment at submission time

# run the application through the srun launcher
srun -n $SLURM_NTASKS appGPU

In large part, the script above is similar to the one for a single node job.  In this example, #SBATCH --ntasks-per-node=m is used to reserve m cores per node (when not specified, a single core is allocated) and to prepare the environment for an MPI parallel run with m processes on each compute node.

While an MPI application can still be launched via the more familiar mpirun launcher, Slurm provides a simpler mechanism through the srun command.  srun takes as argument the total number of MPI processes to start (provided by SLURM_NTASKS) and starts instances of appGPU distributed over the allocated compute nodes according to the --ntasks-per-node=m request.  The scheduler takes care of the launch details automatically.

The preferred MPI implementation on arcus-gpu is MVAPICH2.
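
As a sketch, assuming appGPU was built against MVAPICH2 and is on the user's PATH, the last line of the script above could be written either way:

# launch with the Slurm-native launcher
srun -n $SLURM_NTASKS appGPU

# or with the familiar MPI launcher
mpirun -n $SLURM_NTASKS appGPU

In both cases the processes are placed on the allocated nodes according to the --ntasks-per-node request.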


Slurm partitions

Partitions on arcus-gpu are collections of nodes.  Partitions are created to group nodes that share a single hardware specification, so that jobs executing on the nodes of a partition can expect symmetric processing, with the same performance on all nodes.

Currently (September 2014), arcus-gpu has four partitions called k10, k20, devel and k40. The k10 partition has 6 nodes, each of which has a single CPU (6 hyperthreaded cores), 64GB RAM and 2 Nvidia K10 cards (effectively 4 GPU units per node). The k20 partition has 7 nodes, each of which has a single CPU (6 hyperthreaded cores), 64GB of RAM and 2 Nvidia K20 cards. The devel partition is a single compute node with the same specification as the k20 nodes but with a maximum walltime of 1 hour. The k40 partition has 4 nodes, each of which has two CPUs (12 hyperthreaded cores in total), 64GB RAM and 2 Nvidia K40 cards. The following table (containing data from http://www.nvidia.com/object/tesla-servers.html and https://developer.nvidia.com/cuda-gpus) describes the characteristics of each GPU card.

                                  Tesla K40        Tesla K20m      Tesla K10
Number and type of GPU            1 Kepler GK110B  1 Kepler GK110  2 Kepler GK104s
Memory size (GDDR5, without ECC)  12 GB            5 GB            8 GB
CUDA cores                        2880             2496            2 x 1536
CUDA compute capability           3.5              3.5             3.0

In future, arcus-gpu may have other partitions with other accelerators.  The following table summarises all four partitions.

partition  #nodes  CPU/node                          cores/node (with hyperthreading)  mem/node  cards/node               interconnect
k40        4       2x Intel Xeon E5-2620 v2 @2.1GHz  24                                64GB      2x K40                   FDR InfiniBand
k20        7       1x Intel Xeon E5-1650 @3.20GHz    12                                64GB      2x K20m                  FDR InfiniBand
devel      1       1x Intel Xeon E5-1650 @3.20GHz    12                                64GB      2x K20m                  FDR InfiniBand
k10        6       1x Intel Xeon E5-1650 @3.20GHz    12                                64GB      2x K10 G2 (4 GPU units)  FDR InfiniBand
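
The partition configuration can also be inspected directly with sinfo; for example (the format string is just one possible choice):

sinfo --partition=k20 --format="%P %D %c %m %G"

This prints the partition name, the number of nodes, the CPUs per node, the memory per node and the generic resources (the GPU cards).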

Because most applications use a single GPU card, and more than one card per node is available, the nodes of the different partitions are configured in shared mode, meaning that several jobs can run on the same node at the same time as long as resources are available.  The one resource that is not shared is the physical CPU core.  Thus, a 6-core hyperthreaded node, seen by the operating system as having 12 cores in total (such as the k20 and k10 partition nodes), can fit at most 4 jobs with the request #SBATCH --ntasks-per-node=3.  If a user wants exclusive access to a node, the directive #SBATCH --exclusive can be used.

Note: jobs on the arcus-gpu system must specify a partition.  Slurm is configured without any DEFAULT partition, thus forcing users to specify one.


Slurm job submission directives

Directives are job specific requirements given to the job scheduler.

The most important directives are those that request resources.  The most common are the wallclock time limit (the maximum time the job is allowed to run) and the number of processors required to run the job.  For example, to run an MPI job with 16 processes for up to 100 hours on a cluster with 8 cores per compute node, the Slurm directives are

#SBATCH --time=100:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

A job submitted with these requests runs for 100 hours at most; after this limit expires, the job is terminated regardless of whether the processing finished or not.  Normally, the wallclock time should be conservative, allowing the job to finish normally (and terminate) before the limit is reached.

Also, the job is allocated two compute nodes (--nodes=2) and each node is scheduled to run 8 MPI processes (--ntasks-per-node=8).  Again, using MVAPICH2, the application can be launched through mpirun -n $SLURM_NTASKS <mpi-executable> and the scheduler will take care of placing each process onto the correct node.

Where available, GPU cards can be requested too. The following directive:

#SBATCH --gres=gpu:2

will request two GPU cards per node.

For example, the k20 partition is composed of nodes with 2 K20 GPU cards per node.  In order to submit a job to this partition, the user should specify:

#SBATCH --partition=k20
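
Putting these requests together, the directive block for the 16-process example above, submitted to the k20 partition, would read:

#SBATCH --partition=k20
#SBATCH --time=100:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:2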


Submitting jobs with the command sbatch

Supposing you already have a submission script ready (call it submit.sh), the job is submitted to the execution queue with the command sbatch submit.sh.  The queueing system prints a number (the job ID) almost immediately and returns control to the Linux prompt.  At this point the job is in the submission queue.

Once you have submitted the job, it will sit in a pending queue for some time (how long depends on the demands of your job and the work load of the service).  You can monitor the progress of the job using the command squeue (see below).

Once the job starts to run, you will see files with names such as slurm-1234.out, either in the directory you submitted the job from (default behaviour) or in the directory the script was explicitly instructed to cd to.  These files contain the output the application normally prints to the screen (standard output), mixed with any error messages issued by the application (standard error).  Separate files for standard output and standard error can be requested using

#SBATCH --error=myRecord.err
#SBATCH --output=myRecord.out

Additionally, patterns can be used in the file names.
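
For example, the pattern %j expands to the job ID, so the following directives produce uniquely named files for each job:

#SBATCH --error=myRecord-%j.err
#SBATCH --output=myRecord-%j.out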

Almost every Slurm directive has a single-letter alternative (for example, -e can be used instead of --error).  The long names were used in this guide for clarity.  Read all the options for sbatch in the Linux manual using the command man sbatch.


Submitting interactive jobs

To test (or debug) an application using a GPU card, an interactive session can be requested with a command such as:

salloc -pk20 --gres=gpu:1 srun --gres=gpu:1 --pty --preserve-env /bin/bash -l

This makes a specific request for a GPU card on the k20 partition and, once the allocation is made, lands the user directly on a node from that partition with a free GPU card.  You should adjust the number of GPU cards, as well as the number of nodes and CPUs, to your needs.
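
Once on the node, you can verify which card was allocated; the allocation is reflected in CUDA_VISIBLE_DEVICES (as described above), and nvidia-smi is the standard NVIDIA tool for inspecting the cards:

echo $CUDA_VISIBLE_DEVICES   # the GPU card(s) allocated to this job
nvidia-smi                   # show the state of the GPU cards on the node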


Monitoring jobs with the command squeue

squeue is the main command for monitoring the state of systems, groups of jobs or individual jobs.

The command squeue prints the list of current jobs.  The list looks something like this:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 2497       k20 test1.14      bob  R       0:07      1 k20n002
 2499       k20 test1.35     mary  R       0:22      4 k20n[003-006]
 2511       k20 ask.for.    steve PD       0:00      4 (Resources)

The first column gives the job ID, the second the queue or partition where the job was submitted, the third the name of the job (specified by the user in the submission script) and the fourth the owner of the job.  The fifth is the status of the job (R=running, PD=pending, CA=cancelled, CF=configuring, CG=completing, CD=completed, F=failed).  The sixth column gives the elapsed time for each particular job.  Finally, there are the number of nodes requested and the nodelist where the job is running (or the reason why it is not running).

Some other useful squeue features include:

  • -u for showing the status of all the jobs of a particular user, e.g. squeue -u bob for user bob;
  • -l for showing more of the available information;
  • --start to report the expected start time of pending jobs.
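
These options can be combined; for instance (bob is a placeholder username, and -p filters by partition):

squeue -u bob --start    # expected start times of bob's pending jobs
squeue -l -p k20         # long listing of the jobs in the k20 partition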

Read all the options for squeue on the Linux manual using the command man squeue, including how to personalize the information to be displayed.


Deleting jobs with the command scancel

Use the scancel command to delete a job, e.g. scancel 1121 to delete the job with ID 1121.  A user can delete his/her own jobs at any time, whether the job is pending (waiting in the queue) or running; a user cannot delete the jobs of another user.  Normally, there is a (small) delay between the execution of the scancel command and the time the job is dequeued and killed.  Occasionally a job may not delete properly, in which case the ARC support team can delete it upon request.
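
For example (1121 and bob are placeholder values):

scancel 1121      # cancel the job with ID 1121
scancel -u bob    # cancel all jobs belonging to user bob (your own only)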


Environment variables

At the time a job is launched into execution, Slurm defines multiple environment variables, which can be used from within the submission script to define the correct workflow of the job.  The most useful of these environment variables are the following:

  • SLURM_SUBMIT_DIR, which points to the directory where the sbatch command is issued;
  • SLURM_JOB_NODELIST, which returns the list of nodes allocated to the job;
  • SLURM_JOB_ID, which is a unique number Slurm assigns to a job.

In most cases, SLURM_SUBMIT_DIR does not have to be used, as the job starts by default in the directory where the sbatch command was issued.  This behaviour of Slurm is in contrast with other schedulers, such as Torque, which start jobs in the home directory of the user account.  SLURM_SUBMIT_DIR can be useful in a submission script when files must be copied to/from a specific directory that is different from the directory where the sbatch command was issued.

SLURM_JOB_ID is useful to tag job specific files and directories, typically output files or run directories.  For instance, the submission script line

myApp > $SLURM_JOB_ID.out

runs the application myApp and redirects the standard output to a file whose name is given by the job ID.  The job ID is a number assigned by Slurm and differs from the character string name given to the job in the submission script by the user.
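
As an illustrative sketch (myApp and the directory layout are placeholders), these variables can be combined in a submission script to create and use a per-job run directory:

# create a run directory named after the job ID
RUNDIR=$SLURM_SUBMIT_DIR/run-$SLURM_JOB_ID
mkdir -p $RUNDIR
cd $RUNDIR

# record where the job ran and redirect the output to a job-specific file
echo "job $SLURM_JOB_ID running on: $SLURM_JOB_NODELIST"
myApp > $SLURM_JOB_ID.out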