
The Torque job scheduler

Using the Torque job scheduler

To run an application, the user launches a job on one of the ARC systems.  A job contains both the details of the processing to carry out (name and version of the application, input and output, etc.) and directives for the computer resources needed (number of CPUs, amount of memory).

Jobs are run as batch jobs, i.e. in an unattended manner.  Typically, a user logs in on one of the ARC systems, sends a job to the execution queue and often logs out.

Jobs are managed by a job scheduler, a piece of software which is in charge of

  • allocating the computer resources requested for the job,
  • running the job and
  • reporting back to the user the outcome of the execution.

Running a job involves at the minimum the following steps

  • preparing a submission script and
  • submitting the job to execution.

The ARC systems use a job scheduler called Torque Resource Manager, an advanced open-source product based on the original PBS project.  Torque integrates with the Maui Workload Manager in order to improve overall utilisation, scheduling and administration on a cluster.

This guide describes basic job submission and monitoring for Torque, namely:

  • preparing a submission script,
  • PBS job submission directives,
  • submitting jobs with the command qsub,
  • monitoring jobs with the command qstat and
  • deleting jobs with the command qdel.

In addition, some more advanced topics are covered:

  • Torque environment variables,
  • array jobs,
  • using scratch disk space and
  • jobs with conditional execution.

 


Preparing a submission script

A submission script is a shell script that

  • describes the processing to carry out (e.g. the application, its input and output, etc.) and
  • requests the computer resources (number of CPUs, amount of memory) to use for the processing.

Suppose we want to run a molecular dynamics MPI application called foo with the following requirements

  • the run uses 32 processes,
  • the job will not run for more than 100 hours,
  • the job is given the name "protein123" and
  • the user should be emailed when the job starts and stops or aborts.

Assuming each cluster node has 16 cores, a total of 2 nodes is required to map one MPI process to each physical core.  Supposing no input needs to be specified, the following submission script runs the application in a single job:

#!/bin/bash

# set the number of nodes and processes per node
#PBS -l nodes=2:ppn=16

# set max wallclock time
#PBS -l walltime=100:00:00

# set name of job
#PBS -N protein123

# mail alert at start, end and abortion of execution
#PBS -m bea

# send mail to this address
#PBS -M john.brown@gmail.com

# use submission environment
#PBS -V

# start job from the directory it was submitted
cd $PBS_O_WORKDIR

# define MPI host details (ARC specific script)
. enable_arcus_mpi.sh

# run through the mpirun launcher
mpirun $MPI_HOSTS foo

The script starts with #!/bin/bash (also called a shebang), which makes the submission script also a Linux bash script.

The script continues with a series of lines starting with #.  For bash scripts these are all comments and are ignored.  For Torque, the lines starting with #PBS are directives requesting job scheduling resources.  (NB: it's very important that you put all the directives at the top of a script, before any other commands; any #PBS directive coming after a bash script command is ignored!)

The final part of a script is normal Linux bash scripting and describes the set of operations to follow as part of the job.  In this case, this involves running the MPI-based application foo through the MPI utility mpirun.

The resource request #PBS -l nodes=n:ppn=m is the most important of the directives in a submission script.  The first part (nodes=n) is imperative and determines how many compute nodes the scheduler allocates to a job.  The second part (ppn=m) is used by the scheduler to prepare the environment for an MPI parallel run with m processes on each compute node (e.g. writing a hostfile for the job, pointed to by $PBS_NODEFILE).  However, it is up to the user and the submission script to use the environment generated from ppn adequately.  In the example above, this is done by first sourcing enable_arcus_mpi.sh (which uses $PBS_NODEFILE to prepare the variable $MPI_HOSTS) and then running the application through mpirun.

A note of caution concerns threaded single-process applications (e.g. Gaussian and Matlab).  They cannot run on more than a single compute node; requesting more (e.g. #PBS -l nodes=2) results in the first node being used and the rest sitting idle.  Moreover, since ppn has no automatic effect on such runs, the only relevant resource request for single-process applications is #PBS -l nodes=1.  This gives the job user-exclusive access to a single compute node, allowing the application to use all available cores and physical memory on the node; these vary from system to system, see the Table below.

System    Cores per node    Memory per node
Arcus     16                64GB
Caribou   16                128GB
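For instance, a minimal submission script for such a threaded application might look like the sketch below; the application name bar is a placeholder, and the OMP_NUM_THREADS setting applies only to OpenMP codes:

#!/bin/bash

# request a single compute node (the job has exclusive use of all its cores)
#PBS -l nodes=1

# set max wallclock time
#PBS -l walltime=24:00:00

# start job from the directory it was submitted
cd $PBS_O_WORKDIR

# use all the cores available on the node (bar is a placeholder threaded application)
export OMP_NUM_THREADS=$(nproc)
bar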

Examples of Torque submission scripts are given here for some of the more popular applications.

 


PBS job submission directives

Directives are job specific requirements given to the job scheduler.

The most important directives are those that request resources.  The most common are the wallclock time limit (the maximum time the job is allowed to run) and the number of processors required to run the job.  For example, to run an MPI job with 32 processes for up to 100 hours on a cluster with 16 cores per compute node, the PBS directives are

#PBS -l walltime=100:00:00
#PBS -l nodes=2:ppn=16

A job submitted with these requests runs for 100 hours at most; after this limit expires, the job is terminated regardless of whether the processing finished or not.  Normally, the wallclock time should be conservative, allowing the job to finish normally (and terminate) before the limit is reached.

Also, the job is allocated two compute nodes (nodes=2) and each node is scheduled to run 16 MPI processes (ppn=16).  (ppn is an abbreviation of Processes Per Node.)  It is the task of the user to instruct mpirun to use this allocation appropriately, i.e. to start 32 processes which are mapped to the 32 cores available for the job.  More information on how to run MPI applications can be found in this guide.
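On systems without a helper script like enable_arcus_mpi.sh, a common pattern is to point mpirun at the host file written by Torque.  The following is only a sketch, as the exact flags vary between MPI implementations (-np and -machinefile are assumed here):

# count the entries in the host file written by Torque
NPROCS=$(wc -l < $PBS_NODEFILE)

# launch one MPI process per allocated core
mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./foo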

 


Submitting jobs with the command qsub

Supposing you already have a submission script ready (call it submit.sh), the job is submitted to the execution queue with the command qsub submit.sh.  The queueing system prints a number (the job id) almost immediately and returns control to the Linux prompt.  At this point the job is already in the submission queue.
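For example, the interaction may look like this (the job id and server name are illustrative):

qsub submit.sh
1234.headnode1

The first line is the command typed by the user; the second is the job id printed by the queueing system.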

Once you have submitted the job it will sit in a pending queue for some time (how long depends on the demands of your job and the demand on the service).  You can monitor the progress of the job using the command qstat.

Once the job has run you will see files with names like "job.e1234" and "job.o1234", either in your home directory or in the directory you submitted the job from (depending on how your job submission script is written).  The ".o" file contains the "standard output", which is essentially what the application you ran would normally have printed onto the screen.  The ".e" file contains any error messages issued by the application; on a correct execution without errors, this file can be empty.

Read all the options for qsub in the Linux manual using the command man qsub.

 


Monitoring jobs with the command qstat

qstat is the main command for monitoring the state of systems, groups of jobs or individual jobs.  The simple qstat command gives a list of jobs which looks something like this:

Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1121.headnode1    jobName1         bob               15:45:05 R priorityq       
1152.headnode1    jobName2         mary              12:40:56 R workq       
1226.headnode1    jobName3         steve                    0 Q workq

The first column gives the job ID, the second the name of the job (specified by the user in the submission script) and the third the owner of the job.  The fourth column gives the elapsed time for each particular job.  The fifth column is the status of the job (R=running, Q=waiting, E=exiting, H=held, S=suspended).  The last column is the queue for the job (a job scheduler can manage different queues serving different purposes).

Some other useful qstat features include:

  • -u for showing the status of all the jobs of a particular user, e.g. qstat -u bob for user bob;
  • -n for showing the nodes allocated by the scheduler for a running job;
  • -i for showing the status of a particular job, e.g. qstat -i 1121 for job with the id 1121.

Read all the options for qstat in the Linux manual using the command man qstat.

 


Deleting jobs with the command qdel

Use the qdel command to delete a job, e.g. qdel 1121 to delete the job with id 1121.  A user can delete their own jobs at any time, whether the job is pending (waiting in the queue) or running.  A user cannot delete the jobs of another user.  Normally, there is a (small) delay between the execution of the qdel command and the time when the job is dequeued and killed.  Occasionally a job may not delete properly, in which case the ARC support team can delete it.

 


Environment variables

At the time a job is launched into execution, Torque defines multiple environment variables, which can be used from within the submission script to define the correct workflow of the job.  The most useful of these environment variables are the following:

  • PBS_O_WORKDIR, which points to the directory where the qsub command is issued,
  • PBS_NODEFILE, which points to a file that lists the hosts (compute nodes) on which the job is run,
  • PBS_JOBID, which is a unique number PBS assigns to a job and
  • TMPDIR, which points to a directory on the scratch (local and fast) disk space that is unique to a job.
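A quick way to inspect these values is to print them from within a job; the following is a minimal, illustrative submission script:

#!/bin/bash
#PBS -l nodes=1
#PBS -l walltime=00:05:00

echo "Submission directory: $PBS_O_WORKDIR"
echo "Job id:               $PBS_JOBID"
echo "Scratch directory:    $TMPDIR"
echo "Allocated hosts:"
cat $PBS_NODEFILE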

PBS_O_WORKDIR is typically used at the beginning of a script to go to the directory where the qsub command was issued, which is frequently also the directory containing the input data for the job, etc.  The typical use is

cd $PBS_O_WORKDIR

inside a submission script.

PBS_NODEFILE is typically used to define the environment for the parallel run, for mpirun in particular.  Normally, this usage is hidden from users inside a script (e.g. enable_arcus_mpi.sh), which defines the environment for the user.

PBS_JOBID is useful to tag job specific files and directories, typically output files or run directories.  For instance, the submission script line

myApp > $PBS_JOBID.out

runs the application myApp and redirects the standard output to a file whose name is given by the job id.  (NB: the job id is a number assigned by Torque and differs from the character string name given to the job in the submission script by the user.)

TMPDIR is the name of a scratch disk directory unique to the job.  The scratch disk space typically has faster access than the disk space where the user home and data areas reside and benefits applications that have a sustained and large amount of I/O.  Such a job normally involves copying the input files to the scratch space, running the application on scratch and copying the results to the submission directory.  This usage is discussed in a separate section.
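As an illustration of that pattern, the following sketch stages a single (placeholder) input file to scratch, runs a placeholder application there and copies the results back:

# copy the input from the submission directory to the per-job scratch space
cd $PBS_O_WORKDIR
cp input.dat $TMPDIR/

# run the application on scratch
cd $TMPDIR
myApp < input.dat > output.dat

# copy the results back before the job ends (the scratch directory may be removed afterwards)
cp output.dat $PBS_O_WORKDIR/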

 


Array jobs

Arrays are a feature of Torque that allows users to submit a series of jobs using a single submission command and a single submission script.  A typical use is the batch processing of a large number of very similar jobs with similar input and output, e.g. a parameter sweep study.

A job array is a single job with a list of sub-jobs.  To submit an array job, use the -t flag to describe a range of sub-job indices.  For example

qsub -t 1-100 script.sh

submits a job array whose sub-jobs are indexed from 1 to 100.  Also,

qsub -t 100-200 script.sh

submits a job array whose sub-jobs are indexed from 100 to 200.  Furthermore,

qsub -t 100,200,300 script.sh

submits a job array whose sub-jobs indices are 100, 200 and 300.

The typical submission script for a job array uses the index of each sub-job to define the task specific to that sub-job, e.g. the name of the input file or of the output directory.  The sub-job index is given by the PBS variable PBS_ARRAYID.  To illustrate its use, suppose the application myApp processes some files named input_*.dat (taken as input), with * ranging from 1 to 100.  This processing is described in a single submission script called submit.sh, which contains the following line

myApp < input_$PBS_ARRAYID.dat > output_$PBS_ARRAYID.dat

A job array is submitted using this script, with the command qsub -t 1-100 submit.sh.  When a sub-job is executed, the file names in the line above are expanded using the sub-job index, with the result that each sub-job processes a unique input file and outputs the result to a unique output file.
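Put together, submit.sh for this example might look like the following sketch (the resource requests and job name are placeholders):

#!/bin/bash

# resources for each sub-job (placeholders)
#PBS -l nodes=1
#PBS -l walltime=01:00:00

# name of the job array
#PBS -N sweep

cd $PBS_O_WORKDIR

# each sub-job processes the input file matching its index
myApp < input_$PBS_ARRAYID.dat > output_$PBS_ARRAYID.dat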

Once submitted, all the array sub-jobs in the queue can be monitored using the extra -t flag to qstat.
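For example, reusing the user name from the qstat listing earlier (bob is hypothetical):

qstat -t -u bob

With -t, the individual sub-jobs are listed rather than a single collapsed entry for the whole array.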

 


Using scratch disk space

At present, the use of scratch space (pointed to by the variable TMPDIR) does not offer any performance advantages over the disk space pointed to by $DATA.  Users are therefore advised to avoid using the scratch space on the ARC resources.  We have plans for an infrastructure upgrade, after which performant storage will be available as fast-access scratch disk space from within jobs.

 


Jobs with conditional execution

It is possible to start a job on the condition that another one completes beforehand; this may be necessary, for instance, if the input to one job is generated by another job.  Job dependency is defined using the -W flag.

To illustrate with an example, suppose you need to start a job using the script second_job.sh after another job has completed.  Assume the first job is started using the script first_job.sh and that the command to start it

qsub first_job.sh

returns the job ID 7777. Then, the command to start the second job is

qsub -W depend=after:7777 second_job.sh

This job dependency can be further automated (possibly to be included in a bash script) using environment variables:

JOB_ID_1=`qsub first_job.sh`
JOB_ID_2=`qsub -W depend=after:$JOB_ID_1 second_job.sh`

Furthermore, the conditional execution above can be changed so that the second job starts only if the execution of the first was successful.  This is achieved by replacing after with afterok, e.g.

JOB_ID_2=`qsub -W depend=afterok:$JOB_ID_1 second_job.sh`

Conditional submission (as well as conditional submission after successful execution) is also possible with job arrays.  This is useful, for example, to submit a "synchronization" job (script sync_job.sh) after the successful execution of an entire array of jobs (defined by array_job.sh).  The conditional execution uses afterokarray instead of afterok:

JOB_ARRAY_ID=`qsub -t 2-6 array_job.sh`
JOB_SYNC_ID=`qsub -W depend=afterokarray:$JOB_ARRAY_ID sync_job.sh`