
The PBS job scheduler

Using the PBS job scheduler

To run an application, the user launches a job on one of the OSC systems.  A job contains both the details of the processing to carry out (name and version of the application, input and output, etc.) and directives for the computer resources needed (number of cpus, amount of memory).

Jobs are run as batch jobs, i.e. in an unattended manner.  Typically, a user logs in on one of the OSC systems, sends a job to the execution queue and often logs out.

Jobs are managed by a job scheduler, a piece of software which is in charge of

  • allocating the computer resources requested for the job,
  • running the job and
  • reporting back to the user the outcome of the execution.

Running a job involves at the minimum the following steps

  • preparing a submission script and
  • submitting the job to execution.

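The OSC uses a commercial job scheduler called PBS Professional.  This guide describes basic job submission and monitoring for PBS Professional:

  • preparing a PBS submission script,
  • PBS job submission directives,
  • submitting jobs with the command qsub,
  • monitoring jobs with the command qstat,
  • deleting jobs with the command qdel and
  • PBS environment variables.

In addition, some more advanced topics are covered:

  • PBS array jobs,
  • PBS jobs using scratch disk space and
  • PBS jobs with conditional execution.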

 


Preparing a PBS submission script

A submission script is a shell script that

  • describes the processing to carry out (e.g. the application, its input and output, etc.) and
  • requests computer resources (number of cpus, amount of memory) to use for processing.

Suppose we want to run a molecular dynamics MPI application called foo with the following requirements

  • the run uses 32 processes,
  • the job will not run for more than 100 hours,
  • the job is given the name "protein123" and
  • the user should be emailed when the job starts and stops or aborts.

The number of processors available on each cluster node is 8, so a total of 4 nodes are required.  Supposing no input needs to be specified, the following PBS submission script should do the job

#!/bin/bash

# set the number of nodes and processes per node
#PBS -l select=4:mpiprocs=8

# set max wallclock time
#PBS -l walltime=100:00:00

# set name of job
#PBS -N protein123

# mail alert at (b)eginning, (e)nd and (a)bortion of execution
#PBS -m bea

# send mail to the following address
#PBS -M my.address@dept.ox.ac.uk

# use submission environment
#PBS -V

# start job from the directory it was submitted
cd $PBS_O_WORKDIR

# define MPI host details
. enable_hal_mpi.sh

# run through the mpirun launcher
mpirun $MPI_HOSTS foo

The script starts with #!/bin/bash (also called a shebang), which makes the submission script also a Linux bash script.

The script continues with a series of lines starting with #.  For bash scripts these are all comments and are ignored.  For PBS, the lines starting with #PBS are directives requesting job scheduling resources.  (NB: it's very important that you put all the PBS directives at the top of a script, before any other commands are used; any #PBS directive coming after a bash script command is ignored by PBS!)

The final part of a script is normal Linux bash scripting and describes the set of operations to follow as part of the job.  In this case, this involves running the MPI-based application foo through the MPI utility mpirun.

Here are in-detail examples of PBS submission scripts:

  • per system (Hal and Sal, Caribou, Skynet, Arcus) [...under construction...]
  • per application.

 


PBS job submission directives

Directives are job specific requirements given to the job scheduler.

The most important directives are those that request resources.  The most common are the wallclock time limit (the maximum time the job is allowed to run) and the number of processors required to run the job.  For example, to run an MPI job with 16 processes for up to 100 hours on a cluster with 8 cores per compute node, the PBS directives are

#PBS -l walltime=100:00:00
#PBS -l select=2:mpiprocs=8

A job submitted with these requests runs for 100 hours at most; after this limit expires, the job is terminated regardless of whether the processing finished or not.  Normally, the wallclock time should be conservative, allowing the job to finish normally (and terminate) before the limit is reached.

Also, the job is allocated two compute nodes (select=2) and each node is scheduled to run 8 MPI processes (mpiprocs=8).  It is the task of the user to instruct mpirun to use this allocation appropriately, i.e. to start 16 processes which are mapped to the 16 cores available for the job.  More information on how to run MPI applications can be found in this guide.
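The node count in the select directive follows from simple arithmetic: the number of MPI processes divided by the number of cores per node, rounded up.  A minimal sketch of that calculation is given below; pbs_select is a hypothetical helper introduced only for illustration, and it assumes that every node runs one process per core.

```shell
#!/bin/bash

# Hypothetical helper: print the select directive for a given number of
# MPI processes and cores per node (assumes one process per core).
pbs_select() {
    local nprocs=$1 cores=$2
    # integer division, rounded up
    local nodes=$(( (nprocs + cores - 1) / cores ))
    echo "#PBS -l select=${nodes}:mpiprocs=${cores}"
}

pbs_select 16 8   # the 16-process example above
pbs_select 32 8   # the 32-process run of the earlier script
```

When the process count is not a multiple of the core count, the request rounds up to whole nodes and therefore provides more slots than processes.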

 


Submitting jobs with the command qsub

Supposing you already have a PBS submission script ready (call it submit.sh), the job is submitted to the execution queue with the command qsub submit.sh.  The queueing system prints a number (the job id) almost immediately and returns control to the Linux prompt.  At this point the job is already in the submission queue.

Once you have submitted the job it will sit in a pending queue for some time (how long depends on the demands of your job and the demand on the service).  You can monitor the progress of the job using the command qstat.

Once the job has run you will see files with names like "job.e1234" and "job.o1234", either in your home directory or in the directory you submitted the job from (depending on how your job submission script is written).  The ".o" files contain the "standard output", which is essentially what the application you ran would normally have printed onto the screen.  The ".e" files contain any error messages issued by the application; after a correct execution without errors, this file can be empty.
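Since qsub prints the job id on standard output, a wrapper script can capture it and work out the names of the files to expect.  The sketch below assumes that the job id has the form number.server (as in the qstat listing of this guide) and that the output files follow the name.oNNNN and name.eNNNN pattern mentioned above; the function name expected_files is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: given a job name and the id printed by qsub
# (assumed to look like 1234.hal), print the expected output file names.
expected_files() {
    local name=$1 jobid=$2
    local number=${jobid%%.*}   # strip the server suffix: 1234.hal -> 1234
    echo "${name}.o${number}"
    echo "${name}.e${number}"
}

# Typical use on a system where qsub is available:
#   JOBID=$(qsub submit.sh)
#   expected_files protein123 "$JOBID"
expected_files protein123 1234.hal
```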

All the options for qsub are described in the Linux manual, which can be read with the command man qsub.

 


Monitoring jobs with the command qstat

qstat is the main command for monitoring the state of systems, groups of jobs or individual jobs.  The simple qstat command gives a list of jobs which looks something like this:

Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1121.hal          jobName1         bob               15:45:05 R priorityq       
1152.hal          jobName2         mary              12:40:56 R workq       
1226.hal          jobName3         steve                    0 Q workq

The first column gives the job ID, the second the name of the job (specified by the user in the submission script) and the third the owner of the job.  The fourth column gives the elapsed time for each particular job.  The fifth column is the status of the job (R=running, Q=waiting, E=exiting, H=held, S=suspended).  The last column is the queue for the job (a job scheduler can manage different queues serving different purposes).

Some other useful qstat features include:

  • -u for showing the status of all the jobs of a particular user, e.g. qstat -u bob for user bob;
  • -p for showing time as percentage of the wallclock requested in the submission script;
  • -i for restricting the listing to jobs that are queued, held or waiting; the status of a particular job is shown by passing its id to qstat, e.g. qstat 1121 for the job with the id 1121.

All the options for qstat are described in the Linux manual, which can be read with the command man qstat.
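The tabular output of qstat is easy to post-process with standard tools.  As a sketch, the function below (count_states, a hypothetical name) reads a listing in the format shown above on standard input and counts the jobs in each state; it assumes two header lines, the state in the fifth column and job names without spaces.

```shell
#!/bin/bash

# Hypothetical helper: count jobs per state from qstat output on stdin.
# Assumes two header lines and the job state in the fifth column.
count_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on the sample listing from this guide; on a live system
# the input would come from:  qstat | count_states
count_states <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1121.hal          jobName1         bob               15:45:05 R priorityq
1152.hal          jobName2         mary              12:40:56 R workq
1226.hal          jobName3         steve                    0 Q workq
EOF
```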

 


Deleting jobs with the command qdel

Use the qdel command to delete a job, e.g. qdel 1121 to delete the job with id 1121.  Users can delete their own jobs at any time, whether the job is pending (waiting in the queue) or running, but cannot delete the jobs of another user.  Normally, there is a (small) delay between the execution of the qdel command and the time when the job is dequeued and killed.  Occasionally a job may not delete properly, in which case the OSC support team can delete it.

 


PBS environment variables

At the time a job is launched into execution, PBS defines multiple environment variables, which can be used from within the submission script to define the correct workflow of the job.  The most useful of these environment variables are the following:

  • PBS_O_WORKDIR, which points to the directory where the qsub command is issued,
  • PBS_NODEFILE, which points to a file that lists the hosts (compute nodes) on which the job is run,
  • PBS_JOBID, which is a unique number PBS assigns to a job and
  • TMPDIR, which points to a directory on the scratch (local and fast) disk space that is unique to a job.

PBS_O_WORKDIR is typically used at the beginning of a script to go to the directory where the qsub command was issued, which is frequently also the directory containing the input data for the job, etc.  The typical use is

cd $PBS_O_WORKDIR

used inside a submission script.

PBS_NODEFILE is typically used to define the environment for the parallel run, for mpirun in particular.  Normally, this usage is hidden from users inside a script (e.g. enable_hal_mpi.sh), which defines the environment for the user.

PBS_JOBID is useful to tag job specific files and directories, typically output files or run directories.  For instance, the submission script line

myApp > $PBS_JOBID.out

runs the application myApp and redirects the standard output to a file whose name is given by the job id.  (NB: the job id is a number assigned by PBS and differs from the character string name given to the job in the submission script by the user.)
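These variables are often combined.  The sketch below creates a run directory tagged with the job id inside the submission directory; the function name make_run_dir and the run_ prefix are hypothetical, and the fallback values are there only so that the snippet can also be tried outside a PBS job.

```shell
#!/bin/bash

# Hypothetical helper: create and print a per-job run directory inside
# the submission directory. The defaults after ":-" are only used when
# the script is tried outside PBS, where the variables are unset.
make_run_dir() {
    local workdir=${PBS_O_WORKDIR:-$PWD}
    local jobid=${PBS_JOBID:-interactive}
    local rundir="$workdir/run_$jobid"
    mkdir -p "$rundir" && echo "$rundir"
}

# Typical use inside a submission script (myApp as in the example above):
#   cd "$(make_run_dir)" && myApp > "$PBS_JOBID.out"
```

Keeping each run in its own directory avoids successive jobs overwriting each other's output files.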

TMPDIR is the name of a scratch disk directory unique to the job.  The scratch disk space has faster access than the disk space where the user home and data areas reside and benefits applications that have a sustained and large amount of I/O.  Typically, such a job involves copying the input files to the scratch space, running the application on scratch and copying the results to the submission directory.  This usage is discussed in a separate section.

 


PBS array jobs

Arrays are a feature of PBS which allows you to submit a series of jobs using a single submission command described by a single submission script.  A typical use of this is the need to batch process a large number of very similar jobs, which have similar input and output.

A job array is a single job with a list of sub-jobs.  To submit an array job, use the -J flag to describe a range of sub-job indices.  For example

qsub -J 1-100 script.sh

submits a job array whose sub-jobs are indexed from 1 to 100.  Also,

qsub -J 100-200 script.sh

submits a job array whose sub-jobs are indexed from 100 to 200.  Furthermore,

qsub -J 100-200:2 script.sh

submits a job array whose sub-jobs are indexed from 100 to 200 with a step of 2, i.e. the indices are 100, 102, 104, etc.

The typical submission script for a job array uses the index of each sub-job to define the task specific to that sub-job, e.g. the name of the input file or of the output directory.  The sub-job index is given by the PBS variable PBS_ARRAY_INDEX.  To illustrate its use, consider an application myApp that processes files named input_*.dat (taken as input), with * ranging from 1 to 100.  This processing is described in a single submission script called submit.sh, which contains the following line

myApp < input_$PBS_ARRAY_INDEX.dat > output_$PBS_ARRAY_INDEX.dat

A job array is submitted using this script, with the command qsub -J 1-100 submit.sh.  When a sub-job is executed, the file names in the line above are expanded using the sub-job index, with the result that each sub-job processes a unique input file and outputs the result to a unique output file.
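Before submitting a large array it can be useful to check which indices a -J range expands to and which files each sub-job will touch.  A sketch follows, assuming the seq utility is available; the function names array_indices and subjob_files are hypothetical, and the fallback index 1 only serves to try the snippet outside PBS.

```shell
#!/bin/bash

# Hypothetical helper: list the indices of a range START-END[:STEP],
# mirroring the index ranges given to qsub -J in the examples above.
array_indices() {
    local start=$1 end=$2 step=${3:-1}
    seq "$start" "$step" "$end"
}

# Hypothetical helper: print the input and output file of one sub-job.
# Without an argument, the index is taken from PBS_ARRAY_INDEX.
subjob_files() {
    local i=${1:-${PBS_ARRAY_INDEX:-1}}
    echo "input_$i.dat output_$i.dat"
}

array_indices 100 200 2 | head -n 3   # first indices of -J 100-200:2
subjob_files 7
```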

 


PBS jobs using scratch disk space

On Arcus (not Arcus-b) at present, the use of scratch space (pointed to by the variable TMPDIR) does not offer any performance advantage over the disk space pointed to by $DATA.  Users are therefore advised to avoid using the scratch space on the ARC resources.  An infrastructure upgrade is planned, after which performant storage can be used as fast-access scratch disk space from within jobs.

 


PBS jobs with conditional execution

It is possible to start a job on the condition that another one completes beforehand; this may be necessary for instance if the input to one job is generated by another job. Job dependency is defined in PBS using the -W flag.

To illustrate with an example, suppose you need to start a job using the script second_job.sh after another job has finished successfully. Assume the first job is started using the script first_job.sh and that the command to start the first job

qsub first_job.sh

returns the job ID 7777. Then, the command to start the second job is

qsub -W depend=afterok:7777 second_job.sh

This job dependency can be further automated (possibly to be included in a bash script) using environment variables:

JOB_ID_1=$(qsub first_job.sh)
JOB_ID_2=$(qsub -W depend=afterok:$JOB_ID_1 second_job.sh)
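
The same idea extends to a chain of more than two jobs.  The sketch below submits the scripts given as arguments in order, each depending on the successful completion of the previous one; submit_chain and third_job.sh are hypothetical names, and the function assumes that qsub prints only the job id on standard output, as in the lines above.

```shell
#!/bin/bash

# Hypothetical helper: submit the given scripts as a chain in which each
# job starts only after the previous one has finished successfully.
submit_chain() {
    local prev="" script id
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # first job of the chain: no dependency
            id=$(qsub "$script") || return 1
        else
            # later jobs wait for the previous job to end without error
            id=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$script -> $id"
        prev=$id
    done
}

# Typical use on a system where qsub is available:
#   submit_chain first_job.sh second_job.sh third_job.sh
```

The function stops at the first failed submission, so that later jobs are not queued without the dependency they rely on.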