This guide is intended to give an overview of what is needed to compile and run MPI software on the ARC cluster systems.
The guide shows how to
- compile an MPI application,
- prepare a job submission script and
- submit the job.
MPI stands for Message Passing Interface, an interface standard that defines a number of library routines aimed at the programming of message-passing (distributed-processing) applications. The interface specifications were designed by a group of researchers from both academia and industry and cover bindings for C, C++ and Fortran.
Being standardised, MPI programming leads to highly portable code. However, the MPI standard has many library implementations (both commercial and open source), and the quality and performance of these libraries can differ significantly.
Any MPI library implementation has a number of tools that help programmers build and run MPI applications. The main tools are:
- compiler utilities and
- an application run agent.
Compiler utilities (mpicc, mpicxx, mpif77, mpif90) are used to compile and link MPI programs. These are not compilers as such but wrappers around back-end compilers (e.g. the GNU or Intel compilers) and are designed to make compiling and linking against the MPI library easy.
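The wrappers can be asked to print the underlying compiler command they issue, which is useful when diagnosing build problems. The exact option differs between implementations; -show (MVAPICH2, Intel MPI) and --showme (OpenMPI) are the usual ones:
mpicc -show      # MVAPICH2 and Intel MPI: print the back-end compiler command and flags
mpicc --showme   # OpenMPI equivalent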
The run agent launches and manages the execution of an MPI executable on distributed computer systems. This agent is called mpirun or mpiexec, with mpirun being the more frequently used of the two. Some implementations provide both (e.g. OpenMPI), while others replaced the mpirun of their MPI-1 implementation (e.g. MVAPICH) with mpiexec in their MPI-2 implementation (e.g. MVAPICH2).
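In its simplest form, the run agent is given the number of processes to start and the executable to run; the executable name below is only a placeholder:
mpirun -np 16 ./myapp   # start 16 MPI processes running the executable myapp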
MPI on the ARC systems
The ARC clusters have a mixture of MPI implementations installed, and this guide is intended to be independent of any particular flavour of MPI. The MPI libraries available on each cluster system are presented below.
The MPI implementations OpenMPI, MVAPICH and Intel-MPI are installed on the clusters arcus-b and arcus-htc, optimised and configured to use the InfiniBand interconnect. Each MPI implementation has several versions installed, each built with a different compiler. All installations are managed through the environment modules system.
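The installed versions can be listed through the modules system; module avail with a name prefix shows only the matching modules (mvapich2 is the prefix used in the example later in this guide, other prefixes follow the same pattern):
module avail              # list all available modules
module avail mvapich2     # list only the MVAPICH2 installations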
Preparing and Running An Example
Log in to one of the ARC clusters, create a directory in which to do some work and go to it. The sequence of commands is:
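A minimal sequence, assuming the working directory is called mpi_test (any name will do), is:
mkdir mpi_test   # create a working directory
cd mpi_test      # and move into it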
Then, copy the ARC MPI examples to your newly created directory
cp /system/software/examples/mpi/* .
Run the command ls to list the copied files. Simple C (cluster_myprog.c) and Fortran (cluster_myprog.f) MPI example codes are provided. Also, there are two submission scripts: cluster_myfirst__torque.sh and cluster_myfirst__slurm.sh. Edit and adapt the submission script for Torque or Slurm, as appropriate for the cluster on which you are running the example.
Compiling the application
The compilation and linking of an MPI program is managed by the compiler wrappers mpicc and mpif77 but performed by a back-end compiler.
arcus-b has no default MPI software stack: mpirun and mpicc are not on the default path, and to use any of the MPI library installations on arcus-b the appropriate module has to be loaded. The module for any of the MPI installations is set to load the module for the back-end compiler that was used to configure and compile it. For instance, loading the module mvapich2/2.0.1__intel-2015 automatically loads MVAPICH2 version 2.0.1 as well as the Intel compilers version 2015. The module mvapich2/2.0.1__intel-2015 is the recommended MPI library to use on arcus-b.
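For example, to load the recommended library and verify that the compiler wrappers are now on the path:
module load mvapich2/2.0.1__intel-2015   # loads MVAPICH2 and the matching Intel compilers
module list                              # confirm which modules are loaded
which mpicc                              # should now point into the MVAPICH2 installation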
After loading the necessary module, compile one of the source files. Regardless of the particular MPI library used, the command to compile and link the C source is
mpicc cluster_myprog.c -o cluster_myprog
while for the Fortran source it is
mpif77 cluster_myprog.f -o cluster_myprog
Run ls to verify the executable cluster_myprog was created.
Preparing the submission script
Edit the submission script provided (cluster_myfirst__torque.sh or cluster_myfirst__slurm.sh) to input the details of the job. The key lines to pay attention to in the script are
- the request for resources (number of nodes and walltime) and
- the mpirun command.
These details are discussed below for each cluster system.
Slurm is the job scheduler on the IBM cluster arcus-b. The essential line of the submission script is the one that launches the executable:
mpirun $MPI_HOSTS ./cluster_myprog
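For reference, a minimal sketch of the whole Slurm script follows; the #SBATCH option names are standard Slurm, but the precise contents of the ARC example script (the job name, whether the module is loaded inside the script or beforehand) are assumptions:
#!/bin/bash
# Request 2 nodes for 10 minutes, with 16 MPI tasks per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:10:00
#SBATCH --job-name=cluster_myfirst
# Load the same MPI library that was used to build the executable
module load mvapich2/2.0.1__intel-2015
# Define $MPI_HOSTS for the mpirun command (see below)
. enable_arcus-b_mpi.sh
mpirun $MPI_HOSTS ./cluster_myprog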
In this example, Slurm is instructed to allocate 2 nodes (--nodes=2) for 10 minutes (--time=00:10:00). Also, the run is scheduled with 16 MPI processes per node (--ntasks-per-node=16); this maps each MPI process to a physical core, leading to a (generally) optimal run configuration.
There are 16 physical cores per node, but this number is doubled to 32 virtual cores by Intel hyper-threading technology (see the ARC FAQ for more details). Some applications (but not all!) may benefit in performance from this technology; users are advised to experiment with their application to determine whether there is a benefit. If the application benefits from hyper-threading, 32 processes or threads per node should be used (--ntasks-per-node=32); if not, only 16 should be used (--ntasks-per-node=16).
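For instance, the only change needed in the script sketched above to try the hyper-threaded configuration is the resource request line:
#SBATCH --ntasks-per-node=32   # 32 MPI tasks per node, using the virtual cores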
The command line mpirun $MPI_HOSTS ./cluster_myprog runs the executable cluster_myprog built with the MVAPICH2 library. The environment variable MPI_HOSTS is defined in the shell script enable_arcus-b_mpi.sh, which is sourced just before the mpirun line in the submission script. MPI_HOSTS hides the details of the distributed run from the user and contains information about the number of MPI processes to run, the file listing the hosts (cluster nodes) on which to run, etc.
In fact, enable_arcus-b_mpi.sh supports MVAPICH2 as well as OpenMPI and Intel-MPI. The flag -psm is added automatically to $MPI_HOSTS.
Running the application
After having prepared the submission script, submit the job with
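On arcus-b (Slurm) this is
sbatch cluster_myfirst__slurm.sh
while on a Torque-scheduled system the equivalent is
qsub cluster_myfirst__torque.sh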
This will print a job number and return control to the Linux prompt at once. Monitor its execution using the qstat (Torque) or squeue (Slurm) command.
Checking the results
After the job has run, you should have two email notifications (one for the start of the job and one for its end, provided the script requests them) and a couple of extra files in your directory. The Slurm scheduler creates a single output file, slurm-XXXX.out.
The output file myprog.oXXXX or slurm-XXXX.out should contain the output from the execution, which can be seen by doing for example
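cat slurm-XXXX.out
where XXXX is the job number reported at submission; less or any text editor works equally well.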
The output should look like this (note that the order in which the processes report is not deterministic):
Process 2 received from process 1
Process 9 received from process 4
Process 1 received from process 0
Process 15 received from process 14
Process 11 received from process 10
Process 13 received from process 12
Process 4 received from process 3
Process 6 received from process 5
Process 12 received from process 11
Process 10 received from process 9
Process 7 received from process 6
Process 8 received from process 7
Process 0 received from process 16
Process 3 received from process 2
Process 5 received from process 4
Process 14 received from process 13
Running a threaded MPI application
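A hybrid MPI/OpenMP application runs a small number of MPI processes per node, each of which spawns several OpenMP threads. The number of threads per process is normally controlled by the OMP_NUM_THREADS environment variable, which must be set before mpirun and exported to the remote processes; with OpenMPI's mpirun the -x flag does the exporting, and IPATH_NO_CPUAFFINITY switches off the CPU affinity normally set by the PSM (InfiniPath) layer so that the threads are free to spread over the cores. A sketch of the preparation, assuming 8 threads per MPI process (both values below are only examples):
export OMP_NUM_THREADS=8        # OpenMP threads per MPI process
export IPATH_NO_CPUAFFINITY=1   # disable PSM core pinning so threads can spread out
The executable is then launched with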
$ mpirun $MPI_HOSTS -x IPATH_NO_CPUAFFINITY -x OMP_NUM_THREADS <MPI-OMP.exe>
Finally, OpenMPI versions higher than 1.8.0 automatically bind processes to cores by default. Thus, to prevent all the threads of a hybrid run from being trapped on the same core, make use of the --bind-to <item> flag to bind processes to sockets or other physical subsets of the compute node. See the mpirun man page for more details.
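For instance, a variant of the command above that binds each MPI process to a socket rather than a core (OpenMPI 1.8+ syntax; whether this is optimal depends on the application) would be
mpirun $MPI_HOSTS --bind-to socket -x OMP_NUM_THREADS <MPI-OMP.exe>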