Compiling and Running MPI Software
Introduction
This guide is intended to give an overview of what is needed to compile and run MPI software on the ARC cluster systems.
The guide shows how to:
Compile an MPI application,
Prepare a job submission script and
Submit the job.
About MPI
MPI stands for Message Passing Interface, an interface standard that defines a number of library routines aimed at the programming of message-passing (distributed-processing) applications. The interface specifications were designed by a group of researchers from both academia and industry and cover bindings for C, C++ and Fortran.
Being standardised, MPI programming leads to highly portable code. Nevertheless, the MPI standard has many implementations in libraries (both commercial and open source software), and the quality and performance of MPI libraries can differ significantly.
Any MPI library implementation has a number of tools that help programmers build and run MPI applications. The main tools are compiler utilities and an application run agent.
The compiler utilities (mpicc, mpiicc, mpicxx, mpif77, mpif90, mpiifort) are used to compile and link MPI programs. These are not compilers as such but wrappers around back-end compilers (e.g. the GNU or Intel compilers), designed to make compiling and linking against the MPI library easy.
The run agent launches and manages the execution of an MPI executable on distributed computer systems. This agent is called mpirun or mpiexec, with mpirun being the most frequently used one.
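To see what a wrapper actually does, most MPI implementations can print the underlying compiler command instead of running it. The following is a minimal sketch, assuming that Open MPI wrappers accept --showme and that MPICH-derived wrappers such as Intel MPI accept -show; show_wrapper is a hypothetical helper name, not something provided on the ARC systems:

```shell
#!/bin/bash
# Print the back-end compiler command line that an MPI wrapper would run.
# Assumption: Open MPI wrappers understand --showme, and MPICH-derived
# wrappers (including Intel MPI) understand -show.
show_wrapper() {
    wrapper="$1"
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper: not found (load a toolchain module first)"
        return 1
    fi
    # Try the Open MPI spelling first, then fall back to the MPICH one.
    "$wrapper" --showme 2>/dev/null || "$wrapper" -show
}

# Harmless outside the cluster: reports that the wrapper is missing.
show_wrapper mpicc || true
```

Running it after loading a toolchain module should reveal which back-end compiler (for example gcc or icc) the wrapper calls and which include and library paths it adds.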
MPI on the ARC systems
The ARC clusters have two main MPI implementations installed; however, this guide is intended to be independent of any particular flavour of MPI. The MPI libraries available per cluster system are presented below.
The MPI implementations OpenMPI and Intel-MPI are installed on the ARC and HTC clusters, and are optimised and configured to use the InfiniBand interconnect. Each MPI implementation has several versions installed and may be used with different compilers. All installations are managed through the environment module system.
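As an illustration of working with the module system, the sketch below clears any loaded modules, lists the available versions of one toolchain family and loads a specific version. It assumes the standard module subcommands purge, avail, load and list; toolchain_report is a hypothetical helper name, and foss/2020a is used only because it appears later in this guide:

```shell
#!/bin/bash
# Start from a clean environment, show the available versions of a
# toolchain family, then load one version and list what it pulled in.
toolchain_report() {
    toolchain="${1:-foss/2020a}"
    if ! type module >/dev/null 2>&1; then
        echo "module command not available; run this on a cluster node"
        return 0
    fi
    module purge                              # unload everything first
    module avail "${toolchain%%/*}" 2>&1      # e.g. all foss versions
    module load "$toolchain"
    module list 2>&1
}

toolchain_report foss/2020a
```

Purging first avoids mixing two MPI implementations in one session, which is an easy mistake when switching between toolchains.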
Preparing and Running An Example
Preparation
Log in to one of the ARC clusters and ensure you are running on an interactive node (this is important!). Then create a directory in which to do some work and change into it. The sequence of commands is:
srun -p interactive --pty /bin/bash
cd $DATA
mkdir examples
cd examples
Then, copy the ARC MPI example files to your newly created directory:
cp /apps/common/examples/mpi/* .
Run the command ls to list the copied files. Simple C (cluster_myprog.c) and Fortran (cluster_myprog.f) MPI example codes are provided, along with a submission script, slurm.sh. You can edit and adapt the submission script for the cluster on which you are running the example.
Compiling the application
The compilation and linking of an MPI program is managed by the compiler wrappers (mpicc and mpif77 for GCC, mpiicc and mpiifort for Intel) and performed by the back-end compiler. The MPI wrapper scripts ensure the correct options for MPI operation are supplied to the compiler.
Toolchains
The ARC and HTC systems have a number of compiler, MPI and maths library combinations grouped into toolchains, which are versioned every six months (a and b versions). These are based upon the EasyBuild standard toolchain definitions to ensure reproducibility. For Intel compilers these are named intel and for GCC they are named foss (free open-source software).
For example, the intel/2020a toolchain contains the following components:
module load intel/2020a
module list
Currently Loaded Modules:
1) GCCcore/9.3.0 3) binutils/2.34-GCCcore-9.3.0 5) impi/2019.7.217-iccifort-2020.1.217 7) imkl/2020.1.217-iimpi-2020a
2) zlib/1.2.11-GCCcore-9.3.0 4) iccifort/2020.1.217 6) iimpi/2020a 8) intel/2020a
The foss/2020a toolchain contains:
module load foss/2020a
module list
Currently Loaded Modules:
1) GCCcore/9.3.0 4) GCC/9.3.0 7) libxml2/2.9.10-GCCcore-9.3.0 10) OpenMPI/4.0.3-GCC-9.3.0 13) FFTW/3.3.8-gompi-2020a
2) zlib/1.2.11-GCCcore-9.3.0 5) numactl/2.0.13-GCCcore-9.3.0 8) libpciaccess/0.16-GCCcore-9.3.0 11) OpenBLAS/0.3.9-GCC-9.3.0 14) ScaLAPACK/2.1.0-gompi-2020a
3) binutils/2.34-GCCcore-9.3.0 6) XZ/5.2.5-GCCcore-9.3.0 9) hwloc/2.2.0-GCCcore-9.3.0 12) gompi/2020a 15) foss/2020a
Important Note for Intel toolchain users: When using the intel toolchain, the MPI build wrappers mpicc, mpicxx and mpifc point to the GCC compilers. To use the Intel compilers you should use the wrappers mpiicc, mpiicpc and mpiifort respectively. If you are using a third-party build which cannot be easily modified, you can override the behaviour of the mpicc, mpicxx and mpifc wrappers to use Intel compilers by setting the following environment variables:
export MPICH_CC=icc
export MPICH_FC=ifort
export MPICH_F90=ifort
export MPICH_F77=ifort
export MPICH_CPP="icc -E"
export MPICH_CXX=icpc
export MPICH_CCC=icpc
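If the overrides are only wanted for a single build rather than for the whole shell session, they can be scoped to one command. The following is a minimal sketch; the function name with_intel_backends is hypothetical, and it simply passes the same MPICH_* variables listed above to the command through env:

```shell
#!/bin/bash
# Run one command with the MPICH_* overrides in effect, leaving the
# rest of the shell session untouched.
with_intel_backends() {
    env MPICH_CC=icc MPICH_CXX=icpc MPICH_CCC=icpc \
        MPICH_FC=ifort MPICH_F90=ifort MPICH_F77=ifort \
        MPICH_CPP="icc -E" "$@"
}

# Demonstration: show the values a child process would see.
with_intel_backends sh -c 'echo "CC=$MPICH_CC FC=$MPICH_FC"'
```

A third-party build could then be started as, for example, with_intel_backends make, so that the variables do not leak into later commands.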
Other toolchains/versions can be made available; a list of EasyBuild supported versions can be found here. Please note that, due to operating system compatibility, the ARC systems only support foss/2018b and newer, and intel/2020a and newer.
Compilation
After loading your chosen toolchain module, compile one of the source files:
For the foss toolchain use:
mpicc cluster_myprog.c -o cluster_myprog
Or (for the Fortran code):
mpif77 cluster_myprog.f -o cluster_myprog
For the intel toolchain use:
mpiicc cluster_myprog.c -o cluster_myprog
Or (for the Fortran code):
mpiifort cluster_myprog.f -o cluster_myprog
Run the ls command to verify the executable cluster_myprog was created.
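The choice of wrapper described in this section can be captured in a small helper. This is only a sketch: mpi_wrapper_for and build_example are hypothetical names, and the mapping is limited to the four wrapper and toolchain pairs used in the commands above:

```shell
#!/bin/bash
# Map a toolchain family and source language to the wrapper used in this guide.
mpi_wrapper_for() {
    case "$1:$2" in
        foss:c)        echo mpicc ;;
        foss:fortran)  echo mpif77 ;;
        intel:c)       echo mpiicc ;;
        intel:fortran) echo mpiifort ;;
        *) echo "unknown toolchain/language: $1 $2" >&2; return 1 ;;
    esac
}

# Compile a source file and confirm that an executable was produced.
build_example() {
    wrapper=$(mpi_wrapper_for "$1" "$2") || return 1
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper not found; load the $1 toolchain module first" >&2
        return 1
    fi
    "$wrapper" "$3" -o "$4" && [ -x "$4" ] && echo "built $4 with $wrapper"
}

# Example call, to be run after "module load foss/2020a":
#   build_example foss c cluster_myprog.c cluster_myprog
mpi_wrapper_for intel fortran
```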
Preparing the submission script
Edit the provided submission script, slurm.sh, to input the details of the job. The key lines to pay attention to in the script are:
the request for resources (number of nodes and walltime)
the chosen toolchain and
the mpirun command.
The submission script should look like this for a foss toolchain build:
#!/bin/bash
#SBATCH --job-name=myprog
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=my.name@email.com
module load foss/2020a
mpirun ./cluster_myprog
or for an intel toolchain build:
#!/bin/bash
#SBATCH --job-name=myprog
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=my.name@email.com
module load intel/2020a
mpirun ./cluster_myprog
In this example, SLURM is instructed to allocate 2 nodes (--nodes=2) for 10 minutes (--time=00:10:00). The run is scheduled for 8 MPI processes per node (--ntasks-per-node=8), giving 16 processes in total; this maps each MPI process to a physical core, leading to a (generally) optimal run configuration.
N.B. In ARC there are 48 cores per node but in this example we are only using 8 cores per node.
The command line mpirun ./cluster_myprog runs the executable cluster_myprog built with the appropriate toolchain MPI library.
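The relationship between the resource request and the number of MPI processes can be checked from inside the job. The sketch below uses the SLURM_JOB_NUM_NODES, SLURM_NTASKS_PER_NODE and SLURM_NTASKS variables that SLURM sets in the job environment; the function name job_geometry is hypothetical, and the fallback values only exist so the snippet can be tried outside a job:

```shell
#!/bin/bash
# Report how many MPI processes the job will start.
# The values after ":-" mirror the example script (2 nodes, 8 tasks per
# node) and are used only when the SLURM variables are not set.
job_geometry() {
    nodes="${SLURM_JOB_NUM_NODES:-2}"
    per_node="${SLURM_NTASKS_PER_NODE:-8}"
    total="${SLURM_NTASKS:-$((nodes * per_node))}"
    echo "Launching ${total} MPI processes: ${per_node} on each of ${nodes} nodes"
}

job_geometry
```

Placing a call like this before the mpirun line in slurm.sh would record the job layout at the top of the output file.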
Running the application
After having prepared the submission script, submit the job with:
sbatch slurm.sh
This will print a job number and return control to the Linux prompt at once. Monitor the job's execution using the SLURM squeue command.
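When submitting from a script, it can be convenient to capture the job number rather than read it from the screen. A minimal sketch, assuming the sbatch --parsable option (which prints the job ID, optionally followed by a semicolon and the cluster name); parse_job_id is a hypothetical helper name:

```shell
#!/bin/bash
# Keep only the job ID from "jobid" or "jobid;cluster".
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    jobid=$(parse_job_id "$(sbatch --parsable slurm.sh)")
    if [ -n "$jobid" ]; then
        echo "Submitted job ${jobid}"
        squeue -j "${jobid}"      # show only this job in the queue
    fi
else
    echo "sbatch not found; run this on a cluster login node"
fi
```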
Checking the results
After the job has run, you should have two email notifications (one for the start of the job, one for its end) and a new file in your directory. The SLURM scheduler creates a single output file, slurm-XXXX.out, where XXXX is the job ID number.
The output file slurm-XXXX.out should contain the output from the execution, which you can view with, for example:
cat slurm-XXXX.out
The output should look like this (the lines appear in no particular order because the processes run in parallel):
Process 2 received from process 1
Process 9 received from process 8
Process 1 received from process 0
Process 15 received from process 14
Process 11 received from process 10
Process 13 received from process 12
Process 4 received from process 3
Process 6 received from process 5
Process 12 received from process 11
Process 10 received from process 9
Process 7 received from process 6
Process 8 received from process 7
Process 0 received from process 15
Process 3 received from process 2
Process 5 received from process 4
Process 14 received from process 13
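Because the lines arrive in arbitrary order, it can help to sort them by process number when checking that every process reported. A minimal sketch using the standard sort utility with a numeric key on the second field; sort_by_rank is a hypothetical helper name, and the three input lines are copied from the listing above:

```shell
#!/bin/bash
# Order "Process N received from process M" lines by N (the second field).
sort_by_rank() {
    sort -k2,2n "$@"
}

# Demonstration on three sample lines; on the cluster, pass the real
# output file instead, e.g.: sort_by_rank slurm-XXXX.out
printf '%s\n' \
    "Process 2 received from process 1" \
    "Process 15 received from process 14" \
    "Process 1 received from process 0" | sort_by_rank
```

The numeric key matters: a plain alphabetical sort would place process 15 before process 2.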
MPI Core Allocation (and OpenMP)
In the above examples we have used the SLURM --ntasks-per-node option to allocate a single CPU core to each MPI process. There may be occasions where we want to run fewer MPI processes per node and instead use OpenMP for the remaining allocated cores. We can do this using the --cpus-per-task option.
Below is an example submission script (for OpenMPI) which requests two nodes with 1 MPI process each, where each MPI process can use 8 cores (for OpenMP) - so a total allocation of 16 cores:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:10:00
#SBATCH --partition=devel
module load mpitest/1.0
mpirun --map-by numa:pe=${SLURM_CPUS_PER_TASK} mpisize
The mpisize command, provided by the mpitest module, outputs the following information:
Hello from host "arc-c303". This is MPI task 1, the total MPI Size is 2, and there are 8 CPU core(s) allocated to *this* MPI task, these being { 0 1 2 3 4 5 6 7 }
Hello from host "arc-c302". This is MPI task 0, the total MPI Size is 2, and there are 8 CPU core(s) allocated to *this* MPI task, these being { 0 1 2 3 4 5 6 7 }
From the results above we can see that, as expected, two MPI processes ran, one on node arc-c302 and the other on arc-c303, and each of these processes was allocated 8 CPUs.
Note: The mpirun option --map-by numa:pe=${SLURM_CPUS_PER_TASK} is not required if running with Intel MPI.
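For a hybrid MPI and OpenMP code, the number of threads each MPI process starts usually needs to match the cores allocated to it. The sketch below assumes the application honours the standard OMP_NUM_THREADS variable and derives its value from SLURM_CPUS_PER_TASK, which SLURM sets when --cpus-per-task is given; set_omp_threads is a hypothetical helper name:

```shell
#!/bin/bash
# Match the OpenMP thread count to the cores allocated per MPI process.
# Falls back to 1 thread when --cpus-per-task was not requested.
set_omp_threads() {
    OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
    export OMP_NUM_THREADS
    echo "Each MPI process may start up to ${OMP_NUM_THREADS} OpenMP threads"
}

set_omp_threads
```

In a submission script this call would go before the mpirun line, so that every MPI process inherits the setting.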