Compiling and Running MPI Software

Introduction

This guide is intended to give an overview of what is needed to compile and run MPI software on the ARC cluster systems.

The guide shows how to

  • compile an MPI application,
  • prepare a job submission script and
  • submit the job.

About MPI

MPI stands for Message Passing Interface, an interface standard that defines a number of library routines aimed at the programming of message-passing (distributed-processing) applications.  The interface specifications were designed by a group of researchers from both academia and industry and cover bindings for C, C++ and Fortran.

Being standardised, MPI programming leads to highly portable code.  Nevertheless, the MPI standard has many library implementations (both commercial and open source), and their quality and performance can differ significantly.

Any MPI library implementation has a number of tools that help programmers build and run MPI applications.  The main tools are:

  • compiler utilities and
  • an application run agent.

Compiler utilities (mpicc, mpicxx, mpif77, mpif90) are used to compile and link MPI programs.  These are not compilers as such but wrappers around back-end compilers (e.g. the GNU or Intel compilers) and are designed to make compiling and linking against the MPI library easy.

The run agent launches and manages the execution of an MPI executable on distributed computer systems.  This agent is called mpirun or mpiexec, with mpirun being the more frequently used name.  Some implementations provide both (e.g. OpenMPI), while others replaced the mpirun of their MPI-1 implementation (e.g. MVAPICH) with mpiexec in their MPI-2 implementation (e.g. MVAPICH2).
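To make these tools concrete, a minimal session might look like the sketch below.  This assumes an MPI module is already loaded and uses a hypothetical source file hello.c; the wrapper and launcher commands are the standard ones described above.

# compile and link with the wrapper
mpicc hello.c -o hello

# show the back-end compiler command hidden by the wrapper
# (-show is the MPICH/MVAPICH spelling; OpenMPI wrappers use --showme)
mpicc -show

# launch 4 MPI processes of the resulting executable
mpirun -np 4 ./hello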

    MPI on the ARC systems

    The ARC clusters (arcus-b and arcus-htc) have a mixture of MPI implementations installed, and this guide is intended to be independent of any particular flavour of MPI.  The MPI libraries available on each cluster system are presented below.

    The MPI implementations OpenMPI, MVAPICH and Intel-MPI are installed on the clusters arcus-b and arcus-htc, optimised and configured to use the InfiniBand interconnect.  Each MPI implementation has several versions installed, each built with different compilers.  All installations are managed through the environment modules system.
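    As an illustration, the module system can be queried for the installed MPI libraries; the module names below are indicative only and the exact versions listed will vary per cluster.

    module avail openmpi
    module avail mvapich2
    module avail intel-mpi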

     

    Preparing and Running An Example

    Preparation

    Log in to one of the ARC clusters, create a directory in which to do some work and go to it.  The sequence of commands is:

    cd $DATA
    mkdir examples
    cd examples

     

    Then, copy the ARC MPI examples to your newly created directory:

    cp /system/software/examples/mpi/* .

    Run the command ls to list the copied files.  Simple C (cluster_myprog.c) and Fortran (cluster_myprog.f) MPI example codes are provided.  Also, there are two submission scripts: cluster_myfirst__torque.sh and cluster_myfirst__slurm.sh.  Edit and adapt the submission script for Torque or Slurm, as appropriate for the cluster on which you are running the example.

    Compiling the application

    The compilation and linking of an MPI program are managed by the compiler wrappers mpicc and mpif77 but performed by a back-end compiler.

    Arcus-b

    arcus-b has no default MPI software stack: mpirun and mpicc are not in the default path, and to use any of the MPI library installations on arcus-b the appropriate module has to be loaded.  Each MPI module is set up to also load the module for the back-end compiler with which that MPI library was configured and compiled.  For instance, loading the module mvapich2/2.0.1__intel-2015 makes MVAPICH2 version 2.0.1 available and automatically loads the module for the Intel compilers version 2015.  This MVAPICH2 module is the recommended MPI library to use on arcus-b.
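    For example, the recommended module can be loaded and the result checked as follows; module list shows what is currently loaded and which prints the full path of the wrapper and launcher now on the path.

    module load mvapich2/2.0.1__intel-2015
    module list
    which mpicc mpirun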

    Compilation

    After loading the necessary module, compile one of the source files.  Regardless of the particular MPI library used, the command to compile and link the C source is

    mpicc cluster_myprog.c -o cluster_myprog

     

    while for the Fortran source it is

    mpif77 cluster_myprog.f -o cluster_myprog

     

    Run ls to verify the executable cluster_myprog was created.
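    Because the wrappers pass unrecognised options straight through to the back-end compiler, the usual compiler flags can be added to the command line; the flags below are only an example.

    # optimisation and warning flags are forwarded to the back-end compiler
    mpicc -O2 -Wall cluster_myprog.c -o cluster_myprog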

    Preparing the submission script

    Edit the submission script provided (cluster_myfirst__torque.sh or cluster_myfirst__slurm.sh) to input the details of the job.  The key lines to pay attention to in the script are

    • the request for resources (number of nodes and walltime) and
    • the mpirun command.

    These details are discussed below for each cluster system.

    Arcus-b

    Slurm is the job scheduler on the IBM cluster Arcus-b, so the submission script should look like this

    #!/bin/bash

    #SBATCH --job-name=myprog
    #SBATCH --time=00:10:00
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --mail-type=BEGIN,END
    #SBATCH --mail-user=my.name@email.com

    . enable_arcus-b_mpi.sh

    mpirun $MPI_HOSTS ./cluster_myprog

     

    In this example, Slurm is instructed to allocate 2 nodes (--nodes=2) for 10 minutes (--time=00:10:00).  Also, the run is scheduled for 16 MPI processes per node (--ntasks-per-node=16); this maps each MPI process to a physical core, leading to a (generally) optimal run configuration.

    There are 16 physical cores per node, but this is doubled through Intel hyper-threading technology to 32 virtual cores (more in this FAQ entry).  Some applications (but not all!) may benefit in performance from this technology; users are advised to experiment with their application to determine whether there is a benefit.  If the application benefits from hyper-threading, 32 processes or threads per node should be used (--ntasks-per-node=32); if not, only 16 should be used (--ntasks-per-node=16).
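    As a sketch, switching the example above to hyper-threading only requires changing the tasks-per-node request in the submission script.

    # use all 32 virtual cores of each node (only if the application benefits)
    #SBATCH --ntasks-per-node=32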

    The command line mpirun $MPI_HOSTS ./cluster_myprog runs the executable cluster_myprog built with the MVAPICH2 library.  The environment variable MPI_HOSTS is defined in the shell script enable_arcus-b_mpi.sh, which is sourced just before the mpirun line in the submission script.  MPI_HOSTS hides the details of the distributed run from the user and contains information about the number of MPI processes to run, the file listing the hosts (cluster nodes) on which to run, etc.

    In fact, enable_arcus-b_mpi.sh supports MVAPICH2 as well as OpenMPI and Intel-MPI.  The flag -psm is added automatically to $MPI_HOSTS.
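    The exact contents of MPI_HOSTS depend on the job and on the MPI library selected; if you are curious, they can be recorded in the job output by adding a line such as the following to the submission script just before the mpirun command.

    echo "MPI_HOSTS is: $MPI_HOSTS"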

    Running the application

    After having prepared the submission script, submit the job with

    sbatch cluster_myfirst__slurm.sh

    This will print a job number and return control to the Linux prompt at once.  Monitor its execution using the qstat (Torque) or squeue (Slurm) command.
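    For example, on a Slurm system the following commands list your own jobs and cancel a job (replace XXXX with the job number printed by sbatch).

    squeue -u $USER
    scancel XXXX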

    Checking the results

    After the job has run, you should have two email notifications (one for the start of the job, one for its end) and one or more extra files in your directory.  The Slurm scheduler creates a single output file, slurm-XXXX.out, where XXXX is the job number.

    The output file (myprog.oXXXX under Torque, or slurm-XXXX.out under Slurm) contains the output from the execution, which can be viewed with, for example,

    cat slurm-XXXX.out

     

    The output should look like this (note that the order in which the processes print is not deterministic):

    Process  2  received  from process  1
    Process  9  received  from process  4
    Process  1  received  from process  0
    Process  15 received  from process  14
    Process  11 received  from process  10
    Process  13 received  from process  12
    Process  4  received  from process  3
    Process  6  received  from process  5
    Process  12 received  from process  11
    Process  10 received  from process  9
    Process  7  received  from process  6
    Process  8  received  from process  7
    Process  0  received  from process  16
    Process  2  received  from process  1
    Process  3  received  from process  2
    Process  5  received  from process  4
    Process  14 received  from process  13

     

    Hybrid MPI+OpenMP

    Running a threaded MPI application

    A threaded (hybrid MPI+OpenMP) executable is launched with mpirun in the usual way; with OpenMPI, the -x flag exports environment variables such as OMP_NUM_THREADS to all MPI processes:

    mpirun $MPI_HOSTS -x IPATH_NO_CPUAFFINITY -x OMP_NUM_THREADS <MPI-OMP.exe>

     

    Finally, OpenMPI versions higher than 1.8.0 bind processes automatically by default.  Thus, set

    export OMPI_MCA_hwloc_base_binding_policy=none

    to prevent all the OpenMP threads of a process from being confined to the same core, and make use of the --bind-to <item> flag to bind processes to sockets or other physical subdivisions of the compute node.  See the mpirun man page for more details.
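    Putting the pieces together, a hybrid submission script for arcus-b might look like the sketch below.  The split of 2 MPI processes with 8 OpenMP threads each per node is only an illustrative way to fill the 16 physical cores, the executable name MPI-OMP.exe is a placeholder, and the details may need adjusting to the MPI library in use.

    #!/bin/bash

    #SBATCH --job-name=hybrid
    #SBATCH --time=00:10:00
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=8

    # 8 OpenMP threads per MPI process
    export OMP_NUM_THREADS=8

    # prevent OpenMPI from confining all threads of a process to one core
    export OMPI_MCA_hwloc_base_binding_policy=none

    . enable_arcus-b_mpi.sh

    mpirun $MPI_HOSTS -x IPATH_NO_CPUAFFINITY -x OMP_NUM_THREADS ./MPI-OMP.exe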