Application Guide - R

Introduction

R is a powerful free software environment for statistical computing.  It provides access to a wide variety of statistical methods (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) as well as to graphical techniques.  R is highly extensible and a large number of additional libraries are available.  A number of libraries are already installed on the ARC systems and more can be installed on request.

R has fairly frequent releases and we try to keep up to date.  Check the versions available using module spider R.  To avoid incompatibilities, each new R installation uses its own installation of the supported libraries.

On the ARC systems, R should be run non-interactively, in "batch" mode.

The steps to run an R job are:

  • load the R module;
  • create a submission script;
  • submit the job.

 


Running a simple R job

The simplest submission script for an R job looks like this:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=myRtest

module purge
module load R/4.0.2-foss-2020a

Rscript Rtest.r

The file Rtest.r contains all the R commands needed to run this job.  R is invoked in "batch" mode through Rscript, which reads the file Rtest.r and executes all the commands contained in that script.
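
As an illustration, a minimal Rtest.r might look like the sketch below.  The contents are an assumption made for demonstration only: the simulated data and the output file name results.csv are hypothetical.  Because the job is non-interactive, anything printed goes to the job's output file and results that need to be kept are written to disk.

```r
# Rtest.r: a hypothetical batch script (contents are illustrative only)
set.seed(42)                      # make the simulated data reproducible

# simulate a small data set and fit a linear model
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

# printed output ends up in the output file of the job
print(summary(fit))

# save the fitted coefficients to a file for later inspection
coefs <- coef(fit)
write.csv(data.frame(term = names(coefs), estimate = unname(coefs)),
          file = "results.csv", row.names = FALSE)
```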

Notice the use of a single node (nodes=1); R cannot on its own use more than one node.

Supposing the script listed above is called submit.sh, the R job is sent to the execution queue with the Slurm command

sbatch submit.sh

 


Running R on parallel machines

R was developed mainly as a serial application; the use of multiple cores (available on modern computers) or of multiple compute nodes (available on a cluster) in a single R job is managed through libraries.

Many such libraries are available and the best place to learn about them is the CRAN page on high performance computing using R, which focuses on the parallel processing aspects.  Below, only a few of these options are discussed, and only briefly.

Adding R libraries to your own DATA area

Email us for instructions on adding R libraries to your $DATA area.

Using the Intel MKL library

For optimal performance, the ARC R installation is built using Intel MKL, a highly optimised set of linear algebra libraries that includes Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) implementations, as well as fast Fourier transforms and vectorised math operations.  Through MKL, R naturally makes use of the multiple cores available on a compute node.

For example, the following R code generates a large matrix and computes its singular value decomposition (SVD):

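# define a function that builds the n x n Hilbert matrix, H[i,j] = 1/(i+j-1)
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }

# X is the first 6000 columns of the 9000x9000 Hilbert matrix
X <- hilbert(9000)[,1:6000]

# compute the singular value decomposition of X
s <- svd(X)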

Running this code with the ARC R installation uses the available cores on a compute node automatically, without any further programming by the user.
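
One rough way to check whether the linear algebra is running on several cores is to compare CPU time with wall-clock time.  The sketch below is an assumption-laden illustration, not a benchmark: it uses a much smaller matrix than the example above so that it finishes quickly, and the variable names timing and ratio are hypothetical.  A user/elapsed ratio well above 1 suggests that several threads were active; on a small problem the ratio may stay close to 1 even with a threaded library.

```r
# Hilbert matrix generator, as in the example above
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }

# a much smaller matrix than above, so that the check runs in seconds
X <- hilbert(1000)[, 1:600]

# "user.self" is CPU time summed over all threads, "elapsed" is wall-clock time
timing <- system.time(s <- svd(X))
print(timing)

# number of cores visible on the node (parallel is part of base R)
cat("cores detected on this node:", parallel::detectCores(), "\n")

# guard against a zero elapsed time on very fast machines
ratio <- timing[["user.self"]] / max(timing[["elapsed"]], 1e-3)
cat("user/elapsed ratio:", round(ratio, 2), "\n")
```

Increasing the matrix size gives a more meaningful measurement, at the cost of a longer run.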

Using the multicore library (under construction)

[multicore]

Using the Rmpi library (under construction)

[rmpi]

Other R-parallel options (under construction)

  • Snow: works well in a traditional cluster environment
  • Multicore: popular for multiprocessor and multicore computers
  • Parallel: part of base R since the 2.14.0 release
  • R+Hadoop: provides low-level access to a popular form of cluster computing
  • RHIPE: uses Hadoop’s power with R’s language and interactive shell
  • Segue: lets you use Elastic MapReduce as a backend for lapply-style operations
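
As a brief illustration of the lapply-style approach, the sketch below uses mclapply from the parallel package to spread independent tasks over the cores of a single node.  Several names here are assumptions: the function slow_square is a hypothetical stand-in for real work, and the worker count is read from the Slurm variable SLURM_CPUS_PER_TASK, which is only set when the job requests --cpus-per-task, with a fallback of 2 otherwise.

```r
library(parallel)

# hypothetical task: some independent work to do for each input value
slow_square <- function(i) { Sys.sleep(0.1); i^2 }

# worker count: taken from Slurm if available, otherwise fall back to 2
n_workers <- suppressWarnings(
  as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "2")))
if (is.na(n_workers) || n_workers < 1) n_workers <- 2L

# mclapply forks the R process, so it works on Linux but not on Windows
res <- mclapply(1:8, slow_square, mc.cores = n_workers)

print(unlist(res))
```

This approach stays within one node, consistent with the single-node submission script shown earlier.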

Bibliography

Q. Ethan McCallum and Stephen Weston, Parallel R: Data Analysis in the Distributed World, O'Reilly Media, 2011.

 


Non-interactive R graphs

R graphs are normally created, manipulated and saved interactively.  Within a batch job, graphs can be created and manipulated in the normal way, but no interactive graphics are displayed on a monitor.  The graphs can nevertheless be saved as files in the PDF or PostScript formats.

For example, the following commands

data <- c(1, 3, 6, 4, 9)
# open the postscript device first; the plotting commands then draw into the file
postscript("myplot.ps")
plot(data, type="o", col="blue")
title(main="some data", col.main="blue", font.main=4)
# close the device to finish writing the file
dev.off()

produce a graph, saved in the PostScript format in the file myplot.ps.  Because the postscript device is opened before plotting, the graph does not appear in a graph window in R; rather, it is written straight to the file specified.  Note that the file is complete only after the call to dev.off.

Both PDF and PostScript image files are resizable without loss of image quality and can be transformed to other file formats using the convert utility (part of ImageMagick) on Linux.  PostScript files are easy to use in documents written with MS Word, OpenOffice or LaTeX.
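
The same plot can be saved as a PDF file by opening the pdf device before plotting.  The sketch below is illustrative only: the file name myplot.pdf and the page size are assumptions.

```r
data <- c(1, 3, 6, 4, 9)

# open the pdf device; width and height are given in inches
pdf("myplot.pdf", width = 7, height = 5)

# plotting commands now draw into the file rather than a window
plot(data, type = "o", col = "blue")
title(main = "some data", col.main = "blue", font.main = 4)

# close the device so that the file is completely written
dev.off()
```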