Using R on ARC

Introduction

R is a powerful free software environment for statistical computing.  It provides access to a wide variety of statistical methods (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) as well as to graphical techniques.  R is highly extensible and a large number of additional libraries are available.  A number of libraries are already installed on the ARC systems and more can be installed on request.

R has fairly frequent releases and we try to keep up to date.  Check the versions available using module spider R.  To avoid incompatibilities, each new R installation uses its own installation of the supported libraries.

On the ARC systems, R should be run non-interactively, in "batch" mode.

The steps to run an R job are:

  • load the R module;
  • create a submission script;
  • submit the job.
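In outline, a typical session looks like this (the module version and script name are examples; the details of each step are covered below):

```shell
# 1. Load the R module (version shown is an example)
module load R/4.0.2-foss-2020a

# 2. Create a submission script, e.g. submit.sh (see "Running a simple R job")

# 3. Submit the job to the scheduler
sbatch submit.sh
```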
 
To use R on the ARC systems, load an R environment module. In this example we use version 4.0.2, which can be loaded using the following command:
 
module load R/4.0.2-foss-2020a
 
To see all available versions of R, use:
 
module spider R 
The base install has many popular R packages installed. It is possible that you will need access to packages which are not installed in the central repository. You can install R libraries in an R library repository within your storage area (e.g. $HOME or $DATA); please see below.
 
Please note: Some R libraries depend on the existence of non-R applications or other shared binaries. Attempting to install an R library with binary dependencies may fail. In this case please contact the ARC team and we will install the dependencies for you centrally. 
 

Installing packages into your own R library

As this is an interactive process which may involve building software, it needs to be performed on an interactive node, so first start an interactive session:
 
srun -p interactive --pty /bin/bash
 
In order to use your own R library repository, you need to define an environment variable named "R_LIBS" containing the path to your local packages. This variable needs to be set each time you intend to use your local library, so you may wish to set it in your $HOME/.bash_profile file:

export R_LIBS=~/local/rlibs
Please note: If you do not place the above line in your $HOME/.bash_profile file, you will need to ensure you include it in your submission scripts in order for R to find your locally installed libraries.
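For example, a submission script that uses a locally installed library might look like this (the module version and the script name myscript.r are illustrative):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00

module purge
module load R/4.0.2-foss-2020a

# Make the local library repository visible to R in the batch job
export R_LIBS=~/local/rlibs

Rscript myscript.r
```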
 
You can then create this folder (note: this only needs to be done once):

mkdir -p ~/local/rlibs
 
Once this is done you can start R and run the install.packages command to install packages into this local library repository, or follow the installation instructions given for a particular package. As an example, to install the latest devtools package:

 

[user@arc-c001]$ R

R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("devtools", lib="~/local/rlibs")

 

For packages such as BiocManager, you will need to add the lib location to both the BiocManager installation and subsequent BiocManager::install commands, for example:

 

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", lib="~/local/rlibs")

BiocManager::install("dada2", version = "3.11", lib="~/local/rlibs")
 
 

You may find you need to use the http URL protocol rather than https for some repositories.
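Once a package has been installed into your local repository, it can be loaded in the usual way provided R_LIBS is set; you can also pass the location explicitly. A short sketch, assuming devtools was installed as above:

```r
# With R_LIBS set, the local repository is searched automatically
library(devtools)

# Alternatively, give the library location explicitly
library(devtools, lib.loc = "~/local/rlibs")

# .libPaths() lists the library directories R is currently searching
.libPaths()
```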

Important note: As some R libraries require compilation in order to be installed, it is worth noting that the interactive nodes share the same CPU architecture as the majority of ARC/HTC compute nodes but not the login nodes. This allows users to optimise their compiled code to make use of the most recent CPU features. However, if you attempt to load a library built on the interactive nodes from the login nodes or older compute nodes you may see execution errors such as "Illegal instruction" or "Illegal Operand" - these errors are simply warning you that the CPU cannot understand the more recent instructions the compiler has generated. 

To mitigate the above issue:

1) If you want to test the library interactively, please ensure you use an interactive node - not a login node for this purpose.

2) If you submit a batch job, ensure that you specify:

#SBATCH --constraint='cpu_gen:Cascade_Lake'

The above will ensure your job runs on a node with the same architecture as the interactive node you used to build the library.

Running a simple R job

The simplest submission script for an R job looks like this:

#!/bin/bash 
#SBATCH --nodes=1 
#SBATCH --time=01:00:00 
#SBATCH --job-name=myRtest 

module purge 
module load R/4.0.2-foss-2020a 

Rscript Rtest.r

The file Rtest.r contains all the R commands needed to run this job.  R is invoked in "batch" mode via Rscript, which reads the file Rtest.r and executes all the commands contained in that script.
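For example, Rtest.r might contain a short computation such as the following (purely illustrative):

```r
# Rtest.r - a minimal example script for a batch R job
x <- rnorm(1000)            # 1000 random values from a standard normal

cat("mean:", mean(x), "\n")  # printed output appears in the SLURM log
cat("sd:  ", sd(x), "\n")

# Save the results for later inspection
saveRDS(x, file = "x.rds")
```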

Notice the use of a single node (nodes=1); R cannot on its own use more than one node.

Supposing the script listed above is called submit.sh, the R job is sent to the execution queue with the SLURM command sbatch:

sbatch submit.sh

Running R on parallel machines

R has mainly been developed as a serial application and the exploitation of multiple cores (available on modern computers) or of multiple compute nodes (available on a cluster) in a single R job is managed through libraries.

Many such libraries are available and the best place to learn about them is the CRAN page on high performance computing using R, which focuses on the parallel processing aspects.  Below, only a few of these options are discussed, and only briefly.
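As a simple illustration, the parallel package (part of base R) can distribute an lapply-style computation over the cores of a single node. A sketch; reading the core count from SLURM_CPUS_PER_TASK is an assumption about how your job was configured:

```r
library(parallel)

# Use the number of cores allocated to the job; under SLURM this can be
# read from the SLURM_CPUS_PER_TASK environment variable (assumption -
# adjust to match your submission script), with a fallback of 4
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "4"))

# Run a (toy) function over 100 inputs in parallel via forked workers
results <- mclapply(1:100, function(i) sqrt(i), mc.cores = ncores)

# results is an ordinary list, in input order
head(unlist(results))
```

Note that mclapply uses process forking and therefore works on the Linux compute nodes but cannot use more than one node; for multi-node parallelism a library such as Rmpi is needed.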

Using the multicore library (under construction)

[multicore]

Using the Rmpi library (under construction)

[rmpi]

Other R-parallel options (under construction)

  • Snow: works well in a traditional cluster environment
  • Multicore: popular for multiprocessor and multicore computers
  • Parallel: included in base R since version 2.14.0
  • R+Hadoop: provides low-level access to a popular form of cluster computing
  • RHIPE: uses Hadoop’s power with R’s language and interactive shell
  • Segue: lets you use Elastic MapReduce as a backend for lapply-style operations

Bibliography

Q. Ethan McCallum, Stephen Weston, Parallel R, Data Analysis in the Distributed World, O'Reilly Media, 2011

 


Non-interactive R graphs

R graphs are normally created, manipulated and saved interactively. Within batch jobs, graphs can be created and manipulated in the normal way, but without interactive graphics being displayed on a monitor. The graphs can nevertheless be saved as files in PDF or PostScript format.

For example, the following commands

data <- c(1, 3, 6, 4, 9)
plot(data, type="o", col="blue")
title(main="some data", col.main="blue", font.main=4)
dev.copy(device=postscript, file="myplot.ps")
dev.off()

produce a graph, saved in PostScript format in the file myplot.ps. The graph does not appear in a graph window in R; rather, it is written straight to the file specified. Note that the plot is written to the file only after the call to dev.off.

Both PDF and PostScript image files are resizable without loss of image quality and can be converted to other file formats using the Linux convert utility. PostScript files are easy to use in documents written in MS Word, OpenOffice, or LaTeX.
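An alternative is to open the file device directly before plotting, so the graph is written to the file as it is drawn; for example, to produce a PDF:

```r
# Open a PDF device; all subsequent plotting commands draw into this file
pdf("myplot.pdf", width = 7, height = 5)

data <- c(1, 3, 6, 4, 9)
plot(data, type="o", col="blue")
title(main="some data", col.main="blue", font.main=4)

# Close the device to flush the plot to myplot.pdf
dev.off()
```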