What do I need to know once I get an account?
Once you have received an email confirming that your account on ARC has been created, you are ready to start using the service. At this stage, we advise you to spend some time browsing our support pages, which provide a lot of information on how to use the ARC systems. Below are a number of questions new users typically have, along with direct links to the answers in the support pages.
Which ARC system should I use?
ARC operates a number of high performance systems, each suited to a different kind of workload. Workloads range from large numbers of individual "embarrassingly parallel" jobs (which may be better suited to our High Throughput Cluster), through Machine Learning/Artificial Intelligence applications (which may be better tasked to our GPU nodes), to jobs that harness the computational capabilities of many nodes (hundreds or thousands of cores) in a concerted fashion and require much inter-processor communication, which could be handled by our "Capability" cluster (ARCUS). An overview of the systems available and node types can be found here.
What basic terminology do I have to be aware of?
Using the ARC service means you will be running software on a computer cluster. The cluster has a few login nodes and many compute nodes. The login nodes are mainly for login access and for submitting and monitoring jobs, while the compute nodes are for running the jobs themselves. The compute nodes are linked by a specialised InfiniBand network, which ensures fast data transfers (high bandwidth and low latency).
The jobs are mapped from a queue to resources by a job scheduler. Each compute node has two CPUs, each of which has a number of cores. When a job runs a parallel application on a compute node, each core can run a process or a thread.
How do I log in to my account? How do I copy data to/from ARC?
Please see our ARC user guide.
What software do you support?
A summary list of the most popular software packages and libraries installed is given in the Installed software page. That page also gives a link to the complete list of software installed, which is regularly and automatically generated from the available modules.
There is a good chance the application you plan to run is already installed, but if not, please contact the ARC staff to ask for it to be installed. As part of the service, we install or upgrade software from the public domain at your request. We also test and validate each installation and can advise you on the optimal usage on the ARC systems.
ARC also supports a few commercial applications, which can be used by all users on the ARC systems at no direct cost. Additionally, we support a selection of packages for particular research groups that have purchased commercial licences from grant income, and we restrict access to those packages to users associated with the respective research groups.
If you are interested in running a particular application on ARC that has a commercial licence associated with it, and the application is not already supported for universal use on ARC, we expect you to provide the licence as well as a mechanism for using it on the ARC systems, for example a licence server.
How do I run software X?
ARC has written a series of application guides to illustrate the use of some of the most popular packages on the service.
The key to the successful use of the ARC service is parallel processing.
One way to achieve this is to distribute the workload of a single large simulation across a large number of processes working in a concerted way on several compute nodes. This is possible only if the application is designed to do that, which is most frequently achieved via parallel programming using the MPI standard. Many popular scientific packages are in that category because it is the only way to run very large simulations. Running an application written and built with MPI is simple, and the run can be flexibly scaled to any number of nodes desired; more about running MPI applications here.
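The idea of distributing one workload over many cooperating processes can be illustrated with a minimal sketch of a block decomposition. This is not an MPI program: the function names `block_range` and `partial_sum` are hypothetical, and the ranks are simulated in a plain loop. In a real MPI application the rank and the number of processes come from the MPI library, and the partial results are combined by a reduction.

```python
def block_range(n_items: int, n_ranks: int, rank: int) -> range:
    """Return the contiguous slice of work owned by one rank.

    The first (n_items % n_ranks) ranks get one extra item, so the
    load differs by at most one item between any two ranks.
    """
    if not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def partial_sum(values, n_ranks, rank):
    """Work done by a single rank: sum only its own block."""
    return sum(values[i] for i in block_range(len(values), n_ranks, rank))


if __name__ == "__main__":
    data = list(range(10))
    n_ranks = 4  # hypothetical number of MPI processes
    # Each simulated rank handles only its own block of the data.
    partials = [partial_sum(data, n_ranks, r) for r in range(n_ranks)]
    for r, p in enumerate(partials):
        owned = list(block_range(len(data), n_ranks, r))
        print(f"rank {r}: items {owned} -> partial sum {p}")
    print("total:", sum(partials))
```

Because each rank touches only its own block, adding ranks shrinks the work per process, which is why an application of this kind can use more nodes to finish sooner.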
Another way to exploit the parallel nature of the hardware is to restrict a run to a single node and harness all the cores available for processing. This can be done either by running several processes at the same time or by running a single process with multiple threads of execution. Ideally, this should have support from the application itself, and normally it is controlled via input or command line options. Please refer to the available application documentation to find these options.
Some software has resource discovery capabilities: the libraries underlying Matlab, R or Python, for example, can automatically exploit all cores available without user intervention, but only for a limited set of functionality (dense linear algebra and FFT). This "magic" is limited, and users should not rely on software to automatically use resources optimally. Unfortunately, there is no substitute for understanding how to write job scripts effectively, and we recommend you consider our training offering. If in doubt about any aspect of running a particular application, contact us.
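Rather than relying on automatic discovery, the thread count of such libraries can usually be set explicitly. The sketch below is an illustration under stated assumptions, not ARC's recommended configuration: it assumes a Linux node and numerical back-ends that honour the widely used `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS` and `MKL_NUM_THREADS` environment variables. The helper names `usable_cores` and `thread_settings` are hypothetical.

```python
import os

# Environment variables read by common threaded numerical back-ends
# (OpenMP runtimes, OpenBLAS, Intel MKL).
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def usable_cores() -> int:
    """Number of cores this process is allowed to run on."""
    try:
        # Linux: reflects the CPU set the process is restricted to.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total core count.
        return os.cpu_count() or 1


def thread_settings(n_threads: int) -> dict:
    """Build the environment settings for a given thread count."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return {name: str(n_threads) for name in THREAD_VARS}


if __name__ == "__main__":
    n = usable_cores()
    # Set these before importing the numerical library: the size of the
    # thread pool is usually read once, when the library is loaded.
    os.environ.update(thread_settings(n))
    print(f"using {n} threads:", thread_settings(n))
```

Setting the count explicitly makes the behaviour of a job predictable, whichever node it lands on.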
Warning: software that is not designed to run in a distributed mode (usually via MPI) cannot run on more than one node. Asking for more than one node in the execution script leads to wasted resources: the software will only run on the first node allocated, leaving the other(s) idle for the duration of the job. For example, running a Matlab, R or Python job on two nodes will not be faster than on a single node.
Lastly, the reality of scientific software is that there are many applications that are serial, i.e. run as a single-threaded process. These can still exploit the availability of multiple cores on a node by "packing" multiple runs in a single job, as described here.
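The packing idea can be sketched in a few lines: start several independent serial runs inside one job and keep a fixed number of them going at once. The sketch below is only an illustration and is not necessarily the method described in the ARC guide. The function names `run_one` and `run_packed` are hypothetical, and tiny Python one-liners stand in for the real serial program.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(command):
    """Run one serial program and return its exit code and output."""
    done = subprocess.run(command, capture_output=True, text=True)
    return done.returncode, done.stdout.strip()


def run_packed(commands, max_concurrent):
    """Run many serial commands, at most max_concurrent at a time.

    Threads are used only to wait on the child processes; each child is
    a separate operating-system process and can occupy its own core.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() returns results in the same order as the commands.
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Hypothetical workload: eight independent runs of a serial program.
    commands = [
        [sys.executable, "-c", f"print({i} * {i})"] for i in range(8)
    ]
    for code, output in run_packed(commands, max_concurrent=4):
        print(code, output)
```

In practice `max_concurrent` would be chosen to match the number of cores requested for the job, so that no core sits idle and none is oversubscribed.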
What is the maximum time a job can run for?
The maximum is 120 hours for most users, but it depends on the level of service; please check here.
If the application is of the distributed type, which most of the time means an MPI application, the processing time of a future job can be reduced by distributing the load over more nodes, i.e. by increasing the number of nodes requested for the job.
If the application can only run on a single node, there is no easy way to reduce the processing time, assuming all the cores available are already used effectively.
For all types of workloads, distributed or single node, we encourage users to check whether the application they use can checkpoint, i.e. save its state to disk at intervals. This allows the processing to be stopped at any time, thus controlling the length of a job, and subsequently restarted as part of another job. This is by far the preferred way to keep the job runtime under the limit.
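For applications you write yourself, checkpointing can be as simple as saving the loop state to a file and reloading it at start-up. The following is a minimal sketch under stated assumptions: the state is a small dictionary, the file name `state.json` and the functions `load_checkpoint`, `save_checkpoint` and `run` are hypothetical, and a running sum stands in for the real computation.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the checkpoint,
    so that a job killed mid-write cannot leave a corrupt file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run(path, n_steps, max_steps_this_job):
    """Advance the computation by at most max_steps_this_job steps."""
    state = load_checkpoint(path)
    budget = max_steps_this_job
    while state["step"] < n_steps and budget > 0:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        budget -= 1
        # Saved every step here for simplicity; a real application
        # would checkpoint far less often.
        save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
    print(run(ckpt, n_steps=10, max_steps_this_job=6))  # first job
    print(run(ckpt, n_steps=10, max_steps_this_job=6))  # restarted job
```

The second call picks up at step 6 rather than starting again, which is what lets a long computation be split across several jobs that each fit within the time limit.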
Although 120 hours is generous by the standards of HPC data centres, it may not be enough for long runs of software that is not HPC-ready. Recognising the value of your research, we can help by extending the wall time of jobs on request, provided that checkpointing is not an option.
Is there training?
We would also encourage you to sign up to our training courses, which cover a range of subjects including an introduction to using the ARC facility.