Ampere GPU Nodes

Also known as Wilkes3.

These new nodes will enter general service in October 2021.

Hardware

The Ampere (A100) nodes are 80 Dell PowerEdge XE8545 servers, each consisting of:

  • 2x AMD EPYC 7763 64-Core Processor 1.8GHz (128 cores in total)
  • 1000 GiB RAM
  • 4x NVIDIA A100-SXM-80GB GPUs
  • Dual-rail Mellanox HDR200 InfiniBand interconnect

and each A100 GPU contains 6912 FP32 CUDA Cores.
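
Once you have a shell on one of these nodes (see below for requesting one interactively), you can confirm the hardware for yourself. The following is a minimal sketch using standard tools; the grep pattern is only illustrative:

nvidia-smi -L                            # lists the four A100-SXM-80GB GPUs
lscpu | grep -E 'Model name|^CPU\(s\)'   # shows the EPYC 7763 processors and 128 cores
free -g                                  # total memory in GiB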

Software

The A100 nodes run CentOS8 (in contrast to the older CSD3 nodes which at the time of writing run CentOS7). This means that for best results you are strongly recommended to rebuild your software on these nodes rather than try to run binaries previously compiled on CSD3.

The nodes are named according to the scheme gpu-q-[1-80].

In order to obtain an interactive node, request it using sintr:

sintr -t 4:0:0 --exclusive -A YOURPROJECT-GPU -p ampere
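
Before requesting a node you can check how busy the new partition is with the usual Slurm query tools, for example (a minimal sketch):

sinfo -p ampere              # state of the gpu-q nodes
squeue -p ampere -u $USER    # your own jobs in the ampere partition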

Slurm partition

  • The A100 (gpu-q) nodes are in a new ampere Slurm partition. Your existing -GPU projects will be able to submit jobs to this.
  • The gpu-q nodes have 128 cpus (1 cpu = 1 core) and 1000 GiB of RAM. Since each node has 4 GPUs, Slurm will allocate 32 cpus per GPU (this can be checked from inside a job, as shown in the sketch after this list).
  • The gpu-q nodes are interconnected by HDR200 InfiniBand. The currently recommended MPI library is loaded as a module by default when shells on these nodes are initialized - please see Jobs requiring MPI for more information.
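
As a quick check of the proportional CPU allocation, request a single GPU interactively and inspect the environment Slurm sets up. This is a sketch only: YOURPROJECT-GPU is a placeholder, and it assumes sintr passes --gres through to Slurm in the same way as the other options shown above:

sintr -t 0:30:0 --gres=gpu:1 -A YOURPROJECT-GPU -p ampere
echo $SLURM_CPUS_ON_NODE     # should report 32 cpus for a 1-GPU allocation
nvidia-smi -L                # should list exactly one GPU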

Recommendations for running on ampere

Since the gpu-q nodes are running CentOS8, you will want to recompile your applications. We suggest you do this by requesting an interactive node.

The per-job wallclock time limits are currently unchanged compared to skylake/pascal/knl at 36 hours and 12 hours for SL1/2 and SL3 respectively.

The per-job, per-user GPU limits are currently 64 and 32 GPUs for SL1/2 and SL3 respectively.

These limits should be regarded as provisional and may be revised.
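
If you would rather query the limits currently in force than rely on this page, they are visible through the Slurm accounting tools; which QOS applies to your jobs depends on your service level, so treat this as a sketch:

sacctmgr show qos format=Name,MaxWall,MaxTRESPU    # wallclock and per-user TRES (e.g. GPU) limits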

Default submission script for ampere

You should find a template submission script modified for the ampere nodes at:

/usr/local/Cluster-Docs/SLURM/slurm_submit.wilkes3

This is set up for non-MPI jobs, but can be modified for other types of job. If you prefer to modify your existing job scripts, please see the following sections for guidance.
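
For orientation, the following is a minimal sketch of an ampere job script for a single-GPU, non-MPI job; YOURPROJECT-GPU and ./myprog are placeholders, and the module shown is the default A100 environment described under Jobs requiring MPI below:

#!/bin/bash
#SBATCH -J mygpujob
#SBATCH -A YOURPROJECT-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Load the default environment for the A100 (gpu-q) nodes
module purge
module load rhel8/default-amp

# Run the (placeholder) application
./myprog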

Jobs requiring N GPUs where N < 4

Although there are 4 GPUs in each node, it is possible to request fewer, e.g. to request 3 GPUs use:

#SBATCH --nodes=1
#SBATCH --gres=gpu:3
#SBATCH -p ampere

Slurm will enforce allocation of a proportional number of CPUs (32) per GPU.

Note that if you either do not specify a number of GPUs per node with --gres, or request more than one node with fewer than 4 GPUs per node, you will receive an error on submission.

Jobs requiring multiple nodes

Multi-node jobs need to request either exclusive access to the nodes, or 4 GPUs per node, i.e.:

#SBATCH --exclusive

or

#SBATCH --gres=gpu:4
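
For example, a two-node job using all 8 GPUs might be set up as follows. This is a sketch only: ./myprog is a placeholder and one MPI rank per GPU is assumed (see Jobs requiring MPI below for the recommended MPI environment):

#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH -p ampere

# Launch one MPI rank per GPU (8 ranks across 2 nodes)
mpirun -npernode 4 ./myprog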

Jobs requiring MPI

We currently recommend using the version of OpenMPI loaded by default on the A100 nodes, which has been configured specifically for them. If you wish to recompile or test against this new environment, we recommend requesting an interactive node.

For reference, the default environment on the A100 (gpu-q) nodes is provided by loading a module as follows:

module purge
module load rhel8/default-amp
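
With that module loaded you can check which OpenMPI you are picking up and rebuild against it; a minimal sketch (hello.c and myapp are placeholders):

module list                  # confirm rhel8/default-amp and its OpenMPI are loaded
which mpicc                  # should point at the OpenMPI in the default environment
mpicc -O2 -o myapp hello.c   # rebuild your MPI application on a gpu-q node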

However, since the CPU type on gpu-q differs from that of every other node on the cluster, and the operating system is a later version than elsewhere, it is not recommended to build software intended to run on gpu-q on a different flavour of node.