Ampere GPU Nodes

Also known as Wilkes3.

These new nodes entered general service in October 2021 and were expanded in July 2023.

Hardware

The Ampere (A100) partition comprises 90 Dell PowerEdge XE8545 servers, each consisting of:

  • 2x AMD EPYC 7763 64-Core Processor 1.8GHz (128 cores in total)
  • 1000 GiB RAM
  • 4x NVIDIA A100-SXM-80GB GPUs
  • Dual-rail Mellanox HDR200 InfiniBand interconnect

Each A100 GPU contains 6912 FP32 CUDA Cores.

Software

The A100 nodes run Rocky Linux 8, which is a rebuild of Red Hat Enterprise Linux 8 (RHEL8). This is in contrast to the older CSD3 nodes, which at the time of writing run CentOS 7, a rebuild of Red Hat Enterprise Linux 7 (RHEL7). For best results you are therefore strongly recommended to rebuild your software on these nodes rather than attempt to run binaries previously compiled elsewhere on CSD3.

The nodes are named according to the scheme gpu-q-[1-90].

In order to obtain an interactive node, request it using sintr:

sintr -t 4:0:0 --exclusive -A YOURPROJECT-GPU -p ampere

Slurm partition

  • The A100 (gpu-q) nodes are in a new ampere Slurm partition. Your existing -GPU projects will be able to submit jobs to this.
  • The gpu-q nodes have 128 cpus (1 cpu = 1 core) and 1000 GiB of RAM. This means that Slurm will allocate 32 cpus per GPU (see the sketch after this list).
  • The gpu-q nodes are interconnected by HDR200 InfiniBand. The currently recommended MPI library is loaded as a module by default when shells on these nodes are initialized - please see Jobs requiring MPI for more information.
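
For example, a minimal sketch of a single-GPU job that reports what Slurm has actually allocated (the header values are illustrative; with 1 GPU requested the reported CPU count should be 32):

#!/bin/bash
#SBATCH -A YOURPROJECT-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=0:10:0

# CPUs allocated to the job on this node (expected to be 32 for 1 GPU)
echo "CPUs allocated: $SLURM_CPUS_ON_NODE"

# GPUs visible to the job
nvidia-smi -L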

Recommendations for running on ampere

Since the gpu-q nodes are running Rocky8, you will want to recompile your applications. We suggest you do this by requesting an interactive node.

The per-job wallclock time limits are 36 hours and 12 hours for SL1/2 and SL3 respectively.

The per-job, per-user GPU limits are currently 64 and 32 GPUs for SL1/2 and SL3 respectively.

These limits should be regarded as provisional and may be revised.
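
If you want to check the limits currently in force, the settings attached to the partition can be inspected from a login node, e.g.:

scontrol show partition ampere

Note that some limits (such as per-user GPU counts) may be enforced through QoS settings rather than on the partition itself.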

Default submission script for ampere

You should find a template submission script modified for the ampere nodes at:

/usr/local/Cluster-Docs/SLURM/slurm_submit.wilkes3

This is set up for non-MPI jobs, but can be modified for other types of job. If you prefer to modify your existing job scripts, please see the following sections for guidance.
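
If you prefer to write a script from scratch, the following is a minimal sketch of a single-node, non-MPI job (the job name, project, GPU count, time limit and application name are placeholders):

#!/bin/bash
#SBATCH -J myjob
#SBATCH -A YOURPROJECT-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00

# Load the default environment for the A100 nodes
module purge
module load rhel8/default-amp

# Run the application (placeholder name)
./myapp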

Jobs requiring N GPUs where N < 4

Although there are 4 GPUs in each node it is possible to request fewer than this, e.g. to request 3 GPUs use:

#SBATCH --nodes=1
#SBATCH --gres=gpu:3
#SBATCH -p ampere

Slurm will enforce allocation of a proportional number of CPUs (32) per GPU.

Note that if you either do not specify a number of GPUs per node with --gres, or request more than one node with fewer than 4 GPUs per node, you will receive an error on submission.

Jobs requiring multiple nodes

Multi-node jobs need to request either exclusive access to the nodes, or 4 GPUs per node, i.e.:

#SBATCH --exclusive

or

#SBATCH --gres=gpu:4
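
For example, a complete resource request for a 2-node job using all GPUs on each node might look like this (the node count is illustrative):

#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH -p ampere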

Jobs requiring MPI

We currently recommend using the version of OpenMPI loaded by default on the A100 nodes, which has been built and configured specifically for these nodes. If you wish to recompile or test against this new environment, we recommend requesting an interactive node.

For reference, the default environment on the A100 (gpu-q) nodes is provided by loading a module as follows:

module purge
module load rhel8/default-amp

However since the CPU type on gpu-q is different from any other on the cluster, and the operating system is a later version than elsewhere, it is not recommended to build software intended to run on gpu-q on a different flavour of node.
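
As a sketch of the recommended workflow, assuming a simple MPI and/or CUDA code (the source and binary names are hypothetical, and the availability of a CUDA compiler depends on the modules loaded by the default environment):

# From a login node, request an interactive gpu-q node
sintr -t 1:0:0 --exclusive -A YOURPROJECT-GPU -p ampere

# On the gpu-q node, load the default A100 environment
module purge
module load rhel8/default-amp

# Rebuild against the compilers and MPI provided by that environment
mpicc -O2 -o mycode mycode.c

# If a CUDA toolkit module is available, CUDA sources can be rebuilt similarly
nvcc -O2 -o mykernels mykernels.cu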

Performance considerations for MPI jobs

On systems with multiple GPUs and multiple NICs, such as Wilkes3 with 2x HDR NICs and 4x A100 GPUs per node, care should be taken that each GPU communicates with its closest NIC, in order to maximise GPU-NIC throughput. Furthermore, each GPU should be assigned to its closest set of CPU cores (NUMA domain). This can be achieved by querying the topology of the machine you are running on (using nvidia-smi topo -m), and then instrumenting your MPI and/or run script to enforce the correct placement. On Wilkes3, each pair of GPUs shares a NIC, so the NIC local to each pair should be used for all non-peer-to-peer communication.
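
For example, the GPU/NIC/CPU topology can be inspected on an allocated gpu-q node (not on a login node) as follows:

# Affinity matrix showing which NIC and CPU cores are closest to each GPU
nvidia-smi topo -m

# NUMA domains and the CPU cores and memory belonging to each
numactl --hardware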

An example binding script for doing this with OpenMPI - which is the default MPI module for the Ampere nodes - is the following:

#!/bin/bash

# First argument is the executable; any remaining arguments are passed to it
EXE=$1
shift
ARGS="$@"
APP="$EXE $ARGS"

# This is the list of GPUs we have
GPUS=(0 1 2 3)

# This is the list of NICs we should use for each GPU
# e.g., associate GPUs 0,1 with MLX0, GPUs 2,3 with MLX1
NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1)

# This is the list of CPU cores we should use for each GPU
# On the Ampere nodes we have 2x64 core CPUs, each organised into 4 NUMA domains
# We will use only a subset of the available NUMA domains, i.e. 1 NUMA domain per GPU
# The NUMA domain closest to each GPU can be extracted from nvidia-smi
CPUS=(48-63 16-31 112-127 80-95)

# This is the list of memory domains we should use for each GPU
MEMS=(3 1 7 5)

# Number of physical CPU cores per GPU (optional)
export OMP_NUM_THREADS=16

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export CUDA_VISIBLE_DEVICES=${GPUS[${lrank}]}
export UCX_NET_DEVICES=${NICS[${lrank}]}
numactl --physcpubind=${CPUS[${lrank}]} --membind=${MEMS[${lrank}]} $APP

Given the above binding script (assume it is named run.sh), the corresponding MPI launch command can be modified to:

mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options

Note that this approach requires exclusive access to a node.
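
For completeness, here is a sketch of how the binding script might be driven from a full submission script; the application name and options are placeholders, and the task counts are derived in one possible way from Slurm environment variables:

#!/bin/bash
#SBATCH -J mympijob
#SBATCH -A YOURPROJECT-GPU
#SBATCH -p ampere
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --exclusive
#SBATCH --time=04:00:00

module purge
module load rhel8/default-amp

application="./myapp"   # placeholder application
options=""              # placeholder options

# One MPI task per GPU, 4 GPUs per node
mpi_tasks_per_node=4
np=$((SLURM_JOB_NUM_NODES * mpi_tasks_per_node))

mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options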