Ampere GPU Nodes
================

*Also known as Wilkes3.*

*These nodes entered general service in October 2021 and were expanded in July 2023.*

Hardware
--------

The Ampere (A100) nodes are:

- 90 Dell PowerEdge XE8545 servers, each consisting of:

  - 2x AMD EPYC 7763 64-Core Processor 1.8GHz (128 cores in total)
  - 1000 GiB RAM
  - 4x NVIDIA A100-SXM-80GB GPUs
  - Dual-rail Mellanox HDR200 InfiniBand interconnect

Each A100 GPU contains 6912 FP32 CUDA cores.

Software
--------

The A100 nodes run `Rocky Linux 8`_, a rebuild of Red Hat Enterprise Linux 8 (RHEL8). This is in contrast to the older CSD3 nodes, which at the time of writing run CentOS7_, a rebuild of Red Hat Enterprise Linux 7 (RHEL7). For best results you are therefore strongly recommended to rebuild your software on these nodes rather than try to run binaries previously compiled elsewhere on CSD3.

The nodes are named according to the scheme *gpu-q-[1-90]*. To obtain an interactive node, request it using *sintr*::

  sintr -t 4:0:0 --exclusive -A YOURPROJECT-GPU -p ampere

.. _`Rocky Linux 8`: https://rockylinux.org/
.. _CentOS7: https://www.centos.org/

Slurm partition
---------------

* The A100 (gpu-q) nodes are in a new **ampere** Slurm partition. Your existing -GPU projects will be able to submit jobs to this partition.

* The gpu-q nodes have **128 cpus** (1 cpu = 1 core) and 1000 GiB of RAM. This means that Slurm will allocate **32 cpus per GPU**.

* The gpu-q nodes are interconnected by HDR200 InfiniBand. The currently recommended MPI library is loaded as a module by default when shells on these nodes are initialized - please see `Jobs requiring MPI`_ for more information.

Recommendations for running on ampere
-------------------------------------

Since the gpu-q nodes run Rocky8, you will want to recompile your applications. We suggest you do this by requesting an `interactive node`__.

The per-job wallclock time limits are 36 hours and 12 hours for SL1/2 and SL3 respectively. The per-job, per-user GPU limits are currently 64 and 32 GPUs for SL1/2 and SL3 respectively. These limits should be regarded as provisional and may be revised.

.. __: `Software`_

Default submission script for ampere
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should find a template submission script adapted for the ampere nodes at::

  /usr/local/Cluster-Docs/SLURM/slurm_submit.wilkes3

This is set up for non-MPI jobs, but can be modified for other types of job. If you prefer to modify your existing job scripts, please see the following sections for guidance.

Jobs requiring N GPUs where N < 4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although there are 4 GPUs in each node, it is possible to request fewer. For example, to request 3 GPUs use::

  #SBATCH --nodes=1
  #SBATCH --gres=gpu:3
  #SBATCH -p ampere

Slurm will enforce allocation of a proportional number of CPUs (32) per GPU. Note that if you either do not specify a number of GPUs per node with *--gres*, or request more than one node with fewer than 4 GPUs per node, you will receive an error on submission.

Jobs requiring multiple nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Multi-node jobs need to request either exclusive access to the nodes::

  #SBATCH --exclusive

or 4 GPUs per node::

  #SBATCH --gres=gpu:4

Jobs requiring MPI
^^^^^^^^^^^^^^^^^^

We currently recommend using *the version of OpenMPI loaded by default on the A100 nodes*, which has been configured specifically for these nodes. If you wish to recompile or test against this new environment, we recommend requesting an `interactive node`__.
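As an illustration, a minimal multi-node MPI submission script for the ampere partition might look like the following sketch. The job name, project, time limit and application are placeholders, and the *slurm_submit.wilkes3* template described above remains the recommended starting point::

  #!/bin/bash
  #SBATCH -J my-mpi-job                 # placeholder job name
  #SBATCH -A YOURPROJECT-GPU            # your -GPU project
  #SBATCH -p ampere
  #SBATCH --nodes=2
  #SBATCH --gres=gpu:4                  # 4 GPUs per node, as required for multi-node jobs
  #SBATCH --ntasks-per-node=4           # one MPI rank per GPU
  #SBATCH --time=12:00:00

  . /etc/profile.d/modules.sh           # initialise the module command
  module purge
  module load rhel8/default-amp         # default environment on the A100 nodes

  mpirun -np $SLURM_NTASKS ./your_application   # placeholder executable

Note that Slurm will allocate 32 cpus per GPU automatically, so there is no need to request CPUs explicitly.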
For reference, the default environment on the A100 (gpu-q) nodes is provided by loading a module as follows::

  module purge
  module load rhel8/default-amp

However, since the CPU type on gpu-q is different from any other on the cluster, and the operating system is a later version than elsewhere, it is not recommended to build software intended to run on gpu-q on a different flavour of node.

Performance considerations for MPI jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On systems with multiple GPUs and multiple NICs, such as Wilkes3 with 2x HDR NICs and 4x A100 GPUs per node, care should be taken to ensure that each GPU communicates with its closest NIC, in order to maximise GPU-NIC throughput. Furthermore, each GPU should be assigned to its closest set of CPU cores (NUMA domain). This can be achieved by querying the topology of the machine you are running on (using *nvidia-smi topo -m*), and then instrumenting your MPI and/or run script to ensure correct placement.

On Wilkes3, each pair of GPUs shares a NIC, so we need to ensure that the NIC local to each pair is used for all non-peer-to-peer communication. An example binding script for doing this with OpenMPI - which is the default MPI module for the Ampere nodes - is the following::

  #!/bin/bash
  EXE=$1
  ARGS="${@:2}"       # all remaining arguments, not just the second
  APP="$EXE $ARGS"

  # This is the list of GPUs we have
  GPUS=(0 1 2 3)

  # This is the list of NICs we should use for each GPU
  # e.g., associate GPUs 0,1 with MLX0, GPUs 2,3 with MLX1
  NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1)

  # This is the list of CPU cores we should use for each GPU
  # On the Ampere nodes we have 2x 64-core CPUs, each organised into 4 NUMA domains
  # We will use only a subset of the available NUMA domains, i.e. 1 NUMA domain per GPU
  # The NUMA domain closest to each GPU can be extracted from nvidia-smi
  CPUS=(48-63 16-31 112-127 80-95)

  # This is the list of memory domains we should use for each GPU
  MEMS=(3 1 7 5)

  # Number of physical CPU cores per GPU (optional)
  export OMP_NUM_THREADS=16

  lrank=$OMPI_COMM_WORLD_LOCAL_RANK

  export CUDA_VISIBLE_DEVICES=${GPUS[${lrank}]}
  export UCX_NET_DEVICES=${NICS[${lrank}]}

  numactl --physcpubind=${CPUS[${lrank}]} --membind=${MEMS[${lrank}]} $APP

Given the above binding script (assume its name is run.sh), the corresponding MPI launch command can be modified to::

  mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options

Note that this approach requires exclusive access to a node.

.. __: `Software`_
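For completeness, a sketch of how the launch command above might sit inside a batch script is shown below. The project, time limit, application path and options are placeholders, and the variable names are simply chosen to match those used in the launch command::

  #!/bin/bash
  #SBATCH -A YOURPROJECT-GPU
  #SBATCH -p ampere
  #SBATCH --nodes=2
  #SBATCH --exclusive                       # whole nodes, as required by this binding approach
  #SBATCH --time=12:00:00

  . /etc/profile.d/modules.sh               # initialise the module command
  module purge
  module load rhel8/default-amp             # default environment on the A100 nodes

  application="./your_application"          # placeholder executable
  options="input.dat"                       # placeholder arguments
  mpi_tasks_per_node=4                      # one MPI rank per GPU
  np=$(( SLURM_JOB_NUM_NODES * mpi_tasks_per_node ))

  mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options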