Ampere GPU Nodes
================

*Also known as Wilkes3.*

*These nodes entered general service in October 2021 and were expanded in July 2023.*

Hardware
--------

The Ampere (A100) nodes are:

- 90 Dell PowerEdge XE8545 servers, each consisting of:

  - 2x AMD EPYC 7763 64-Core Processor 1.8GHz (128 cores in total)
  - 1000 GiB RAM
  - 4x NVIDIA A100-SXM-80GB GPUs
  - Dual-rail Mellanox HDR200 InfiniBand interconnect

Each A100 GPU contains 6912 FP32 CUDA cores.

Software
--------

The A100 nodes run `Rocky Linux 8`_, a rebuild of Red Hat Enterprise Linux 8 (RHEL8). This is in contrast to the older CSD3 nodes, which at the time of writing run CentOS7_, a rebuild of Red Hat Enterprise Linux 7 (RHEL7). For best results you are therefore strongly recommended to rebuild your software on these nodes rather than try to run binaries previously compiled elsewhere on CSD3.

The nodes are named according to the scheme *gpu-q-[1-90]*. To obtain an interactive node, request it using *sintr*::

  sintr -t 4:0:0 --exclusive -A YOURPROJECT-GPU -p ampere

.. _`Rocky Linux 8`: https://rockylinux.org/
.. _CentOS7: https://www.centos.org/

Slurm partition
---------------

* The A100 (gpu-q) nodes are in a new **ampere** Slurm partition. Your existing -GPU projects will be able to submit jobs to this partition.

* The gpu-q nodes have **128 cpus** (1 cpu = 1 core) and 1000 GiB of RAM. This means that Slurm will allocate **32 cpus per GPU**.

* The gpu-q nodes are interconnected by HDR200 InfiniBand. The currently recommended MPI library is loaded as a module by default when shells on these nodes are initialized - please see `Jobs requiring MPI`_ for more information.

Recommendations for running on ampere
-------------------------------------

Since the gpu-q nodes run Rocky8, you will want to recompile your applications. We suggest you do this by requesting an `interactive node`__.

The per-job wallclock time limits are 36 hours and 12 hours for SL1/2 and SL3 respectively. The per-job, per-user GPU limits are currently 64 and 32 GPUs for SL1/2 and SL3 respectively. These limits should be regarded as provisional and may be revised.

.. __: `Software`_

Default submission script for ampere
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should find a template submission script adapted for the ampere nodes at::

  /usr/local/Cluster-Docs/SLURM/slurm_submit.wilkes3

This is set up for non-MPI jobs, but can be modified for other types of job. If you prefer to modify your existing job scripts, please see the following sections for guidance.

Jobs requiring N GPUs where N < 4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although there are 4 GPUs in each node, it is possible to request fewer. For example, to request 3 GPUs use::

  #SBATCH --nodes=1
  #SBATCH --gres=gpu:3
  #SBATCH -p ampere

Slurm will enforce allocation of a proportional number of CPUs (32) per GPU. Note that if you either do not specify a number of GPUs per node with *--gres*, or request more than one node with fewer than 4 GPUs per node, you will receive an error on submission.

Jobs requiring multiple nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Multi-node jobs need to request either exclusive access to the nodes::

  #SBATCH --exclusive

or 4 GPUs per node::

  #SBATCH --gres=gpu:4

Jobs requiring MPI
^^^^^^^^^^^^^^^^^^

We currently recommend using *the version of OpenMPI loaded by default on the A100 nodes*, which has been configured specifically for these nodes. If you wish to recompile or test against this new environment, we recommend requesting an `interactive node`__.
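As an illustration, a minimal multi-node MPI submission script for the ampere partition might look like the following sketch. The job name, project, time limit and application are placeholders, and the *slurm_submit.wilkes3* template described above remains the recommended starting point::

  #!/bin/bash
  #SBATCH -J my-mpi-job                 # placeholder job name
  #SBATCH -A YOURPROJECT-GPU            # your -GPU project
  #SBATCH -p ampere
  #SBATCH --nodes=2
  #SBATCH --gres=gpu:4                  # 4 GPUs per node, as required for multi-node jobs
  #SBATCH --ntasks-per-node=4           # one MPI rank per GPU
  #SBATCH --time=12:00:00

  . /etc/profile.d/modules.sh           # initialise the module command
  module purge
  module load rhel8/default-amp         # default environment on the A100 nodes

  mpirun -np $SLURM_NTASKS ./your_application   # placeholder executable

Note that Slurm will allocate 32 cpus per GPU automatically, so there is no need to request CPUs explicitly.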
For reference, the default environment on the A100 (gpu-q) nodes is provided by loading a module as follows::

  module purge
  module load rhel8/default-amp

However, since the CPU type on gpu-q is different from any other on the cluster, and the operating system is a later version than elsewhere, it is not recommended to build software intended to run on gpu-q on a different flavour of node.

Performance considerations for MPI jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On systems with multiple GPUs and multiple NICs, such as Wilkes3 with 2x HDR NICs and 4x A100 GPUs per node, care should be taken to ensure that each GPU communicates with its closest NIC, in order to maximise GPU-NIC throughput. Furthermore, each GPU should be assigned to its closest set of CPU cores (NUMA domain). This can be achieved by querying the topology of the machine you are running on (using *nvidia-smi topo -m*), and then instrumenting your MPI and/or run script to ensure correct placement.

On Wilkes3, each pair of GPUs shares a NIC, so we need to ensure that the NIC local to each pair is used for all non-peer-to-peer communication. An example binding script for doing this with OpenMPI - which is the default MPI module for the Ampere nodes - is the following::

  #!/bin/bash
  EXE=$1
  ARGS="${@:2}"       # all remaining arguments, not just the second
  APP="$EXE $ARGS"

  # This is the list of GPUs we have
  GPUS=(0 1 2 3)

  # This is the list of NICs we should use for each GPU
  # e.g., associate GPUs 0,1 with MLX0, GPUs 2,3 with MLX1
  NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1)

  # This is the list of CPU cores we should use for each GPU
  # On the Ampere nodes we have 2x 64-core CPUs, each organised into 4 NUMA domains
  # We will use only a subset of the available NUMA domains, i.e. 1 NUMA domain per GPU
  # The NUMA domain closest to each GPU can be extracted from nvidia-smi
  CPUS=(48-63 16-31 112-127 80-95)

  # This is the list of memory domains we should use for each GPU
  MEMS=(3 1 7 5)

  # Number of physical CPU cores per GPU (optional)
  export OMP_NUM_THREADS=16

  lrank=$OMPI_COMM_WORLD_LOCAL_RANK

  export CUDA_VISIBLE_DEVICES=${GPUS[${lrank}]}
  export UCX_NET_DEVICES=${NICS[${lrank}]}

  numactl --physcpubind=${CPUS[${lrank}]} --membind=${MEMS[${lrank}]} $APP

Given the above binding script (assume its name is run.sh), the corresponding MPI launch command can be modified to::

  mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options

Note that this approach requires exclusive access to a node.

.. __: `Software`_
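For completeness, a sketch of how the launch command above might sit inside a batch script is shown below. The project, time limit, application path and options are placeholders, and the variable names are simply chosen to match those used in the launch command::

  #!/bin/bash
  #SBATCH -A YOURPROJECT-GPU
  #SBATCH -p ampere
  #SBATCH --nodes=2
  #SBATCH --exclusive                       # whole nodes, as required by this binding approach
  #SBATCH --time=12:00:00

  . /etc/profile.d/modules.sh               # initialise the module command
  module purge
  module load rhel8/default-amp             # default environment on the A100 nodes

  application="./your_application"          # placeholder executable
  options="input.dat"                       # placeholder arguments
  mpi_tasks_per_node=4                      # one MPI rank per GPU
  np=$(( SLURM_JOB_NUM_NODES * mpi_tasks_per_node ))

  mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options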