Dawn - Intel GPU (PVC) Nodes¶
These new nodes entered Early Access service in January 2024.
Hardware¶
The Dawn (PVC) system comprises 256 Dell PowerEdge XE9640 servers, each consisting of:
- 2x Intel(R) Xeon(R) Platinum 8468 (formerly codenamed Sapphire Rapids) (96 cores in total)
- 1024 GiB RAM
- 4x Intel(R) Data Center GPU Max 1550 GPUs (formerly codenamed Ponte Vecchio) (128 GiB GPU RAM each)
- Xe-Link 4-way GPU interconnect within the node
- Quad-rail NVIDIA (Mellanox) HDR200 InfiniBand interconnect
Each PVC GPU contains two stacks (previously known as tiles) and 1024 compute units.
Software¶
At the time of writing, we recommend logging in initially to the CSD3 login-icelake nodes (login-icelake.hpc.cam.ac.uk). To ensure your environment is clean and set up correctly for Dawn, please purge your modules and load the base Dawn environment:
module purge
module load default-dawn
The PVC nodes run Rocky Linux 8, which is a rebuild of Red Hat Enterprise Linux 8 (RHEL8). The Sapphire Rapids CPUs on these nodes are also more modern and support newer instructions than the CPUs in most other CSD3 partitions.
As we provide a separate set of modules specifically for Dawn nodes, we do not in general support running software built for other CSD3 partitions on Dawn. You are therefore strongly recommended to rebuild your software on the Dawn nodes rather than trying to run binaries previously compiled elsewhere on CSD3.
Be aware that the software environment for Dawn is optimised for its hardware, and binaries built against it may fail to run on other CSD3 nodes, including the cpu (login-p) and icelake (login-q) login nodes. If you wish to recompile or test against this new environment, we recommend requesting an interactive node with the command:
sintr -t 01:00:00 -A YOURPROJECT-DAWN-GPU -p pvc -n 1 -c 24 --gres=gpu:1
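Once the interactive session starts you will have a shell on one of the PVC nodes. As a quick sanity check (a sketch only; you may also need to load one of the intel-oneapi-compilers modules to obtain the tool), you can list the SYCL devices visible to the oneAPI runtime:
module purge
module load default-dawn
sycl-ls    # should report the Intel Data Center GPU Max devices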
The nodes are named according to the scheme pvc-s-[1-256].
Slurm partition¶
The PVC (pvc-s) nodes are in a new pvc Slurm partition.
Dawn Slurm projects follow the CSD3 convention for GPU projects and contain units of GPU hours; Dawn project names follow the pattern NAME-DAWN-GPU.
Recommendations for running on Dawn¶
The resource limits are currently set to a maximum of 64 GPUs per user with a maximum wallclock time of 36 hours per job.
These limits should be regarded as provisional and may be revised.
Default submission script for Dawn¶
A template submission script will be provided soon. In the meantime, to submit a job to the Dawn PVC partition, your batch script should look similar to the following example:
#!/bin/bash -l
#SBATCH --job-name=my-batch-job
#SBATCH --account=<your Dawn SLURM account>
#SBATCH --partition=pvc # Dawn PVC partition
#SBATCH -n 4 # Number of tasks (usually number of MPI ranks)
#SBATCH -c 24 # Number of cores per task
#SBATCH --gres=gpu:1 # Number of requested GPUs per node
module purge
module load default-dawn
# Set up environment below for example by loading more modules
srun <your_application>
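Assuming the script above is saved as my-batch-job.sh (the filename here is just an example), it can be submitted and monitored with the standard Slurm commands:
sbatch my-batch-job.sh    # submit the job; Slurm reports the job ID
squeue -u $USER           # check the state of your queued and running jobs
cat slurm-<jobid>.out     # inspect the output file once the job has run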
Jobs requiring N GPUs where N < 4¶
Although there are 4 GPUs in each node, it is possible to request fewer, e.g. to request 3 GPUs use:
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
#SBATCH -p pvc
Jobs requiring multiple nodes¶
Multi-node jobs need to request 4 GPUs per node, i.e.:
#SBATCH --gres=gpu:4
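For example, the resource request for a two-node job might look like the following sketch, which assumes one MPI rank per GPU; adjust the task and core counts to suit your application:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # e.g. one MPI rank per GPU
#SBATCH -c 24                 # cores per task (4 x 24 = 96 cores per node)
#SBATCH --gres=gpu:4          # all 4 GPUs on each node
#SBATCH -p pvc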
Jobs requiring MPI¶
We currently recommend using the Intel MPI Library provided by the oneAPI toolkit:
module av intel-oneapi-mpi
To use GPU-aware MPI and allow passing device buffers to MPI calls, set the I_MPI_OFFLOAD environment variable to 1 in your submission script:
export I_MPI_OFFLOAD=1
If you are sure that your code only involves buffers of the same type (e.g. only GPU buffers or only host buffers) in a single MPI operation, you can further optimise MPI communication between GPUs by setting:
export I_MPI_OFFLOAD_SYMMETRIC=1
This will disable handling of MPI communication between GPU buffers and host buffers.
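Putting this together, a minimal MPI submission script might look like the sketch below (module names other than default-dawn, and the choice of two nodes with one rank per GPU, are illustrative assumptions):
#!/bin/bash -l
#SBATCH --account=<your Dawn SLURM account>
#SBATCH --partition=pvc
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # one MPI rank per GPU
#SBATCH -c 24
#SBATCH --gres=gpu:4

module purge
module load default-dawn
module load intel-oneapi-mpi  # exact module/version name may differ

# Enable GPU-aware MPI so device buffers can be passed to MPI calls
export I_MPI_OFFLOAD=1
# Optional: only if each MPI operation uses buffers of a single type
# export I_MPI_OFFLOAD_SYMMETRIC=1

srun <your_application>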
Multithreading jobs¶
If your code uses multithreading (e.g. host-based OpenMP), you will need to specify the number of threads per task in your Slurm batch script using the cpus-per-task parameter. For example, to run a hybrid MPI-OpenMP application with 24 tasks and 4 threads per task:
#SBATCH -n 24 # or --ntasks
#SBATCH -c 4  # or --cpus-per-task
If you do _not_ specify the cpus-per-task parameter, Slurm will pin all of a task's threads to the same core, reducing performance.
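Many OpenMP runtimes also honour the OMP_NUM_THREADS environment variable; a common pattern (a sketch, not Dawn-specific guidance) is to derive it from the Slurm allocation so the two always agree:
# Use one OpenMP thread per core allocated to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun <your_application>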
Recommended Compilers¶
We recommend using the Intel oneAPI compilers for C, C++ and Fortran:
module avail intel-oneapi-compilers
These compilers support standard, host-based code as well as SYCL for C++ codes and OpenMP offload in C, C++ and Fortran. Please note that the ‘classic’ Intel compilers (icc, icpc and ifort) have been deprecated or removed; only the ‘new’ compilers (icx, icpx and ifx) are supported, and only these are able to target the GPUs.
To enable SYCL support:
icpx -fsycl
For OpenMP offload (note -fiopenmp, not -fopenmp):
# C
icx -fiopenmp -fopenmp-targets=spir64
# Fortran
ifx -fiopenmp -fopenmp-targets=spir64
The Intel MPI and oneMKL performance libraries both support the CPUs and the PVC GPUs, and can be found as follows:
module av intel-oneapi-mpi
module av intel-oneapi-mkl
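As an illustration (the flags and wrapper names below are assumptions based on standard oneAPI usage rather than Dawn-specific documentation), SYCL code can typically be linked against oneMKL with the -qmkl option, and MPI codes built with the Intel MPI wrappers around the new compilers:
# SYCL C++ code linked against oneMKL
icpx -fsycl -qmkl my_code.cpp -o my_app

# MPI codes, using the Intel MPI wrappers around icx/icpx/ifx
mpiicx  my_mpi_code.c   -o my_mpi_app_c
mpiicpx my_mpi_code.cpp -o my_mpi_app_cpp
mpiifx  my_mpi_code.f90 -o my_mpi_app_f90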
Other recommendations¶
Further useful information about running on Intel GPUs can be found in Intel’s oneAPI GPU Optimization Guide.
Machine Learning & Data Science frameworks¶
We provide a set of pre-populated Conda environments based on the Intel Distribution for Python:
module av intelpython-conda
conda info -e
This module provides environments for PyTorch and TensorFlow.
Please note that Intel code and documentation sometimes refers to ‘XPUs’, a more generic term for accelerators, GPU or otherwise. For Dawn, ‘XPU’ and ‘GPU’ can usually be considered interchangeable.
PyTorch¶
PyTorch on Intel GPUs is supported by the Intel Extension for PyTorch. On Dawn this version of PyTorch is accessible as a conda environment named pytorch-gpu:
module load intelpython-conda
conda activate pytorch-gpu
Adapting your code to run on the PVCs is straightforward and only requires a few lines of code. For details, see the official documentation, but as a quick example:
import torch
import intel_extension_for_pytorch as ipex
...
# Enable GPU
model = model.to('xpu')
data = data.to('xpu')
model = ipex.optimize(model, dtype=torch.float32)
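A quick way to confirm that PyTorch can see the GPUs from within the pytorch-gpu environment (this check assumes the extension exposes the torch.xpu namespace, as it does in recent releases):
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available(), torch.xpu.device_count())"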
TensorFlow¶
Intel supports optimised TensorFlow on both CPU and GPU, using the Intel Extension for TensorFlow. On Dawn this version of TensorFlow is accessible as a conda environment named tensorflow-gpu:
module load intelpython-conda
conda activate tensorflow-gpu
To run on the PVCs, there should be no need to modify your code; the Intel-optimised implementation will run automatically on the GPU, assuming it has been installed as intel-extension-for-tensorflow[xpu].
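To confirm that TensorFlow can see the GPUs from within the tensorflow-gpu environment, you can list the devices registered under the 'XPU' device type (an assumption based on how the extension typically exposes them):
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('XPU'))"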
Jax/OpenXLA¶
Documentation can be found on GitHub: Intel OpenXLA
Julia¶
This is currently known not to work correctly on PVC GPUs. (Mar 2024)