Performance Tips

Compiler Information and Options

The manual pages for the different compiler suites are available:

GCC

Fortran: man gfortran
C/C++: man gcc

Intel

Fortran: man ifort
C/C++: man icc

Useful compiler options

Whilst different codes will benefit from compiler optimisations in different ways, we suggest the following compiler options as a starting point for reasonable performance:

Intel: -O2
GNU: -O2 -ftree-vectorize -funroll-loops -ffast-math

To target the specific hardware on CSD3 use the following options:

Partition	Intel	GCC
cclake	`-xCASCADELAKE`	`-march=cascadelake`
icelake	`-xICELAKE-SERVER`	`-march=icelake-server`
sapphire	`-xSAPPHIRERAPIDS`	`-march=sapphirerapids`
ampere		`-march=znver3`

Alternatively, log in to a machine with the same architecture as the one you will be running on and use:

Intel: -xHost
GNU: -march=native
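
As a minimal sketch of how the partition-specific GCC flags in the table above could be selected in a build script, the following defines a small helper. The function name `gcc_arch_flag` and the source file `myprog.c` are hypothetical and used only for illustration; the partition names and flags are taken from the table.

```shell
#!/bin/bash
# Hypothetical helper: map a CSD3 partition name to the GCC -march flag
# listed in the table above.
gcc_arch_flag() {
  case "$1" in
    cclake)   echo "-march=cascadelake" ;;
    icelake)  echo "-march=icelake-server" ;;
    sapphire) echo "-march=sapphirerapids" ;;
    ampere)   echo "-march=znver3" ;;
    *)        echo "unknown partition: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the compile line for the icelake partition.
# myprog.c is a placeholder source file.
echo "gcc -O2 $(gcc_arch_flag icelake) -o myprog myprog.c"
```

Cross-compiling this way avoids having to build on a node of the target partition, at the cost of keeping the mapping up to date by hand.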

When you have an application that you are happy is working correctly and has reasonable performance, you may wish to investigate some more aggressive compiler optimisations. Below are some further optimisations that you can try on your application. Note that these optimisations may result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions:

Intel: -fast
GNU: -Ofast -funroll-loops

Vectorisation, one of the most important compiler optimisations for modern Intel hardware, is enabled as follows:

Intel: At -O2 and above
GNU: At -O3 and above or when using -ftree-vectorize
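
To check whether a particular loop was actually vectorised, GCC can be asked to report the loops it vectorised using `-fopt-info-vec-optimized`. The following is a sketch only: the file name `saxpy.c` and the loop itself are illustrative assumptions, and the exact report text depends on the compiler version.

```shell
#!/bin/bash
# Write a small loop that GCC can usually vectorise.
# saxpy.c is a placeholder name chosen for this example.
workdir=$(mktemp -d)
cat > "$workdir/saxpy.c" <<'EOF'
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
EOF

# -fopt-info-vec-optimized asks GCC to report each loop it vectorised
# (the report is written to stderr).
if command -v gcc >/dev/null 2>&1; then
    gcc -O3 -fopt-info-vec-optimized -c "$workdir/saxpy.c" -o "$workdir/saxpy.o"
else
    echo "gcc not found; skipping compile"
fi
```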

To promote integer and real variables from four-byte to eight-byte precision in Fortran codes, the following compiler flags can be used:

Intel: -real-size 64 -integer-size 64 -xAVX (Sometimes the Intel compiler incorrectly generates AVX2 instructions if the -real-size 64 or -r8 options are set. Using the -xAVX option prevents this.)
GNU: -freal-4-real-8 -finteger-4-integer-8
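
One way to confirm that the GNU promotion flags took effect is to compile a small test program with and without them and compare the reported storage sizes. This is a minimal sketch; the program and the file name `sizes.f90` are assumptions made for illustration.

```shell
#!/bin/bash
# Write a tiny Fortran program that prints the size in bits of a
# default real and a default integer. sizes.f90 is a placeholder name.
workdir=$(mktemp -d)
cat > "$workdir/sizes.f90" <<'EOF'
program sizes
  implicit none
  real    :: r
  integer :: i
  ! storage_size returns the size in bits of its argument
  print *, 'real bits:', storage_size(r), ' integer bits:', storage_size(i)
end program sizes
EOF

# Build once with default kinds and once with the promotion flags,
# then run both so the outputs can be compared.
if command -v gfortran >/dev/null 2>&1; then
    gfortran "$workdir/sizes.f90" -o "$workdir/sizes_default" \
        && "$workdir/sizes_default"
    gfortran -freal-4-real-8 -finteger-4-integer-8 "$workdir/sizes.f90" \
        -o "$workdir/sizes_promoted" && "$workdir/sizes_promoted"
else
    echo "gfortran not found; skipping compile"
fi
```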

GPU Direct (GDR)

One of the key technologies to get the most performance out of the GPU system is GDR. This allows GPUs to communicate via MPI without waiting for the host CPU. This is implemented in both OpenMPI (the default) and MVAPICH2.

The functionality and performance of GDR can be tested using the OSU micro-benchmarks suite. For example, using OpenMPI:

module purge
module load rhel7/default-gpu

OSU_HOME=$HOME/osu-micro-benchmarks-5.4.2

unset OMP_NUM_THREADS

echo WITH GDR


mpirun -np 2 --map-by ppr:1:node  \
  --mca mtl ^mxm --mca pml ^yalla --mca btl self,openib --mca btl_openib_want_cuda_gdr 1 \
  $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency -f D D

and using MVAPICH2:

module purge
module load rhel7/default-gpu
module unload gcc-5.4.0-gcc-4.8.5-fis24gg openmpi-1.10.7-gcc-5.4.0-jdc7f4f
module load mvapich2-GDR/gnu/2.3a_cuda-8.0

OSU_HOME="${MPI_HOME}/libexec/osu-micro-benchmarks"

unset OMP_NUM_THREADS

export MV2_ENABLE_AFFINITY=1
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1

mpirun -np 2 -ppn 1 -genvall $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency D D

MVAPICH2 includes a copy of the benchmark suite in its distribution, whereas for OpenMPI a custom version has been downloaded and built.