Performance Tips
================

Compiler Information and Options
--------------------------------

The manual pages for the different compiler suites are available:

GCC
    Fortran
        ``man gfortran``

    C/C++
        ``man gcc``

Intel
    Fortran
        ``man ifort``

    C/C++
        ``man icc``

Useful compiler options
~~~~~~~~~~~~~~~~~~~~~~~

Different codes benefit from compiler optimisations in different ways,
but as a starting point for reasonable performance we suggest the
following compiler options:

Intel
    ``-O2``
GNU
    ``-O2 -ftree-vectorize -funroll-loops -ffast-math``

To target the specific hardware on CSD3 use the following options:

+-----------+----------------------+-------------------------------+
| Partition | Intel                | GCC                           |
+===========+======================+===============================+
| cclake    | ``-xCASCADELAKE``    | ``-march=cascadelake``        |
+-----------+----------------------+-------------------------------+
| icelake   | ``-xICELAKE-SERVER`` | ``-march=icelake-server``     |
+-----------+----------------------+-------------------------------+
| sapphire  | ``-xSAPPHIRERAPIDS`` | ``-march=sapphirerapids``     |
+-----------+----------------------+-------------------------------+
| ampere    |                      | ``-march=znver3``             |
+-----------+----------------------+-------------------------------+
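
As a minimal sketch of how the table might be used in a build script, the
shell function below returns the GCC ``-march`` flag for a given partition.
The function name ``gcc_arch_flag`` and the file names ``my_app`` and
``my_app.c`` are hypothetical and serve only as illustration::

    # Map a CSD3 partition name to the GCC -march flag listed in the table.
    # gcc_arch_flag is a hypothetical helper, not a command provided on CSD3.
    gcc_arch_flag() {
        case "$1" in
            cclake)   echo "-march=cascadelake" ;;
            icelake)  echo "-march=icelake-server" ;;
            sapphire) echo "-march=sapphirerapids" ;;
            ampere)   echo "-march=znver3" ;;
            *)        echo "unknown partition: $1" >&2; return 1 ;;
        esac
    }

    # Combine the suggested baseline GNU options with the icelake target
    # and print the resulting compile command (my_app.c is a placeholder).
    flags="-O2 -ftree-vectorize -funroll-loops -ffast-math $(gcc_arch_flag icelake)"
    echo "gcc $flags -o my_app my_app.c"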

Alternatively, log in to a machine with the same architecture as the one you
will be running on and use:

Intel
    ``-xHost``
GNU
    ``-march=native``
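
Because ``-march=native`` targets the CPU of the machine on which the compiler
is run, it can be worth checking what it resolves to before building. A
minimal sketch, assuming a GCC module is loaded on the node::

    # Show which -march value GCC selects for the CPU of the current node.
    gcc -march=native -Q --help=target | grep -- '-march='

If the reported architecture is not the one you expect for your target
partition, the resulting binary may be poorly tuned for that partition or may
fail to run on it.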


When you have an application that you are confident is working correctly and
performs reasonably, you may wish to investigate more aggressive compiler
optimisations. Below are some further options that you can try on your
application. Note that these optimisations may result in incorrect output
for programs that depend on an exact implementation of IEEE or ISO
rules/specifications for math functions. The same caveat applies to
``-ffast-math`` in the GNU options suggested above.

Intel
    ``-fast``
GNU
    ``-Ofast -funroll-loops``

Vectorisation, one of the most important compiler optimisations on modern
CPUs, is enabled as follows:

Intel
    At ``-O2`` and above
GNU
    At ``-O3`` and above or when using ``-ftree-vectorize``
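
To check whether a particular loop is actually vectorised, GCC can report its
decisions with ``-fopt-info-vec``. The sketch below is illustrative only: the
``saxpy`` routine and the temporary file names are made up for the example,
and the exact wording of the report varies between GCC versions::

    # Write a small test loop to a temporary directory (names are arbitrary).
    workdir=$(mktemp -d)
    printf '%s\n' \
        'void saxpy(int n, float a, const float *restrict x, float *restrict y)' \
        '{' \
        '    for (int i = 0; i < n; i++)' \
        '        y[i] = a * x[i] + y[i];' \
        '}' \
        > "$workdir/saxpy.c"

    # -fopt-info-vec makes GCC print a note for each loop it vectorises.
    # The notes go to stderr, so capture them in a file and display it.
    gcc -std=c99 -O3 -fopt-info-vec -c "$workdir/saxpy.c" \
        -o "$workdir/saxpy.o" 2> "$workdir/vec_report.txt"
    cat "$workdir/vec_report.txt"

The Intel compilers offer a comparable report through ``-qopt-report``;
consult the manual page for the form supported by your version.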

To promote integer and real variables from four-byte to eight-byte precision
in Fortran codes, the following compiler flags can be used:

Intel
    ``-real-size 64 -integer-size 64 -xAVX``
    (Sometimes the Intel compiler incorrectly generates AVX2
    instructions if the ``-real-size 64`` or ``-r8`` options are set.
    Using the ``-xAVX`` option prevents this.)
GNU
    ``-freal-4-real-8 -finteger-4-integer-8``

GPU Direct (GDR)
----------------

One of the key technologies for getting the most performance out of the GPU
system is GDR, which allows GPUs to communicate via MPI without waiting for
the host CPU. It is implemented in both OpenMPI (the default) and MVAPICH2.

The functionality and performance of GDR can be tested using the OSU
micro-benchmarks suite. For example, using OpenMPI::

    module purge
    module load rhel7/default-gpu

    OSU_HOME=$HOME/osu-micro-benchmarks-5.4.2

    unset OMP_NUM_THREADS

    echo WITH GDR


    mpirun -np 2 --map-by ppr:1:node  \
      --mca mtl ^mxm --mca pml ^yalla --mca btl self,openib --mca btl_openib_want_cuda_gdr 1 \
      $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency -f D D

and using MVAPICH2::

    module purge
    module load rhel7/default-gpu
    module unload gcc-5.4.0-gcc-4.8.5-fis24gg openmpi-1.10.7-gcc-5.4.0-jdc7f4f
    module load mvapich2-GDR/gnu/2.3a_cuda-8.0

    OSU_HOME="${MPI_HOME}/libexec/osu-micro-benchmarks"

    unset OMP_NUM_THREADS

    export MV2_ENABLE_AFFINITY=1
    export MV2_USE_CUDA=1
    export MV2_USE_GPUDIRECT=1

    mpirun -np 2 -ppn 1 -genvall $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency D D

MVAPICH2 includes a copy of the benchmark suite in the distribution whereas
with OpenMPI a custom version has been downloaded and built.