Performance Tips
================

Compiler Information and Options
--------------------------------

The manual pages for the different compiler suites are available:

GCC
  Fortran: ``man gfortran``

  C/C++: ``man gcc``

Intel
  Fortran: ``man ifort``

  C/C++: ``man icc``

Useful compiler options
~~~~~~~~~~~~~~~~~~~~~~~

Whilst different codes will benefit from compiler optimisations in different
ways, for reasonable performance, at least initially, we suggest the following
compiler options:

Intel
  ``-O2``

GNU
  ``-O2 -ftree-vectorize -funroll-loops -ffast-math``

To target the specific hardware on CSD3 use the following options:

+-----------+----------------------+---------------------------+
| Partition | Intel                | GCC                       |
+===========+======================+===========================+
| cclake    | ``-xCASCADELAKE``    | ``-march=cascadelake``    |
+-----------+----------------------+---------------------------+
| icelake   | ``-xICELAKE-SERVER`` | ``-march=icelake-server`` |
+-----------+----------------------+---------------------------+
| sapphire  | ``-xSAPPHIRERAPIDS`` | ``-march=sapphirerapids`` |
+-----------+----------------------+---------------------------+
| ampere    |                      | ``-march=znver3``         |
+-----------+----------------------+---------------------------+

Alternatively, log in to a machine with the same architecture that you will be
running on and use:

Intel
  ``-xHost``

GNU
  ``-march=native``

Once you have an application that you are happy is working correctly and has
reasonable performance, you may wish to investigate some more aggressive
compiler optimisations.

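As a minimal sketch of how the partition table above might be used in a build
script, the helper below maps a partition name and compiler family to the
corresponding flag. The flag values are taken from the table; the function name
``arch_flag`` and its arguments are hypothetical and not part of any CSD3
tooling.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not CSD3 tooling): map a partition
# and compiler family to the architecture flag listed in the table above.
# Usage: arch_flag <partition> <intel|gnu>
arch_flag() {
    local partition="$1" family="$2"
    case "${family}:${partition}" in
        intel:cclake)   echo "-xCASCADELAKE" ;;
        intel:icelake)  echo "-xICELAKE-SERVER" ;;
        intel:sapphire) echo "-xSAPPHIRERAPIDS" ;;
        gnu:cclake)     echo "-march=cascadelake" ;;
        gnu:icelake)    echo "-march=icelake-server" ;;
        gnu:sapphire)   echo "-march=sapphirerapids" ;;
        gnu:ampere)     echo "-march=znver3" ;;
        *)
            # No entry in the table for this combination (e.g. intel:ampere).
            echo "arch_flag: no flag listed for ${family} on ${partition}" >&2
            return 1
            ;;
    esac
}
```

For example, ``gcc -O2 $(arch_flag icelake gnu) -c mycode.c`` would compile
``mycode.c`` (a placeholder file name) for the icelake partition.
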
Some further optimisations that you can try on your application are listed
below. Note that these optimisations may result in incorrect output for
programs that depend on an exact implementation of IEEE or ISO
rules/specifications for math functions:

Intel
  ``-fast``

GNU
  ``-Ofast -funroll-loops``

Vectorisation, which is an important compiler optimisation on any modern Intel
hardware, is enabled by default as follows:

Intel
  At ``-O2`` and above

GNU
  At ``-O3`` and above, or when using ``-ftree-vectorize``

To promote integer and real variables from four-byte to eight-byte precision in
Fortran codes, the following compiler flags can be used:

Intel
  ``-real-size 64 -integer-size 64 -xAVX``

  (Sometimes the Intel compiler incorrectly generates AVX2 instructions if the
  ``-real-size 64`` or ``-r8`` options are set. Using the ``-xAVX`` option
  prevents this.)

GNU
  ``-freal-4-real-8 -finteger-4-integer-8``

GPU Direct (GDR)
----------------

One of the key technologies for getting the most performance out of the GPU
system is GPU Direct (GDR). It allows GPUs to communicate via MPI without
waiting for the host CPU. It is implemented in both OpenMPI (the default) and
MVAPICH2.

The functionality and performance of GDR can be tested using the OSU
micro-benchmarks suite.

For example, using OpenMPI::

  module purge
  module load rhel7/default-gpu

  OSU_HOME=$HOME/osu-micro-benchmarks-5.4.2

  unset OMP_NUM_THREADS

  echo WITH GDR
  # One rank per node; exclude the mxm MTL and yalla PML, use the openib BTL
  # with CUDA GPU Direct enabled. "D D" places both buffers in GPU memory.
  mpirun -np 2 --map-by ppr:1:node \
    --mca mtl ^mxm --mca pml ^yalla --mca btl self,openib --mca btl_openib_want_cuda_gdr 1 \
    $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency -f D D

and using MVAPICH2::

  module purge
  module load rhel7/default-gpu
  module unload gcc-5.4.0-gcc-4.8.5-fis24gg openmpi-1.10.7-gcc-5.4.0-jdc7f4f
  module load mvapich2-GDR/gnu/2.3a_cuda-8.0

  OSU_HOME="${MPI_HOME}/libexec/osu-micro-benchmarks"

  unset OMP_NUM_THREADS
  export MV2_ENABLE_AFFINITY=1
  export MV2_USE_CUDA=1
  export MV2_USE_GPUDIRECT=1

  # One rank per node; "D D" places both buffers in GPU memory.
  mpirun -np 2 -ppn 1 -genvall $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency D D

MVAPICH2 includes a copy of the benchmark suite in its distribution, whereas
for OpenMPI a custom version has been downloaded and built.
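
Before running the OpenMPI benchmark above, it can be useful to confirm that
the loaded OpenMPI build is CUDA-aware. The sketch below assumes that the
``ompi_info`` utility shipped with OpenMPI is on the ``PATH`` and that it
reports the ``mpi_built_with_cuda_support`` parameter; the function name
``has_cuda_support`` is hypothetical. This only indicates that the library was
built with CUDA support, not that GDR is actually used at run time.

```shell
#!/bin/bash
# Hypothetical helper: read `ompi_info --parsable --all` output on stdin and
# succeed only if it reports mpi_built_with_cuda_support as true.
has_cuda_support() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Only query OpenMPI if ompi_info is actually available in this environment.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info --parsable --all | has_cuda_support; then
        echo "OpenMPI reports CUDA support"
    else
        echo "OpenMPI does not report CUDA support"
    fi
fi
```
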