CSD3 Upgrade October 2021

Important

  • Final retirement of Wilkes2-GPU (P100/Pascal) and Peta4-KNL will occur on 28th February 2022

Important

Hardware upgrades in October-November 2021

  • New Ice Lake CPU and A100/Ampere GPU hardware becomes generally available in October 2021.
  • At the same time, the older Skylake and P100/Pascal nodes begin to be phased out of service.
  • GPU costs were updated on 1st November 2021.
  • A100/Ampere became the primary GPU cluster on 10th November 2021. All paid GPU hours have been rescaled for A100 and should therefore no longer be used on P100. 50% of the P100 nodes will remain available for a period to be determined.

Key Dates

Monday 27th September 2021

  • One Ice Lake and one Ampere rack each become available for code testing (not production runs) by submitting jobs to the icelake and ampere partitions. No special requests for access to these will be necessary, but jobs requesting more than 4 hours submitted to these partitions will not start.
  • DiRAC Technical Commissioning jobs should continue to use their existing reserved resources. These resources are likely to change during the pre-launch period.

Friday 1st October 2021

  • DiRAC Technical Commissioning projects will lose their special reservations during the night of Friday 1st - Saturday 2nd October. All DiRAC projects may continue to submit jobs of no more than 4 hours in length to the icelake and ampere partitions, along with all other types of project, outside of any reservation. New arrangements for larger scale DiRAC Technical Commissioning jobs will be made for Monday 18th October.

Monday 18th October 2021

  • Ice Lake CPU racks become available for production runs.
  • The first generation of Skylake nodes (mostly) leave production service. The Cascade Lakes continue in service as now.
  • DiRAC-owned and a small number of himem Skylake nodes will be retained, but the bulk of CPU capacity moves to Cascade Lake and Ice Lake.
  • P100/Pascal GPU nodes remain available through October.

Wednesday 20th October 2021

  • Ampere GPU racks become available for production runs.

Monday 25th October 2021

  • DiRAC GPU allocations are moved from P100 to A100. Please note that P100/Pascal submissions will no longer work for DiRAC projects, since P100 DiRAC service has come to an end and all DiRAC GPU hours have been recalculated for A100. Projects which for technical or reproducibility reasons cannot use A100 should contact support@hpc.cam.ac.uk.

Wednesday 10th November 2021

  • 50% of P100/Pascal GPU nodes leave production service. The A100/Ampere GPU nodes become the primary GPU platform.
  • GPU hour credits in SL2 projects are rescaled to report A100 GPU hours instead of P100 GPU hours. These projects should therefore no longer be used on P100/Pascal at this point.
  • The graphical visualization nodes, plus a number of P100 nodes for legacy use, will be retained for a limited time to be determined. All GPU users should aim to use the superior A100/Ampere nodes wherever possible.

Migrating to Ice Lake CPU and Ampere GPU

CentOS7 replaced by CentOS8

The operating system used by the Ice Lake and Ampere nodes is CentOS8, whereas the Cascade Lake (cclake), Skylake (skylake) and P100/Pascal (pascal) nodes continue to run CentOS7. It is important that self-compiled applications and private Python/conda environments are rebuilt for use on CentOS8, and that old modules are not blindly carried over from job scripts running on CentOS7.
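One way to avoid accidentally picking up a CentOS7 build on a CentOS8 node is to have job scripts check which O/S they are running on. The following is a minimal sketch, assuming the standard /etc/os-release file is present on the nodes. The function names and the environment names myenv-centos7 and myenv-centos8 are hypothetical placeholders for your own rebuilt environments.

```shell
# Print the major O/S version (e.g. 7 or 8) read from an os-release file.
# The optional first argument is the file to read (default: /etc/os-release).
os_major_version() {
    # Run in a subshell so variables from os-release do not leak into the caller.
    (
        . "${1:-/etc/os-release}" || exit 1
        printf '%s\n' "${VERSION_ID%%.*}"
    )
}

# Map the O/S version to an environment name.
# The names below are hypothetical; substitute your own.
select_env() {
    case "$(os_major_version "$@")" in
        8) echo "myenv-centos8" ;;
        7) echo "myenv-centos7" ;;
        *) echo "unknown" ;;
    esac
}
```

A job script could then activate whichever environment select_env reports, rather than hard-coding one built for the older O/S.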

Important

New login nodes running CentOS8 are now available:

ssh userid@login-icelake.hpc.cam.ac.uk

The intention is ultimately to move all parts of the system to CentOS8 (or something equivalent). Initially, however, as with previous O/S upgrades, we will introduce the new O/S alongside the existing one to allow time for migrating applications.

Accessing Ice Lake and Ampere

The Ice Lake CPU and A100/Ampere GPU nodes are available through the Slurm partitions icelake and ampere respectively (there is also an icelake-himem partition).

CPU and GPU projects will permit the submission of jobs to these partitions in the usual way, and jobs will run when capacity becomes available.

During the initial testing/porting period in October, jobs longer than 4 hours will be prevented from running on the single racks of new hardware made available. These racks are modest in size, and their purpose is code testing and building, not the execution of production runs.
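As an illustration, a short test job for the icelake partition might look like the sketch below. The job name, the project name MYPROJECT-CPU and the resource requests are placeholders rather than values taken from this page. For the ampere partition a GPU request (for example Slurm's --gres=gpu:1) would also be needed; check the migration pages for the exact form expected on CSD3.

```shell
#!/bin/bash
# Hypothetical job and project names: replace with your own.
#SBATCH --job-name=porting-test
#SBATCH --account=MYPROJECT-CPU
# Use ampere instead of icelake for the A100 nodes.
#SBATCH --partition=icelake
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Request no more than 4 hours during the testing period.
#SBATCH --time=04:00:00

# Slurm sets SLURM_JOB_PARTITION inside a job; fall back to a marker otherwise.
partition="${SLURM_JOB_PARTITION:-not-in-slurm}"
echo "Running on $(uname -n) in partition ${partition}"
```

Saved as, say, test-job.sh, this would be submitted from a CentOS8 login node with sbatch test-job.sh.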

During September and October, no distinction in accounting will be made between a P100 GPU hour and an A100 GPU hour - the usual GPU projects will be debited by one unit whether that GPU hour is consumed on P100 or A100. The A100 GPUs are significantly more powerful than the P100s, but also more expensive. The following page from NVIDIA contains further information and performance comparisons (we have the top-end 80GB version):

Important

On 1st November, the cost per GPU hour will be adjusted to £0.55 to reflect the higher hardware cost, and existing GPU balances will be rescaled.

The cost per CPU hour will not change with the introduction of the Ice Lakes and will remain at £0.01.
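To estimate what a job would cost at these rates, multiply the number of GPUs (or CPUs) by the run time in hours and by the hourly rate. The helper below is only an illustrative sketch: job_cost is a hypothetical name, not a CSD3 tool, and it assumes usage is charged simply as units times hours times the rates quoted above, with a CPU hour meaning one CPU core for one hour.

```shell
# Rates in pounds per hour, taken from the figures quoted above.
GPU_RATE=0.55
CPU_RATE=0.01

# job_cost UNITS HOURS RATE
# Print the cost in pounds of UNITS (GPUs or CPUs) used for HOURS at RATE.
job_cost() {
    awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * h * r }'
}

# Example: 4 GPUs for 12 hours, and 32 CPUs for 4 hours.
job_cost 4 12 "$GPU_RATE"
job_cost 32 4 "$CPU_RATE"
```

The figures printed are estimates only; the accounting reported for your project remains the authoritative record of usage.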

For further details regarding migrating to Ice Lake or A100 please see the following pages:

Questions

If you have any questions about these developments, or encounter issues when using the new nodes, please contact us at support@hpc.cam.ac.uk.