CSD3 Planned Service Reductions October 2023

Important

  • HPC job scheduling resumed on 2nd November.
  • Planned upgrade work is complete.

Overview

As part of complex and essential physical changes and upgrades in the Research Computing data hall, there will be some days in October on which it will not be possible to run a full HPC service. Information about these periods and their impact will be made available on this page.

Current Status

Updated: 12:30 02/11/2023

  • Production service has resumed in the West Cambridge Data Centre following the major power incident on 18/10/2023 and the successful completion of previously scheduled installation and upgrade work.

Key Dates

Tuesday 3rd October 2023

This work was completed.

On Tuesday 3rd October, there will be an all-day maintenance period commencing at 08:00. During this period, large-scale load testing will take place in connection with the installation of new compute hardware and cooling infrastructure in the data centre.

All jobs will be requeued at commencement, and normal job scheduling will be suspended throughout the day.

We don’t currently expect to suspend login access or job submission, but user jobs will not start until maintenance is complete.

Several large service-related jobs will occupy the cluster and service should be considered at risk. There may be brief interruptions to home directory and command access, and the login-icelake nodes in particular may go down briefly (please log off your login node if you are requested to do so).

Monday 16th October - Friday 27th October 2023

This work was completed.

During this period, intensive testing of a significant addition to compute capacity in the data hall will require a reduction in the CPU capacity servicing user jobs, in order to keep within current power and cooling limitations.

This reduction:

  • Is not currently expected to affect GPU capacity.
  • Will primarily affect the older RedHat7 (cclake) compute nodes from Thursday 19th October. If you have not yet worked out how to run on RedHat8 icelake or sapphire, it will become increasingly advantageous to do so; please contact support if help is required.
  • Will also make Ice Lake nodes unavailable, a couple of racks at a time, for periods of several hours each during 16th-20th October, to allow replacement of the DLC (water cooling) units that are necessary for their operation.

Further detail about this phase will appear here later. Please note that, in the event of environmental issues during this period, it may be necessary to make further temporary reductions to service at short notice.

Questions

If you have any questions about these developments or have issues before, during or after these periods, please contact us at support@hpc.cam.ac.uk.