CSD3 Full Maintenance 13-14 June 2022

Important

  • Full maintenance began on schedule at 18:00 Monday 13th June. Login nodes and login-web are not currently available.
  • Update 12:00 Tuesday 14th June: the bulk of the data transfers have now been completed and we are verifying. Progress is according to plan.
  • Update 19:00 Tuesday 14th June: we have resumed job scheduling but have not yet restored login access. We plan to monitor the operation of the cluster overnight with a view to reopening normal access first thing Wednesday morning. Thank you for your continued patience.
  • Update 08:55 Wednesday 15th June: login access has been restored. The maintenance is now complete.

Important

  • CSD3 will be unavailable due to a planned, full maintenance beginning at 18:00 Monday 13th June 2022 and lasting throughout Tuesday 14th June.
  • This will affect everything attached to the CSD3 filesystems, including privately owned/special nodes.

All services on CSD3 (and access to all attached nodes) will be suspended during a period of full maintenance that is planned to begin at 18:00 on Monday 13th June 2022 and is expected to last all day on the following Tuesday 14th June. The purpose of this maintenance is to replace the Nexenta storage system providing the /home and /usr/local directories with a new Isilon storage system. This will entail reconfiguring and rebooting every attached node, which requires that there be no user activity on the system.

The up-to-date plan and status of the maintenance will be displayed on this page.

Key Points

Prior to 13th June

  • Initial copies of /home and /usr/local will be made on the new Isilon storage. This is the bulk of the copying and will take place without disturbing production operations.
  • Any issues (e.g. quota problems) present on the original filesystems will be corrected at this stage.

18:00 Monday 13th June 2022

  • Job scheduling will be suspended.
  • All running jobs will be requeued. This means that unless the job has intentionally been marked as non-requeuable, the job will be killed, then returned to the queue to run again (from the beginning) when job scheduling has been resumed. Jobs that cannot simply be killed and restarted should be marked non-requeuable with the --no-requeue sbatch option.
  • All login nodes will be rebooted and login access will be blocked to ensure no users remain connected.
  • Slurm will be stopped.
  • The entire cluster will then reboot with modified filesystem settings to clear any remaining references to, or activity on, the original /home and /usr/local filesystems.
  • During the night of 13-14th June, the final synchronisation of files between the old and new filesystems will be undertaken.
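
As an illustration of the requeue point above, a job can be marked non-requeuable inside its submission script. The following is only a minimal sketch: the job name, time limit and workload are hypothetical placeholders, and only the --no-requeue option comes from this notice.

```shell
#!/bin/bash
# Hypothetical job name and time limit, for illustration only.
#SBATCH --job-name=example-job
#SBATCH --time=01:00:00
# Ask Slurm not to return this job to the queue if it is killed.
#SBATCH --no-requeue

# Placeholder for the real workload (an assumption, not a real application).
job_status="started"
echo "Job ${job_status} on ${HOSTNAME:-unknown}"
```

The same option can instead be given on the command line at submission time, by passing --no-requeue to sbatch ahead of the script name.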

Tuesday 14th June 2022

  • The final synchronisation of data will be completed.
  • Cluster nodes will be reconfigured to accept the new filesystems as /home and /usr/local.
  • The new filesystems will be mounted on all nodes and the correct functionality confirmed.
  • Slurm will be restarted and tested.

Wednesday 15th June 2022

  • Normal production service will be resumed.

Questions

If you have any questions about these developments, or experience issues before or after the changes, please contact us at support@hpc.cam.ac.uk.