West Cambridge Data Centre Upgrade and Planned Disruption June 2025 - February 2026¶
Important
- This page describes the timeline and milestones of the West Cambridge Data Centre upgrade project, and the expected impacts on services run from the Research Computing Data Hall (CSD3, Dawn, SRCP, RFS, RDS, RCS, Arcus/IRIS/SKA/Gaia).
- Please note the multi-day full maintenance planned for 17th-19th September 2025.
Last updated: Wed Aug 20 19:28:18 BST 2025
Overview¶
The project to upgrade the power and cooling systems in the West Cambridge Data Centre (WCDC) is now in its implementation phase and is expected to require several disruptions to services run from it between May 2025 and its completion in late 2025 or early 2026. The purpose of the project is to provide a sustainable increase in electrical and cooling capacity and so allow the expansion of services. There have already been some unexpected service disruptions during the execution of the project, and more are possible as it progresses. Current information about the expected impacts on services and the status of the upgrade will be made available on this page.
Current Status¶
- CSD3: Available (reduced capacity and at-risk)
- Dawn: Available (at-risk)
- RDS/RCS: Available
- RFS: Available
- IRIS/Gaia Hypervisors: Available
- Windows SRCP: Available
- Linux SRCP: Available
- Arcus/Other IRIS hypervisors: Available
- [20/08/2025] (19:20) We are continuing to run tonight under increased (user plus artificial) load in order to prove the modified cooling system. We remain hopeful of a positive change in HPC service soon. The major incident is currently still open.
- [19/08/2025] (21:00) We are running this evening under an artificially increased load in order to prove the cooling system. If this is successful we may be in a position to increase production load.
- [15/08/2025] (17:15) Service continues to operate at 850 kW. More work was performed today and the engineers will return on Monday to perform further tests before we consider increasing the production load.
- [14/08/2025] (18:00) Service is still operating at 850 kW. We are awaiting a report today from the contractor, but we understand that one chiller circuit requires further work tomorrow. The service remains at-risk.
- [13/08/2025] (16:00) Service is now operating at 850 kW. Work will continue on one chiller circuit tomorrow, but we will aim to maintain this operating level. The service remains at-risk.
- [13/08/2025] (14:00) We have clearance to start to increase load so jobs will begin to start. Please note that service remains at-risk.
- [13/08/2025] (09:45) The engineers are still working on a high pressure trip issue on one chiller this morning. We are expecting a further update from them at 13:00 after which we will determine the next step.
- [12/08/2025] (15:00) The engineers have rectified some issues but are continuing to work on the chillers. We will not be able to increase load today but if possible we will start to reapply load tomorrow (Wednesday 13th).
- [12/08/2025] (11:00) The engineers are investigating an issue with a pressure valve and advise that this will take another 2-3 hours.
- [12/08/2025] (10:00) We are waiting for clearance from the engineers to begin applying load.
- [11/08/2025] (15:35) We are still waiting for the engineers to confirm that the temporary cooling can be trusted overnight. Therefore no new jobs will be started after 16:15 today until Tuesday morning, when load will be gradually restored under close monitoring. Users should place a hold on any jobs which cannot safely be killed and restarted (scontrol hold <jobid list>; see the example after this list), as any jobs which do start may have to be requeued if the cooling does fail.
- [11/08/2025] (09:15) CSD3 and Dawn nodes will run short jobs (no jobs with end times after 16:15 today will start for the moment).
- [08/08/2025] (16:00) Due to lack of confidence in the current state of the temporary chiller system, CSD3 and Dawn nodes will be powered off during the weekend from 16:00 Friday until Monday morning when the engineers will be onsite.
- [08/08/2025] (14:00) Approximately 50% of CSD3 compute nodes are currently running jobs; Dawn is operating normally. However, both services should be considered at-risk as the chiller failure remains under investigation.
- [07/08/2025] (18:50) Temperatures in DH1 are stable but no compute nodes are running. CSD3 and Dawn will remain in this state overnight until a reassessment of the chiller failure takes place on Friday morning.
- [07/08/2025] (15:30) Chiller failure massively reducing cooling capacity in DH1. Compute nodes are undergoing emergency shutdown.
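As a minimal sketch of the job-hold step mentioned in the 11/08 update above, the following commands (assuming standard Slurm behaviour of squeue and scontrol on the login nodes) hold all of the current user's pending jobs and release them again once normal service resumes; adapt the selection to your own job IDs as needed:

    # List your own pending jobs (job IDs only, no header)
    squeue -u "$USER" -t PENDING -h -o "%i"

    # Place a user hold on every pending job owned by the current user
    for jobid in $(squeue -u "$USER" -t PENDING -h -o "%i"); do
        scontrol hold "$jobid"
    done

    # Release the held jobs once normal service resumes
    # (held jobs remain in the PENDING state, so the same listing works)
    for jobid in $(squeue -u "$USER" -t PENDING -h -o "%i"); do
        scontrol release "$jobid"
    done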
Key Dates¶
July/August 2025¶
- Thursday August 7th 2025
- The first one-day outage of Sapphire Rapids (regular and HBM) nodes took place on 7th August.
Pipework replacement impacting liquid-cooled systems (Icelake, Sapphire Rapids, Dawn) is taking place row by row. This will allow these systems to be connected to the new cooling system and gradually free them from the current power constraints. Note that these liquid-cooled systems include SRCP, Arcus/IRIS/SKA/Gaia VMs, CSD3 Icelake, CSD3 Sapphire Rapids and UKAEA Sapphire Rapids HBM. We expect the impact to be as follows:
- A further outage of up to one day affecting VMs and Icelake login nodes, and two similar outages affecting Sapphire Rapids (regular and HBM) nodes.
- Other services will manage around this by changing which nodes are available.
September 2025¶
- Disruption to rear door cooling, row by row.
- We expect to be able to manage around this with minimal service impact.
- This work will increase the resilience of the cooling, removing some single points of failure as well as increasing the cooling capacity ready for when the power capacity is increased.
- Rescheduling of delayed power sequencing work, to prepare for the power upgrade to 1.8 MW in the new year.
- This has been rescheduled to September 18th.
- This will affect the entire data centre and another full shutdown will be required September 17th-19th.
January/February 2026¶
- Repeat of power switching exercise to enable installation of the new distribution board.
- We should still have cooling during this work, but the success or otherwise of the sequencing work is likely to determine the risk appetite for keeping services online.
- Migration of DH1 Rows C-F [1] to new power infrastructure including new UPS, Generators and Transformer.
- This will disrupt high power systems which are not resilient, which is likely to be manageable by changing which nodes are available.
- DH1 capacity increases to 1.8 MW.
February 2026¶
- Full commissioning.
- Disruption and risk to be determined.
Questions¶
If you have any questions about these developments or have issues before, during or after these periods, please contact us at support@hpc.cam.ac.uk.
Change Log¶
- [20/08/2025] (19:20) Status update - continuing to prove cooling.
- [19/08/2025] (21:00) Status update - artificial load.
- [15/08/2025] (17:15) Status update.
- [14/08/2025] (18:00) Update on chiller repair.
- [13/08/2025] (16:00) Update on partial service resumption.
- [13/08/2025] (14:00) Partial resumption of HPC service.
- [13/08/2025] (09:45) Update on chiller 2 high pressure fault.
- [12/08/2025] (15:00) Update on chiller repair.
- [12/08/2025] (11:00) Update on chiller repair.
- [12/08/2025] (10:00) Update on chiller repair.
- [11/08/2025] (15:35) Update on phased load increase on Tuesday.
- [11/08/2025] (09:15) Partial resumption of service (short jobs only).
- [08/08/2025] (15:55) Weekend suspension of service.
- [08/08/2025] Reduced capacity while cooling failure remains under investigation.
- [07/08/2025] (18:50) Chiller failure update - no jobs running overnight.
- [07/08/2025] Chiller failure.
- [05/08/2025] DLC pipework update.
- [29/07/2025] Transformer repair update.
- [25/07/2025] Full maintenance complete.
- [25/07/2025] Maintenance update.
- [24/07/2025] Maintenance update.
- [23/07/2025] Maintenance update.
- [22/07/2025] Maintenance update post network blackout.
- [21/07/2025] Maintenance start. Per-service status update.
- [18/07/2025] (17:04) IRIS/Gaia shutdown on July 20th clarified.
- [18/07/2025] Information added about July 21st-25th maintenance.
- [17/07/2025] Transformer repair work confirmed for July 29th.
- [11/07/2025] July 21st-25th rescheduled cooling and network maintenance confirmed. September 18th date for rescheduled power sequencing confirmed.
- [04/07/2025] Mark July 8-10 as cancelled.
- [27/06/2025] Update re July 8-10 and subsequent timeline.
- [24/06/2025] Update post June 24th events.
- [23/06/2025] Updated dates and details for work on July 8-10th.
- [17/06/2025] Warm weather update, transformer repair for 24th June added and July full maintenance update.
- [10/06/2025] Version string and change log added.
- [23/05/2025] Page created.
[1] This refers to the racks in rows C-F in data hall 1. These contain elements of CSD3, SRCP, Arcus and storage, so parts of these services may be affected during these phases of the work.