Cooling failure FRAM

Major incident High Performance Computing Fram
2023-04-12 11:20 CEST · 3 hours, 56 minutes

Updates

Resolved

After our first cooling problem, another issue appeared. This time the secondary cooling system was shut down due to a failed power supply. As this cooling loop is shared with some more critical infrastructure, we decided to take down our entire system.

The problems identified have been resolved, and we are in contact about improving this system so that we can manually override it if anything like this happens again.

As the system was taken offline completely, we were unfortunately not able to keep the state of all running jobs. I can see that quite a few jobs need to be run again. If you have any problems or further questions, please contact our support at [email protected]

April 12, 2023 · 15:16 CEST
Update

All issues with local cooling at FRAM are fixed. There is an external problem with the inlet water from the campus site. Technicians are working on it, and we hope to be back in production within hours.
We will update as soon as FRAM is ready for production.

April 12, 2023 · 13:04 CEST
Issue

One of the central cooling units on FRAM is failing and causing FRAM to overheat. We are working on fixing it and will update as soon as we know more.
Fram will be unavailable until further notice.
Sorry for the inconvenience this is causing.

April 12, 2023 · 11:22 CEST
