Cooling issues on Fram

Major incident High Performance Computing Fram
2023-04-12 09:17 CEST · 14 minutes



What happened?
Work was done on a UPS connected to our cooling system. This was not expected to affect operations, but it caused us to lose 50% of the cooling for Fram. As a countermeasure, we suspended all running jobs and stopped new jobs from starting. This limited the heat generated by our systems so that it would not overwhelm the cooling we had left. This lasted about an hour.

Status now:
Our cooling is back online. We have resumed all suspended jobs and allowed new jobs to start. Most jobs should not be affected by having been temporarily suspended, but problems can occur. Please check that your jobs are running correctly. If a job is not, you will need to requeue it yourself. If you have any questions about this, please contact support at: [email protected]

Sorry for any inconvenience this might have caused.

April 12, 2023 · 09:31 CEST

One of the cooling units on Fram has shut down and we have had to suspend all jobs while working on bringing the machine back to a stable state.

April 12, 2023 · 09:20 CEST
