We plan to physically inspect the cooling system together with the vendors on 12 July 2022. During the inspection, the entire Fram cluster will be switched off and unavailable. Stabilizing the cooling and bringing the system back into production will take at least one full day. The date may change.
Update 11 July 2022 10:40: This maintenance is postponed to at least next week due to transport challenges.
It appears that the cooling on Fram has failed. The result is that many compute nodes are unavailable. We are investigating.
Update 05.07.2022: We are still experiencing the same failure and are investigating the issue together with the vendors. The capacity of Fram might drop suddenly.
Unfortunately, the cooling issues persist. We are investigating and contacting service personnel to check the main cooling distribution units (CDUs).
Sorry for the inconvenience this is causing.
Update 21.06 13:30: The cooling issues are resolved and Fram is open for users.
Dear Fram cluster users:
We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.
Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back into production.
Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.
Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in the maintenance state, while 197 nodes are in the down state. We will monitor the power consumption in the machine room and release more nodes accordingly.
Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.