Fram compute nodes down
Minor incident
High Performance Computing
Fram
2024-12-09 22:42 CET
· 1 week, 2 days, 15 hours, 32 minutes
Updates
Resolved
December 19, 2024 · 14:13 CET
All nodes are back in production and the cooling issue has been resolved.
-Infra Team
De-escalate
December 12, 2024 · 10:56 CET
We are de-escalating the incident as most of the system is up. We are keeping the system capacity at 60-70% until the vendor has completed its service, which we expect to happen tomorrow.
Update
December 10, 2024 · 13:58 CET
The cooling system went down last night. This triggered an automatic power-off of the compute nodes.
The cooling system is running again, and we have restarted the compute nodes. However, we have reduced the load by 30-40% while awaiting service from the vendor. This can lead to longer wait times in the job queue.
Issue
December 10, 2024 · 08:54 CET
Most compute nodes went down at around 22:42 yesterday. We are investigating the cause.