Fram compute nodes down

Minor incident · High Performance Computing · Fram
2024-12-09 22:42 CET · 1 week, 2 days, 15 hours, 32 minutes

Updates

Resolved

All nodes are back in production and the cooling issue has been resolved.

-Infra Team

December 19, 2024 · 14:13 CET
De-escalate

We are de-escalating the incident, as most of the system is up. System capacity will stay at 60-70% until the vendor has completed its service, which we expect to happen tomorrow.

December 12, 2024 · 10:56 CET
Update

The cooling system went down last night, which triggered an automatic power-off of the compute nodes.
The cooling system is running again, and we have restarted the compute nodes. However, we have reduced the load by 30-40% while awaiting service from the vendor. This can lead to longer wait times in the job queue.

December 10, 2024 · 13:58 CET
Issue

Most compute nodes went down at around 22:42 yesterday. We are investigating the cause.

December 10, 2024 · 08:54 CET
