Cooling leak on Fram + malfunction in cooling system

Major incident High Performance Computing Fram
2024-03-12 23:17 CET · 1 week, 2 days, 9 hours, 47 minutes

Updates

Resolved

The issue has been solved and the system is back in production.

March 22, 2024 · 08:57 CET
Monitoring

We have fixed the leak and the system is back up. It is running at 60% capacity over the weekend while we monitor the cooling solution and pumps. We are working with Uniq, the service provider for the cooling solution, and will increase to full load capacity as soon as possible.

Sorry for the inconvenience this has caused.

Best Regards,
Infra team.

March 15, 2024 · 11:13 CET
Update

The cooling is still not at full capacity.

We have put about half of the nodes back into production and resumed the jobs that were suspended on them, which was about 2/3 of the suspended jobs. We have had to requeue the remaining 1/3; they will start again as resources become available.

March 13, 2024 · 16:49 CET
Issue

We had a leak in the internal water cooling loop on Fram. Because of this, we had to shut down parts of the system completely.

In connection with this, one of the pumps has malfunctioned, resulting in reduced cooling. To minimize the damage to the system and to save as many of the ongoing jobs as possible, we have suspended all running jobs and stopped all new jobs from queueing. Tomorrow we expect to get a better overview of the damage and of whether the jobs that had already started are recoverable. From previous experience, some of the jobs will fail, but some will be able to continue once the system is back up.

Sorry for the inconvenience. We will update you as soon as we know more.

NRIS team

March 12, 2024 · 23:17 CET
