File system trouble on Fram

Major incident High Performance Computing Fram
2023-05-08 12:45 CEST · 1 day, 22 hours, 40 minutes

Updates

Resolved

The filesystem has been up and running for some time now. An underlying issue remains, but we cannot resolve it without taking the system down completely.

We are therefore declaring this incident resolved, since the filesystem is operating correctly now. The identified issue will be addressed the next time we require downtime on the system.

May 10, 2023 · 11:24 CEST
Update

We have identified the problem, and it seems that it will take some time to get this working. One of the IO-servers is down, and we are working on getting it up and running. We aim to resolve this as quickly as possible, hopefully by tomorrow morning.

Status now:
We have some major problems with one of the IO-servers on Fram. /cluster seems to be back, but we do not know if it will be stable.
All jobs that were running when the filesystem went down were suspended, in the hope that they could be resumed once the IO-servers were up again. We have now resumed all jobs, but some will have failed due to IO errors before we were able to suspend them.
New jobs are again allowed to be queued.

We apologize for the inconvenience this causes.

May 8, 2023 · 16:46 CEST
Update

We have suspended all jobs and stopped new jobs from being queued, as they would fail as soon as they tried to access the filesystem.

Until the filesystem is back up and running, it will also not be possible to log onto Fram.

May 8, 2023 · 12:51 CEST
Issue

When one of the UPSs on Fram was shut down for maintenance, the filesystem on Fram went down, even though the controllers are meant to be able to run on power from just one UPS.

We are currently investigating and trying to get this up and running as fast as possible. It is currently not known what problems this will cause for running jobs.

We will post updates as soon as we have a clearer picture of what happened.

May 8, 2023 · 12:45 CEST
