Trouble with the filesystem on Betzy

We are diagnosing reports of filesystem errors on Betzy. The subsequent diagnostics might make the filesystem unstable for a couple of hours. We hope users will not notice.

Update 11.08 @ 13:55:
We are running fsck and rebuilding some internal metadata. This process can degrade the performance of filesystem operations in submitted jobs. Due to the nature of these checks, the work is expected to last until tomorrow.

Update 12.08 @ 10:45:
Work on the filesystem is still ongoing, and disruptions may still occur during the day.

Update 15.08 @ 10:55:

The filesystem issue is resolved and the Betzy filesystem is working. If you experience any problems, please contact support.

Fram maintenance 27th July

The postponed Fram cooling maintenance will take place from Wednesday 27th at 07:00 until Thursday 28th at 20:00. Depending on the amount of work and diagnostics needed, it might be shorter or longer.

The maintenance is intended to diagnose and mitigate the recent cooling issues and related crashes we have seen on Fram. Afterwards, we hope to have a better view of the situation and an action plan for making the whole cluster available for production. Currently we are at 80% capacity and will continue at that level to keep the system stable.

Fram availability at 50%

Dear Fram user. We are in the process of diagnosing and fixing several issues with the Fram supercomputer. We have identified at least two hardware-related issues: problems with the cooling equipment and a faulty CPU in a non-redundant server (the queueing server). We have mitigated the issues temporarily and are slowly bringing the system back to a stable configuration.

Fram will run at 50% capacity until Tuesday 10:00, then at approximately 75% capacity until Wednesday 10:00, when we will attempt to run at 100%.

A maintenance window is still needed to fix both the faulty CPU and the issues with cooling capacity. We will update these pages when we know more about the maintenance.

UPDATE: Fram availability is now at 80%

Fram cooling system maintenance 12.07.22

We plan to physically investigate the cooling system issues together with the vendors on 12 July 2022. This means that the entire Fram cluster will be switched off and unavailable. It will take at least one full day to stabilize the cooling and bring the system back into production. The date might change.

Update 11 July 2022 10:40: This maintenance is postponed until next week at the earliest because of transport challenges.

Fram: Slurm crashed

The Slurm controller on Fram has crashed; we are investigating.

Update 08.07.2022 14:00: The Fram workload manager node (Slurm) crashed again and all running jobs have died.

A hardware issue has been discovered; we are in contact with the vendor and are running tests. We will keep you updated.

Access to the login nodes will remain open until the planned Fram downtime.