Fram maintenance 27th July

The postponed Fram cooling maintenance will take place from Wednesday 27th July at 07:00 until Thursday 28th July at 20:00. Depending on the amount of work and diagnostics needed, it might be shorter or longer.

The maintenance will be conducted to diagnose and mitigate recent cooling issues and related crashes we have seen on Fram. We hope to have a better view of the situation and an action plan for making the whole cluster available for production. Currently we are at 80% capacity and will continue at that level to keep the system stable.

Fram availability at 50%

Dear Fram users. We are in the process of diagnosing and fixing several issues with the Fram supercomputer. We have identified at least two hardware-related issues, involving the cooling equipment and a faulty CPU in a non-redundant server (the queueing server). We have mitigated the issues temporarily and are slowly bringing the system back to a stable configuration.

Fram will run at 50% capacity until Tuesday 10:00, then at approximately 75% capacity until Wednesday 10:00, when we will attempt to run at 100%.

A maintenance window is still needed to fix both the faulty CPU and the issues with cooling capacity. We will update these pages when we know more about the maintenance.

UPDATE: Fram availability is now at 80%

Fram cooling system maintenance 12.07.22

We plan to physically investigate the cooling system issues together with the vendors on 12 July 2022. This means that the entire Fram cluster will be switched off and unavailable. It will take at least one full day to stabilize the cooling and bring the system back into production. The date may change.

Update 11 July 2022 10:40: This maintenance is postponed until next week at the earliest due to transport challenges.

Fram: Slurm crashed

The Slurm controller on Fram has crashed; we are investigating.

Update 08.07.2022 14:00: The Fram workload manager node (Slurm) crashed again and all running jobs have died.

A hardware issue has been discovered. We are in contact with the vendor and are running tests. We will keep you updated.

Access to the login nodes will remain open until the planned Fram downtime.

Betzy filesystem issue

Dear Betzy cluster users:

We are experiencing I/O problems on the Betzy filesystem. Overall filesystem usage is around 73%, but some filesystem servers are more than 90% full. This causes slow I/O, inconsistent performance, and Lustre disconnects. The following directories are affected: /cluster/shared, /cluster/projects, and /cluster/work. To keep usage down and improve I/O performance and stability on Betzy, we ask all users to remove unneeded data from those directories.
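
To help with the cleanup requested above, here is a minimal Python sketch for finding large files that have not been modified for a while. It is not an official tool. The function name find_cleanup_candidates and the defaults (90 days, top 20) are our illustrative assumptions and not a site policy. The script only reports files and never deletes anything.

```python
import heapq
import os
import stat
import time


def find_cleanup_candidates(root, min_age_days=90, top_n=20):
    """Return up to top_n (size_bytes, path) pairs, largest first.

    Only regular files under root whose last modification is older than
    min_age_days are considered. Both defaults are illustrative
    assumptions, not a site policy.
    """
    cutoff = time.time() - min_age_days * 86400
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links are skipped below.
                st = os.lstat(path)
            except OSError:
                # The file vanished or is unreadable; ignore it.
                continue
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mtime < cutoff:
                candidates.append((st.st_size, path))
    return heapq.nlargest(top_n, candidates)
```

Call the function with root set to your own area under one of the affected directories and review the returned list before deleting anything by hand. A modification time says nothing about whether a file is still needed, so treat the output as a list of candidates only.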

We are working on moving data between different disk pools, which will hopefully fix the I/O issues. The challenge is that the data being moved must be guaranteed unused (no open files) during the process. We are looking into doing this while in production. If this is not possible, we will have to call an emergency maintenance stop.

We will keep you updated.

Update 04.07.2022 12:30: We have managed to balance filesystem usage over the last two weeks, so the I/O problems should be resolved. Please contact us if you encounter any problems with I/O on Betzy.