The postponed Fram cooling maintenance will take place from Wednesday the 27th at 07:00 until Thursday the 28th at 20:00; depending on the amount of work and diagnostics needed, it might end earlier or run longer.
The maintenance will be conducted to diagnose and mitigate recent cooling issues and related crashes we have seen on Fram. We hope to have a better view of the situation and an action plan for making the whole cluster available for production. Currently we are at 80% capacity and will continue at that level to keep the system stable.
One of the core InfiniBand switches on Betzy shut down. This caused connectivity issues for the queueing system, preventing jobs from being scheduled and run. We are still investigating to find the exact cause.
Dear Fram users, we are in the process of diagnosing and fixing several issues with the Fram supercomputer. We have identified at least two hardware-related issues: problems with the cooling equipment and a faulty CPU in a non-redundant server (the queueing server). We have mitigated the issues temporarily and are slowly bringing the system back into a stable configuration.
Fram will run at 50% capacity until Tuesday 10:00, then at about 75% capacity until Wednesday 10:00, when we will attempt to run at 100%.
A maintenance window is still needed to fix both the faulty CPU and the issues with cooling capacity. We will update these pages when we know more about the maintenance.
UPDATE: Fram availability is now at 80%
There is a plan to physically investigate the cooling system issues with the vendors on 12 July 2022. This means the entire Fram cluster will be switched off and unavailable. It will take at least one full day to stabilize the cooling and bring the system back into production. The date might change.
Update 11 July 2022 10:40: This maintenance is postponed to at least next week due to transport challenges.
The Slurm controller on Fram has crashed; we are investigating.
Update 08.07.2022 14:00: The Fram workload manager node (Slurm) crashed again and all running jobs died.
A hardware issue has been discovered; we are in contact with the vendor and running tests. We will keep you updated.
Access to the login nodes will remain open until the planned Fram downtime.
It appears that the cooling on Fram has failed. The result is that many compute nodes are unavailable. We are investigating.
Update 05.07.2022: We are still experiencing the same failure and are investigating the issue together with the vendors. Fram's capacity might drop suddenly.
Dear Betzy cluster users:
We are experiencing I/O problems on the Betzy filesystem. The overall filesystem usage is around 73%, but some filesystem servers are more than 90% full. This causes I/O slowness, inconsistent performance, and Lustre disconnects. The following directories are affected: /cluster/shared, /cluster/projects, and /cluster/work. To keep usage down and improve I/O performance and stability on Betzy, we ask all users to remove unneeded data from those directories.
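As a rough sketch of how to find candidates for cleanup, the commands below summarize usage and list large files. The directory used here is only a placeholder; substitute your own project or work area (for example under /cluster/work or /cluster/projects).

```shell
# Placeholder directory for illustration; replace with your own area,
# e.g. /cluster/work/users/$USER on the cluster.
dir="$HOME"

# Summarize the total disk usage of the directory:
du -sh "$dir"

# List files larger than 100 MiB so you can review deletion candidates:
find "$dir" -type f -size +100M -exec ls -lh {} + 2>/dev/null

# Only after reviewing the list, remove files you no longer need, e.g.:
# rm "$dir/old_output.dat"
```

Review the output of `find` carefully before deleting anything; shared project directories may contain data that other group members still need.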
We are working on moving data between different disk pools, which should fix the I/O issues. The challenge is that the moved data must be guaranteed unused (no open files) during the process. We are looking into doing this while in production; if that is not possible, we will have to call for an emergency maintenance stop.
We will keep you updated.
Update 04.07.2022 12:30: We managed to balance the filesystem usage over the last two weeks, so the I/O problems should be resolved. Please contact us if you encounter any further I/O problems on Betzy.
Unfortunately, the cooling issues persist. We are investigating and have contacted service personnel to check the main cooling distribution units (CDUs).
Sorry for the inconvenience this is causing.
Update 21.06 13:30: The cooling issues are resolved and Fram is open for users.
The ADF license file has been updated to cover all login nodes on Saga, so the ADF GUI should now work on all login nodes.
Dear LUMI users,
As the system continues to grow, it is necessary to perform a maintenance break for extensive upgrades. The current plan is to start on Monday 6 June and continue for about four weeks.
Unfortunately, access to the system will not be possible during the downtime.
If you need any assistance, please do not hesitate to contact the LUMI User Support Team: https://lumi-supercomputer.eu/user-support/need-help/
Read the full service announcement from LUST (external)