Fram availability at 50%

Dear Fram users,

We are in the process of diagnosing and fixing several issues with the Fram supercomputer. We have identified at least two hardware-related problems: an issue with the cooling equipment and a faulty CPU in a non-redundant server (the queueing server). We have mitigated the issues temporarily and are slowly bringing the system back to a stable configuration.

Fram will run at 50% capacity until Tuesday 10:00, then at approximately 75% capacity until Wednesday 10:00, when we will attempt to run at 100%.

A maintenance window is still needed to replace the faulty CPU and fix the cooling capacity issues. We will update these pages when we know more about the maintenance.

UPDATE: Fram availability is now at 80%

Fram cooling system maintenance 12.07.22

We plan to physically investigate the cooling system issues together with the vendors on 12 July 2022. This means the entire Fram cluster will be switched off and unavailable. It will take at least one full day to stabilize the cooling and bring the system back into production. The date might change.

Update 11 July 2022 10:40: This maintenance is postponed until at least next week due to transport challenges.

Fram: slurm crashed

The Slurm controller on Fram has crashed; we are investigating.

Update 08.07.2022 14:00: The Fram workload manager node (Slurm) crashed again, and all running jobs died.

A hardware issue has been discovered; we are in contact with the vendor and running tests. We will keep you updated.

Access to the login nodes will remain open until the planned Fram downtime.

Betzy filesystem issue

Dear Betzy cluster users:

We are experiencing I/O problems on the Betzy filesystem. Overall filesystem usage is around 73%, but some filesystem servers are more than 90% full. This causes I/O slowness, inconsistent performance, and Lustre disconnects. The following directories are affected: /cluster/shared, /cluster/projects, and /cluster/work. To keep usage down and improve I/O performance and stability on Betzy, we ask all users to remove unneeded data from those directories.
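One practical way to find cleanup candidates is to list your largest subdirectories and your old, large files. A minimal sketch (the path, size, and age thresholds below are illustrative only; point it at your own project or work area):

```shell
#!/bin/sh
# Illustrative cleanup survey. Pass your own directory, e.g.
#   ./survey.sh /cluster/work/users/$USER
DIR="${1:-.}"   # defaults to the current directory

# Ten largest first-level entries, biggest first
du -sh "$DIR"/* 2>/dev/null | sort -rh | head -n 10

# Files over 1 GiB that have not been modified in 90 days
find "$DIR" -type f -size +1G -mtime +90 -ls 2>/dev/null
```

Note that running `du` over very large directory trees itself generates metadata I/O, so on a struggling filesystem it is kinder to survey one project directory at a time rather than scanning all of /cluster at once.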

We are working on moving data between the different disk pools, which should fix the I/O issues. The challenge is that the moved data must be guaranteed unused (no open files) during the process. We are looking into doing this while in production; if that is not possible, we will have to call for an emergency maintenance stop.

We will keep you updated.

Update 04.07.2022 12:30: We managed to balance the filesystem usage over the last two weeks, so the I/O problems should be resolved. Please contact us if you encounter any further I/O problems on Betzy.

LUMI maintenance break starting 6 June

Dear LUMI users,

As the system continues to grow, a maintenance break is necessary for extensive upgrades. The current plan is to start on Monday, 6 June, and continue for about 4 weeks.

Unfortunately, access to the system will not be possible during the downtime.

If you need any assistance, please do not hesitate to contact the LUMI User Support Team.

Read the full service announcement from LUST (external)

[DONE] Betzy downtime 13th June – 15th June

[Update, 2022-06-14] The maintenance stop was finished late last night.

**UPDATE** Downtime has started

The downtime starts 13th June at 08:00 and lasts until 15th June at 16:00.

A fault has been discovered in one of the switches in the main power board for Betzy compute nodes. Downtime is required to swap out the switch. We will take the opportunity to do further hardware and software maintenance and also implement and test an “emergency shutdown/reset” procedure for the whole of Betzy.

No services requiring access to any part of the system, including login nodes and storage services (NFS-exported directories or backup directories), will be available during the downtime. Some parts of the system (mainly storage and login) may have a shorter downtime than others.