Fram downtime 11th – 12th May 2022

[UPDATE, 2022-05-13 08:00] The maintenance stop is over. There may still be some file system issues. Please report them via the regular support channels.

[UPDATE, 2022-05-11 08:00] The maintenance stop has now started.

There will be maintenance of the storage system on the Fram supercomputer. The cluster will be unavailable from the 11th of May at 08:00 until the 12th of May at 20:00.

Fram downtime 23rd – 24th February 2022

[Update, 2022-02-24 22:30]: The maintenance is over and Fram is in production again. Thank you for your patience!

[Update, 2022-02-24 20:30]: The maintenance is taking a little longer than planned. We plan to get back into production at 22:00.

[Update, 2022-02-23 12:00]: The maintenance stop has now begun.

The Fram supercomputer will be unavailable due to maintenance on the cooling system from February 23rd at 12:00 until February 24th at 20:00.

If time allows, we will also upgrade all or parts of the storage system, including the file system clients on the compute nodes.

Downtime Betzy, 6th December 08:00 until 9th December 2021, 20:00

There will be a scheduled downtime for Betzy lasting three days, starting on Monday 6th December at 08:00 and lasting until Thursday 9th December at 20:00.

During the downtime we will conduct:

  • Full upgrade of the Lustre file system (both servers and clients)
  • Full upgrade of the InfiniBand firmware
  • Full upgrade of the Mellanox InfiniBand drivers
  • Minor updates to other parts of the system (Slurm, configuration, etc.)

Please be aware that this also affects the storage services recently moved from NIRD to Betzy.

We apologize for the inconvenience.

[Update, 2021-12-08 18:00]: The Betzy downtime is over, and the system is open for users. All planned updates were performed.

[Solved] Saga file system performance issue

We’re aware of ongoing issues with the file system performance on Saga and are investigating the cause. This also affects logging in to Saga, where the terminal will hang waiting for a prompt.

Updates will be provided in this post as soon as we have more information to share.

Sorry for the inconvenience.

Update 2021-07-15, 16:33: The issue was identified as a faulty connection between the storage server and the cluster. Performance should be back to normal, but we will monitor the system a bit more before declaring it healthy.
Update 2021-07-14, 15:00: We’ve discovered some faulty drives that are currently being swapped. We hope that the performance will improve once these are in production again.
Update 2021-07-13, 10:03: The file system is a bit more stable now, but we’re still looking into the cause for the degraded performance.