We are currently experiencing issues with the filesystem on Betzy.
The problem mainly affects the /cluster/work area.
We are trying to fix this as quickly as possible.
[UPDATE: 17:00] The filesystem problem persists. Login nodes are now offline. We have contacted the vendor.
[UPDATE: 19:30] Login nodes are back online.
Some users may find that their access to project directories on the HPC systems has been revoked.
We are aware of this and know the root cause. We are currently working on restoring access for the affected projects.
We are sorry about the inconvenience.
One of the disk controllers for the Betzy file system has an issue that needs to be rectified. This will reduce file system performance for a couple of hours today, starting at 13:00.
One of the core InfiniBand switches on Betzy shut down. This caused connectivity issues for the queueing system and prevented jobs from being scheduled or run. We are still investigating to find the exact cause.
[Update, 2022-06-13 12:30]: The maintenance is now finished.
Sikt (our network provider) will do maintenance of their network switches connecting to Saga and Betzy today (2022-06-13). The maintenance should not be noticeable by users.
[Update, 2022-04-30 11:10] The Fram and Saga maintenance is now over, and jobs are running again.
[Update, 2022-04-29 08:00] The Fram and Saga maintenance has now started.
[Update, 2022-04-28 12:56] The Betzy maintenance is now over, and jobs are starting again.
[Update, 2022-04-28 08:00] The Betzy maintenance has now started.
Unfortunately, there will be maintenance stops on all NRIS clusters next week for an important security update. The stops are scheduled as follows:
- Betzy: Thursday, April 28, at 08:00
- Fram and Saga: Friday, April 29, at 08:00
We expect the stops will last a couple of hours. We have set up maintenance reservations on all nodes on the clusters, so jobs that would have run into the reservation will be left pending in the job queue until after the maintenance stop.
We are sorry for the inconvenience this creates. We had hoped to be able to apply the security update with jobs running, but that turned out not to be possible.
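The reservation behaviour described above can be sketched roughly as follows. This is only an illustration, assuming the scheduler starts a job only if its requested time limit ends before the maintenance reservation begins. The function name and the submission time are hypothetical, and the year is taken from the dated updates.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now, time_limit, maintenance_start):
    """Return True if a job started at `now` would end before the reservation begins."""
    return now + time_limit <= maintenance_start


# Betzy stop from the announcement: Thursday, April 28, at 08:00.
betzy_stop = datetime(2022, 4, 28, 8, 0)
now = datetime(2022, 4, 27, 20, 0)  # hypothetical time at which the job could start

# A 6-hour job ends at 02:00 and fits; a 24-hour job would overlap the stop.
short_job = can_start_before_maintenance(now, timedelta(hours=6), betzy_stop)
long_job = can_start_before_maintenance(now, timedelta(hours=24), betzy_stop)
print(short_job, long_job)
```

Under this assumption, requesting a shorter time limit is the only way for a job to run before the stop. Otherwise it stays pending until the reservation ends.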
The queue system configuration of the GPU nodes on Betzy had an error: the number of CPUs was set to 128 instead of 64. Most jobs were probably not affected by this, but some jobs may have received sub-optimal CPU pinning.
This has now been fixed, and the documentation has been updated. Users do not need to change their job scripts unless they requested more than 64 CPUs per node.
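As a quick sanity check for the case mentioned above, a per-node CPU request can be compared against the corrected count. This is a minimal sketch. The helper name and the sample requests are hypothetical, and only the value 64 comes from the notice.

```python
GPU_NODE_CPUS = 64  # corrected CPU count per Betzy GPU node, from the notice


def exceeds_gpu_node(cpus_per_node, limit=GPU_NODE_CPUS):
    """Return True if a per-node CPU request is larger than the node provides."""
    return cpus_per_node > limit


for request in (32, 64, 128):  # hypothetical per-node CPU requests
    status = "needs updating" if exceeds_gpu_node(request) else "ok"
    print(f"{request} CPUs per node: {status}")
```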
We have identified that the NIRD mount is unavailable on Saga and Betzy, and we are working on finding the cause and putting a fix in place.
[Update, 2022-03-28 13:20] Mounts should be back now. The problem was caused by Friday’s maintenance on network gear …
We hope this has not caused too much frustration, and we wish everyone a very nice day!
NRIS HPC staff
There will be a short maintenance stop of Saga and Betzy on Thursday, February 3, at 15:00 CET, due to work on the cooling system in the data hall. The downtime is planned to last three hours.
During the downtime, no jobs will run, but the login nodes and the /cluster file system will be up. Jobs that cannot finish before 15:00 on February 3 will be left pending in the queue until after the stop.
We are currently performing various hardware maintenance tasks on Betzy, including reseating InfiniBand cables. This may cause instability and crashed jobs in parts of the system that are not directly connected to the cable being reseated.
We apologize for any inconvenience and lost jobs.