The queue system configuration of the GPU nodes on Betzy had an error: The number of CPUs was set to 128 instead of 64. Most jobs would probably not be affected by this, but it is possible that some jobs got sub-optimal CPU pinning.
This has now been fixed, and the documentation has been updated. There is nothing users have to do with their job scripts (unless they asked for more than 64 CPUs per node).
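For users who want to check whether a job script is affected, the rule above can be sketched as a small Python check. This is only an illustration: the helper names are hypothetical, it only reads `#SBATCH --ntasks-per-node=N` and `#SBATCH --cpus-per-task=N` lines written in that exact form, and it assumes a value of 1 for any option that is missing.

```python
import re

GPU_NODE_CPUS = 64  # corrected CPU count per Betzy GPU node, taken from the notice above


def cpus_requested_per_node(script_text):
    """Estimate CPUs per node from '#SBATCH --opt=value' lines (hypothetical helper)."""
    values = {"ntasks-per-node": 1, "cpus-per-task": 1}  # assumed defaults
    pattern = re.compile(r"^#SBATCH\s+--(ntasks-per-node|cpus-per-task)=(\d+)\s*$")
    for line in script_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2))
    return values["ntasks-per-node"] * values["cpus-per-task"]


def exceeds_gpu_node_limit(script_text, limit=GPU_NODE_CPUS):
    """True if the script asks for more CPUs per node than the limit."""
    return cpus_requested_per_node(script_text) > limit


example = """#!/bin/bash
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
"""
print(exceeds_gpu_node_limit(example))  # 4 * 32 = 128, which is above 64
```

A script flagged by this check would need its per-node CPU request reduced to 64 or fewer; scripts that use other ways of requesting CPUs are not covered by this sketch.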
We have identified that the NIRD mount is unavailable on Saga and Betzy and are working on finding the cause and putting a fix in place.
28-03-2022-13:20 – Mounts should be back now; the problem was caused by Friday’s maintenance on network gear …
We hope that the above has not caused too much frustration, and we wish everyone a very nice day!
NRIS HPC staff
There will be a short maintenance stop of Saga and Betzy on Thursday, February 3 at 15:00 CET, due to work on the cooling system in the data hall. The downtime is planned to last for three hours.
During the downtime, no jobs will run, but the login nodes and the /cluster file system will be up. Jobs that cannot finish before 15:00 on February 3 will be left pending in the queue until after the stop.
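The rule for pending jobs can be illustrated with a short Python sketch. It is not the scheduler's actual logic, only the arithmetic implied by the notice: a job is held if its start time plus its requested time limit runs past the stop. The function name is hypothetical, the year 2022 is an assumption (the notice gives only the day and month), and treating a job that ends exactly at 15:00 as acceptable is also an assumption.

```python
from datetime import datetime, timedelta

# Start of the stop in local time (CET); the year 2022 is an assumption.
MAINTENANCE_START = datetime(2022, 2, 3, 15, 0)


def can_finish_before_stop(start, time_limit, stop=MAINTENANCE_START):
    """True if a job starting at `start` with time limit `time_limit` ends by `stop`."""
    return start + time_limit <= stop


# A job starting at 09:00 with a 4-hour limit ends at 13:00.
print(can_finish_before_stop(datetime(2022, 2, 3, 9, 0), timedelta(hours=4)))
# A job starting at 09:00 with an 8-hour limit would end at 17:00.
print(can_finish_before_stop(datetime(2022, 2, 3, 9, 0), timedelta(hours=8)))
```

In practice this means that requesting a shorter time limit may allow a job to start before the stop instead of waiting in the queue.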
We are currently conducting various hardware maintenance on Betzy, including reseating InfiniBand cables. This may cause instability and crashed jobs in other parts of the system not directly connected to the cable being reseated.
We apologize for any inconvenience and lost jobs.
[UPDATE, 2021-12-15 15:00] Betzy is back in production again.
[UPDATE, 2021-12-15 09:00] The downtime has now started.
There will be a short downtime for Betzy next Wednesday 15th from 09:00 until 15:00 to fix remaining hardware issues.
The downtime has started and will continue until the evening of Wednesday 8th December, or until the upgrades are done.
We are currently experiencing issues with the login nodes on Betzy, and we are working on solving them.
Update: The nodes should now be working as normal.
There will be a scheduled downtime for Betzy lasting three days starting on Monday 6th December at 08:00. Downtime will last until Thursday 9th, 20:00.
During the downtime we will conduct:
- Full upgrade of the Lustre filesystem (both servers and clients)
- Full upgrade of the InfiniBand firmware
- Full upgrade of the Mellanox InfiniBand drivers
- Minor updates to other parts of the system (Slurm, configs, etc.)
Please be aware that this also affects the storage services recently moved from NIRD to Betzy.
We apologize for the inconvenience.
Update 08.12.2021 18:00: The Betzy downtime is over, and the system is open for users. All planned updates have been performed.
There is currently an issue on Betzy with the batch system which results in jobs not completing and new jobs not being started.
We are currently investigating the issue and will update once we know what caused it and how it can be resolved.
[Update 14:22]: Job submission is working again. The users who experienced this were unfortunately victims of a batch system restart that happened at the same time as their jobs were submitted.
Our vendor will perform maintenance on Betzy on Tuesday 19th October, starting at 10:00, to fix various DNS issues and internal database issues related to GPU nodes.
The maintenance should not affect jobs, and you can use Betzy as normal during the work. There might be short periods where hostname lookups will not function properly.