There will be a scheduled downtime for Betzy lasting three days starting on Monday 6th December at 08:00. Downtime will last until Thursday 9th, 08:00.
During the downtime we will conduct:
- Full upgrade of the Lustre filesystem (both servers and clients)
- Full upgrade of the InfiniBand firmware
- Full upgrade of the Mellanox InfiniBand drivers
- Minor updates to other parts of the system (Slurm, configs, etc.)
Please be aware that this also affects the storage services recently moved from NIRD to Betzy.
We apologize for the inconvenience.
There is currently an issue on Betzy with the batch system which results in jobs not completing and new jobs not being started.
We are currently investigating the issue and will update once we know what caused it and how it can be resolved.
[Update 14:22]: Job submission is working again. The users experiencing this were unfortunately victims of a batch system restart which happened at the same time as the job was submitted.
Our vendor will perform maintenance on Betzy on Tuesday 19th October, starting at 10:00, to address various DNS issues and internal database issues related to GPU nodes.
The maintenance should not affect jobs, and you can use Betzy as normal during the work. There might be short periods where hostname lookups will not function properly.
We have halted the scheduling of new jobs on Betzy due to a pending config change. Running jobs will continue, and scheduling of new jobs will resume once the config change is applied.
We are having some DNS problems with Betzy. This could result in software license servers not resolving, or in issues downloading data from the internet. The issue is being investigated.
We are very sorry for any inconvenience this might cause to you.
We are currently experiencing degraded performance on the Betzy file system. We are investigating together with the vendor to find the cause of this issue.
[2021-06-23 14:26] The issue is now solved and the jobs have started to run. Please report any further issues to firstname.lastname@example.org
[2021-06-23 09:20] We are again experiencing problems on Betzy. We will update here when we’ve solved the issue.
[2021-06-22 11:15] The problem has been located and fixed, and Betzy should work as normal again.
[2021-06-22 09:30] We are currently experiencing network problems on Betzy. We don’t know the full extent of it, but it is at least affecting the queue system, so all Slurm-related commands are hanging.
We are investigating, and will update when we know more.
[UPDATE, 2021-06-08 08:00] Betzy is now up and in production again.
[UPDATE] Unfortunately, the downtime is taking longer than anticipated, and will not be finished tonight. We plan on getting Betzy up again at around 08:00 tomorrow morning.
Campusservice at NTNU will conduct maintenance on the High Voltage circuits for Non-redundant power on 7th of June 2021, between 15:00 and 20:00. All compute nodes and login nodes will be shut down during this time, and no jobs will be running during this period. Submitted jobs estimated to run into the downtime reservation will be held in queue.
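The rule that jobs estimated to run into the downtime are held can be illustrated with a small check. This is only a rough sketch, not how Slurm implements it: the function name `runs_into_downtime` and the example start times are hypothetical, and the only value taken from the announcement is the start of the maintenance window (7th of June 2021, 15:00).

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the announcement above.
DOWNTIME_START = datetime(2021, 6, 7, 15, 0)


def runs_into_downtime(earliest_start: datetime,
                       time_limit: timedelta,
                       downtime_start: datetime = DOWNTIME_START) -> bool:
    """Return True if a job starting at earliest_start could still be
    running when the downtime begins, based on its requested time limit.

    Hypothetical helper for illustration; the scheduler's own logic may differ.
    """
    return earliest_start + time_limit > downtime_start


# Example: a job that could start at 09:00 on the day of the maintenance.
start = datetime(2021, 6, 7, 9, 0)
print(runs_into_downtime(start, timedelta(hours=4)))   # ends 13:00, before the downtime
print(runs_into_downtime(start, timedelta(hours=12)))  # would end 21:00, inside the downtime
```

Under this assumption, requesting a shorter time limit is what lets a job finish before the downtime begins instead of waiting in the queue.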
We have identified a hardware issue with login-2.betzy.sigma2.com
This means there is reduced capacity on the Betzy login nodes until further notice.
This will probably not be fixed until after Easter.
There is currently an issue with LDAP on Betzy, which means that logins will be rejected.
We’ve identified the cause and are working on resolving the problem.
This post will be updated when we have new information to share.
Sorry about the inconvenience!
Users that have logged in earlier can keep trying to log in, as it should eventually work.
Newly created user accounts unfortunately might not be able to log in before this issue is resolved.
Update 26.03, 12:15 – The problem has been solved now. It should now be possible to log in and run jobs as normal on Betzy.
Update 25.03, 13:45 – The vendor is working on the LDAP issue right now, so regular logins might be disrupted.
Update 19.03, 13:39 – We’re still looking into this with the vendor, who has escalated the issue. It has been identified that this also affects newly created user accounts on the system, which might not be able to log in at all.
Update 17.03, 16:25 – Unfortunately the issue still exists. We have contacted the vendor to find a solution as soon as possible.
Update 17.03, 12:20 – No resolution on this just yet, though we have identified a potential cause for the problem and are working on getting a fix implemented.
Update 17.03, 09:51 – We’re seeing an increase in failed logins, though it appears to be a little inconsistent. If you’re experiencing this, trying again should work in most cases. We are investigating the cause of these issues.
Update 16.03, 10:26 – The problem is now solved and we’ll monitor the fix throughout the day.