Betzy: Network problems

[2021-06-23 14:26] The issue is now solved and jobs have started to run. Please report any further issues to support@metacenter.no.

[2021-06-23 09:20] We are again experiencing problems on Betzy. We will update here when we’ve solved the issue.

[2021-06-22 11:15] The problem has been located and fixed, and Betzy should work as normal again.

[2021-06-22 09:30] We are currently experiencing network problems on Betzy. We don’t know the full extent of it, but it is at least affecting the queue system, so all Slurm-related commands are hanging.

We are investigating, and will update when we know more.

[SOLVED] Betzy Downtime 7th June 15:00-20:00

[UPDATE, 2021-06-08 08:00] Betzy is now up and in production again.

[UPDATE] Unfortunately, the downtime is taking longer than anticipated, and will not be finished tonight. We plan on getting Betzy up again at around 08:00 tomorrow morning.

Campusservice at NTNU will conduct maintenance on the high-voltage circuits for non-redundant power on 7 June 2021, between 15:00 and 20:00. All compute nodes and login nodes will be shut down, and no jobs will run during this period. Submitted jobs estimated to run into the downtime reservation will be held in the queue.

[SOLVED] Problem with logins on Betzy

There is currently an issue with LDAP on Betzy, which means that logins will be rejected.

We’ve identified the cause and are working on resolving the problem.
This post will be updated when we have new information to share.

Sorry about the inconvenience!

Users that have logged in earlier can keep trying to log in, as it should eventually work.
Newly created user accounts unfortunately might not be able to log in before this issue is resolved.

Update 26.03, 12:15 – The problem has now been solved. It should be possible to log in and run jobs as normal on Betzy.

Update 25.03, 13:45 – The vendor is working on the LDAP issue right now; regular logins might be disrupted.

Update 19.03, 13:39 – We’re still looking into this with the vendor, who has escalated the issue. We have found that this also affects newly created user accounts, which might not be able to log in at all.
Update 17.03, 16:25 – Unfortunately the issue still exists. We have contacted the vendor to find a solution as soon as possible.
Update 17.03, 12:20 – No resolution on this just yet, though we have identified a potential cause for the problem and are working on getting a fix implemented.
Update 17.03, 09:51 – We’re seeing an increase in failed logins, though it appears to be a little inconsistent. If you’re experiencing this, trying again should work in most cases. We are investigating the cause of these issues.
Update 16.03, 10:26 – The problem is now solved and we’ll monitor the fix throughout the day.

Betzy: Ongoing problems with srun and InfiniBand

Betzy is experiencing issues with srun and InfiniBand that occasionally affect users. The srun problem first appeared last week; the InfiniBand problem has been present longer.

Symptoms of the srun problem are error messages related to send/recv operations. If you see such error messages, you are probably affected by the problem. A workaround is to rerun the srun job until it succeeds.
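
The retry workaround could be automated along the following lines. This is only a minimal sketch, not an official Betzy tool: the function name run_with_retries, the attempt limit and the delay are illustrative assumptions, and the srun arguments in the usage comment are placeholders.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, delay_seconds=30, runner=subprocess.run):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Returns the number of attempts used on success, or None if every
    attempt failed. All names and defaults here are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Pause before retrying so the queue system is not hammered.
            time.sleep(delay_seconds)
    return None


# Example (inside a job script; "./my_program" is a placeholder):
# run_with_retries(["srun", "./my_program"])
```

Note that this reruns the whole step on any non-zero exit status, so it is only appropriate for programs that are safe to restart from the beginning.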

Symptoms of the InfiniBand problem are messages of the type "Transport retry count exceeded". At the moment there is no workaround for this problem.
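
To check whether a finished job hit the InfiniBand problem, its output file can be searched for the message quoted above. The sketch below assumes the message appears in the job output with exactly that wording and capitalisation; the function names and the file name in the usage comment are hypothetical.

```python
from pathlib import Path

# Symptom text as quoted in this notice; real log lines may contain more context.
SYMPTOM = "Transport retry count exceeded"


def find_symptom_lines(text, symptom=SYMPTOM):
    """Return (line_number, line) pairs for lines containing symptom."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if symptom in line
    ]


def scan_file(path):
    """Scan one job output file; undecodable bytes are replaced."""
    return find_symptom_lines(Path(path).read_text(errors="replace"))


# Example ("slurm-12345.out" is a placeholder file name):
# for number, line in scan_file("slurm-12345.out"):
#     print(number, line)
```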

The expert team is working with the hardware vendors to solve the problems. We are very sorry for the inconvenience, but can assure you that the team is working hard on this.

Update 15:00 05.02: Regarding the srun problem, we have eliminated the issue, and srun should work as expected. Please report to us if you see any srun-related problems. The InfiniBand problem is most probably due to bad nodes. We are still working on the issue, and we are seeing fewer InfiniBand-related problems now that we have isolated the bad nodes.

Update 14:00 19.02: Issues with InfiniBand have been resolved as well. If there are any issues, please report.

Betzy pre-production

Dear HPC User,

We are pleased to announce that Betzy opens for pre-production on Friday 20 November.

Since this is close to the weekend, Betzy will be opened stepwise: first to prior pilot projects, and then for general access on Tuesday 24 November.

It has been a long journey, but we are happy to see good performance and stability on the system.

Please note that during the coming days, changes will be made to the queue system setup, which could require cancelling running jobs.

Finally, support will be offered only from 24 November.

Thank you for your patience and we wish you happy computing!

Best regards,

Lorand Szentannai, on behalf of the preparations team

Updated information about Betzy production

Dear HPC User,

As mentioned last week, the validation benchmarks have been stable, and we were ready to run and evaluate the site acceptance test. Unfortunately, the interconnect stability issues have recurred.

We and the vendor have been running extensive tests since then. The interconnect vendor's R&D department released new firmware yesterday afternoon; it was applied yesterday evening, and stress tests started immediately. To be sure that the problem is resolved, several days of testing are needed.

Therefore, we have to postpone production yet again, by one week. The current production estimate is the end of week 47.

We can assure you that we are very eager to have the system 100% stabilized and in production, and that everybody involved in the project (be it from Sigma2, the Metacenter, or the vendor) is working intensively on this.

Thank you for your understanding!

Best regards,

Lorand Szentannai, on behalf of the preparations team

Information regarding Betzy production

Dear HPC User,

Our previous estimate of production on Betzy has proved to be somewhat optimistic. 

With the help of the vendor, we believe we have identified and fixed the cause of the interconnect stability problem on Betzy. The most recent validation benchmarks have been stable, and we will begin the site acceptance test (SAT) by Friday, 6 November. If the machine passes the SAT, it will be handed over to operations and opened for production.

The final preparations usually take 1-3 days. We therefore estimate that production will begin on Betzy during next week, week 46.

Best regards,

Lorand Szentannai, on behalf of the preparations team