[SOLVED] Problem with logins on Betzy

There is currently an issue with LDAP on Betzy, which means that logins will be rejected.

We’ve identified the cause and are working on resolving the problem.
This post will be updated when we have new information to share.

Sorry about the inconvenience!

Users who have logged in before can keep trying to log in, as it should eventually work.
Unfortunately, newly created user accounts might not be able to log in until this issue is resolved.

Update 26.03, 12:15 – The problem has now been solved. It should be possible to log in and run jobs as normal on Betzy.

Update 25.03, 13:45 – The vendor is working on the LDAP issue right now; regular logins might be disrupted.

Update 19.03, 13:39 – We’re still looking into this with the vendor, who has escalated the issue. We have found that this also affects newly created user accounts on the system, which might not be able to log in at all.

Update 17.03, 16:25 – Unfortunately the issue still exists. We have contacted the vendor to find a solution as soon as possible.

Update 17.03, 12:20 – No resolution on this just yet, though we have identified a potential cause for the problem and are working on getting a fix implemented.

Update 17.03, 09:51 – We’re seeing an increase in failed logins, though it appears to be a little inconsistent. If you’re experiencing this, trying again should work in most cases. We are investigating the cause of these issues.

Update 16.03, 10:26 – The problem is now solved and we’ll monitor the fix throughout the day.

Saga maintenance finished

The Saga maintenance is finished and you are welcome to start computing again. We have added a considerable amount of storage. In the following weeks we will reorganize the placement of some data to make use of the new disks. We were not able to add all of the expansion because of some hardware errors, but the remainder will be added later.

[login-1.SAGA ~]# df -h /cluster
Filesystem    Size  Used  Avail  Use%  Mounted on
beegfs_nodev  5,3P  743T  4,5P   14%   /cluster

SAGA – Email service from jobs disabled temporarily.

On Thursday, 2021-02-11, a user submitted an array job with an email address specified. The system then sent him an email each time one of the many hundreds of jobs started and again when it finished. His email server only allowed him to receive 500 emails per day, so after he had reached that limit an “Undelivered Mail” message was sent to us at support, twice for each of his hundreds of jobs. Each of these “Undelivered Mail” emails created a new case in our support system. The user did, in principle, nothing wrong, but as a result our support system was completely shut down. As a temporary fix, we have disabled the email service until we find a permanent solution for this problem.
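To illustrate the arithmetic behind the incident, here is a rough sketch. The task count of 800 is hypothetical (the post only says "many hundreds"), and the helper name mail_overflow is made up for this example. The two emails per task and the limit of 500 per day come from the description above.

```shell
# Hypothetical helper: how many notification emails exceed a daily inbox limit.
# Arguments: number of array tasks, emails sent per task, daily limit.
mail_overflow() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$3" ]; then
        echo $(( total - $3 ))
    else
        echo 0
    fi
}

# 800 tasks is an assumed figure; 2 emails per task (start and finish)
# and a limit of 500 per day are taken from the incident description.
mail_overflow 800 2 500
```

With two emails per task, a 500-per-day limit is reached after only 250 tasks, so any larger array job would start producing bounces.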

Saga Maintenance Downtime 18th-19th February 2021

Dear Saga user, we need to schedule a short downtime of Saga from Thursday 18th February 2021 at 08:00 until Friday 19th February 2021 at 16:00. If the maintenance finishes early, we will open up the machine sooner.

During the maintenance, we will continue the work we started in early December, connect the new storage expansion to the system, and make it available for general use. This will give us several extra petabytes for project storage and other usage.

We apologize for the inconvenience.

Betzy: Ongoing problems with srun and InfiniBand

Betzy is experiencing issues with srun and InfiniBand that intermittently affect users. The srun problem first appeared last week; the InfiniBand problem has been present for longer.

Symptoms of the srun problem are error messages related to send/recv operations. If you see such error messages, you are probably affected by the problem. A workaround is to keep rerunning the srun job until it succeeds.
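The retry workaround can be scripted. The following is a minimal sketch, not an official recommendation: the retry helper, the attempt count, the delay and the program name ./my_program are all assumptions made for illustration.

```shell
# Hypothetical helper: run a command until it succeeds, giving up after
# a maximum number of attempts.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        # Run the command; stop as soon as it exits successfully.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example use inside a job script (./my_program is a placeholder):
# retry 5 30 srun ./my_program
```

Capping the number of attempts keeps a job from looping until its time limit if the failure turns out to have a different cause.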

Symptoms of the InfiniBand problem are messages of the type "Transport retry count exceeded". At the moment there is no workaround for this problem.

The expert team is working with the hardware vendors to solve the problems. We are very sorry for the inconvenience and can assure you that the team is working hard on this.

Update 15:00 05.02: We have eliminated the srun issue, and srun should work as expected. Please report to us if you see any problem related to srun. The InfiniBand problem is most probably due to bad nodes. We are still working on the issue, and we are seeing fewer InfiniBand-related problems now that we have isolated the bad nodes.

Update 14:00 19.02: Issues with InfiniBand have been resolved as well. Please report any remaining issues to us.

[Updated] Fram Downtime on Wednesday 20th January 2021 from 12:00-15:00

Update: The file system servers have now been fixed, and we are back online again. Thank you for your patience.

We have an ongoing performance issue with the Fram file system. We need to shut down the file servers to fix it, and therefore need three hours of downtime:

Wednesday 20th January between 12:00 and 15:00, Fram will be unavailable.