Betzy file system issue [ONGOING]

We are experiencing some issues with the Betzy file system. Please check back here for updates.

UPDATE 04-08-2021: The problem is still not solved, and the performance and accessibility of the file system vary considerably. Some jobs are also crashing. We are in contact with the vendors about this.

UPDATE 2 02-08-2021: Login is very slow or hanging again. Other activities may be affected as well.

UPDATE 02-08-2021: Login to all frontends was working smoothly this morning. We are still watching the file system in case the problem recurs.

UPDATE 28-07-2021: We are still investigating the issue(s). The system might be unstable until the beginning of next week.

We are very sorry for the inconvenience this may cause you.

Possible SAGA downtime 2021-04-28

UPDATE, 2021-04-27: As expected and hoped, there was no need for the downtime, so we have now removed the maintenance reservation. Jobs will start as normal. The expansion is progressing fine, and we hope to be able to put the new nodes into production this week.

Saga will be expanded with new compute nodes in week 17 (April 26 to 30). According to the vendor, the expansion should not require a stop, but if it turns out that we have to stop Saga, the stop will be on Wednesday, April 28, at 08:00.

We have set up a maintenance reservation covering all of Saga, starting 2021-04-28 at 08:00, so that we will not have to cancel running jobs if we need to stop Saga. This means that submitted jobs whose end time falls after the start of the reservation will be held back. The reservation will be removed as soon as we know that there will be no stop.
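To check whether a job would be held back by the reservation, the rule can be sketched as below. This is a minimal illustration, assuming the scheduler holds any job whose earliest possible start plus requested time limit extends past the start of the reservation. The function name and the sample times are hypothetical and not taken from the scheduler itself.

```python
from datetime import datetime, timedelta

# Start of the maintenance reservation, as stated in the announcement.
RESERVATION_START = datetime(2021, 4, 28, 8, 0)


def is_held_back(earliest_start: datetime,
                 time_limit: timedelta,
                 reservation_start: datetime = RESERVATION_START) -> bool:
    """Return True if the job could still be running when the reservation begins.

    Assumption: the scheduler compares the job's latest possible end time
    (earliest start + requested time limit) against the reservation start.
    """
    return earliest_start + time_limit > reservation_start


# Hypothetical submission time, for illustration only.
now = datetime(2021, 4, 27, 12, 0)

print(is_held_back(now, timedelta(hours=12)))  # ends 2021-04-28 00:00 -> False
print(is_held_back(now, timedelta(hours=24)))  # ends 2021-04-28 12:00 -> True
```

Under this assumption, requesting a shorter time limit is the only way for a job to start before the reservation begins.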

SAGA – Email service from jobs disabled temporarily.

On Thursday, 2021-02-11, a user submitted an array job with an email address specified. The system then sent the user an email each time one of the many hundreds of jobs started and again when it finished. The user's email server accepted only 500 emails per day, so once that limit was reached, an “Undelivered Mail” message was sent to us at support, twice for each of the hundreds of jobs. Each of these “Undelivered Mail” emails created a new case in our support system. The user did, in principle, nothing wrong, but as a result our support system was completely shut down. As a temporary fix, we have disabled the email service until we find a permanent solution to this problem.
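The arithmetic behind the flood can be sketched as follows. This is only an illustration: it assumes all notifications are sent within a single day and that each one beyond the recipient's daily limit bounces back once. The function name and the task count of 400 are hypothetical. Only the limit of 500 emails per day and the two notifications per job come from the account above.

```python
def bounced_emails(n_tasks: int,
                   daily_limit: int = 500,
                   emails_per_task: int = 2) -> int:
    """Estimate how many notifications exceed the recipient's daily limit.

    Assumptions: every array task sends one email at start and one at end,
    all on the same day, and each email beyond the limit bounces once.
    """
    total = n_tasks * emails_per_task
    return max(0, total - daily_limit)


# Hypothetical array size: 400 tasks send 800 notifications in total.
print(bounced_emails(400))  # 300 emails over the limit
print(bounced_emails(200))  # 400 notifications stay under the limit -> 0
```

Since each bounce created a new support case, even a moderately sized array job is enough to generate hundreds of cases under these assumptions.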

Fram file system issues

11:30 15-09-2020 [Update 7]: Quick heads-up: We are trying to put one of the storage servers back into production. This could cause short hangs for some users and jobs. If you are in doubt about the behaviour of your jobs, please do not hesitate to contact us at support@metacenter.no.

14:30 14-09-2020 [Update 6]: Most compute nodes are now running the old Lustre client. As far as the most recent issues are concerned, it should therefore be safe to submit jobs. Unfortunately, this also means that the «hung io-wait issue» may happen again. Please contact us at support@metacenter.no if you continue to have file system issues.

12:15 14-09-2020 [Update 5]: We found the reason for the behaviour many users have reported (problems with the module system, crashes, etc.). The new file system client appears to be the cause, so the only immediate “solution” is to go back to the old version of the client. This may cause other issues, but they are less severe than what we see now. We will announce here when it is safe to submit jobs.

10:30 14-09-2020 [Update 4]: Over the weekend, the Lustre client for the parallel file system was updated on the majority of compute nodes. However, users are still reporting issues, particularly when loading modules. It seems that the module system is not configured correctly on the updated nodes. We are working on a fix and will keep you up to date here.

Sorry for the inconvenience!

15:00 11-09-2020 [Update 3]: We are currently upgrading the Lustre file system clients to mitigate a «hung io-wait issue». We are also running at reduced performance, as one of the eight io-servers is down. Full production is expected from Monday morning. A short hang is expected when the io-server is phased back in. We expect the hung io-wait issue to go away over the next two weeks as clients are upgraded.

20:50 10-09-2020 [Update 2]: We are sorry to inform you that we are still having some issues. The vendor has been contacted.

13:15 10-09-2020 [Update 1]: The file system is partially back in operation. This means you may use Fram, but performance will be sub-optimal. Some jobs may be affected when we try to bring back an object storage later today.

08:15 10-09-2020: We are experiencing some issues with the Fram file system and are working on a fix. Sorry for the inconvenience.

Issues with job completion – FIXED

Update, 14:39 26-07-18: The issue with the Fram file system is now fixed and jobs should run as normal.

We are experiencing some problems at the moment, most likely caused by a file system issue. We are doing our best to bring the services back to normal, but as most of the experts are on holiday, this may take longer than usual. Please check back here for updates.