Betzy file system issue [ONGOING]

We are experiencing some issues with the Betzy file system. Please check back here for updates.

UPDATE 04-08-2021: The problem is still not solved, and the performance and accessibility of the file system vary a lot. Some jobs are also crashing. We are in contact with the vendors about this.

UPDATE 2 02-08-2021: Logins are very slow or hanging again. Other activities may be affected as well.

UPDATE 02-08-2021: Login to all frontends was working smoothly this morning. We are still watching the file system in case the problem reoccurs.

UPDATE 28-07-2021: We are still investigating the issue(s); the system might be unstable until the beginning of next week.

We are very sorry for the inconvenience this might cause you.

[Solved] Saga file system performance issue

We’re aware of ongoing issues with file system performance on Saga and are investigating the cause. This also affects logging in to Saga: the terminal will hang waiting for a prompt.

Updates will be provided in this post as soon as we have more information to share.

Sorry for the inconvenience.

Update 2021-07-15, 16:33: The issue was identified as a faulty connection between the storage server and the cluster. Performance should be back to normal, but we will monitor the system a bit more before declaring it healthy.
Update 2021-07-14, 15:00: We’ve discovered some faulty drives that are currently being swapped. We hope that the performance will improve once these are in production again.
Update 2021-07-13, 10:03: The file system is a bit more stable now, but we’re still looking into the cause for the degraded performance.

FRAM – controller maintenance

Good morning,

We are going to perform some routine maintenance on one of the file system controllers of FRAM. This should have no significant implications for production, but users might experience slightly degraded Lustre (file system) performance.

The operation is scheduled for today – 11 a.m. …

Update 8.07: There were also performance issues with the login nodes. Both this and the controller maintenance are now finished.

FRAM – Unexpected shutdown

We are experiencing some trouble with the FRAM machine. Yesterday morning (Sunday 04.07.2021), many compute nodes went down unexpectedly. We are investigating the issue.

Update 05.07.2021 – 10:54: The shutdown was caused by a power outage in the data center. We are bringing all nodes back up and monitoring their behavior.

Apologies for the inconvenience this may have caused! 

Service not activated in NIRD Service Platform

We regret to inform you that, due to a recent change made by Feide in response to the new national security directives in the sector, you might no longer be able to launch services on the NIRD Toolkit. From now on, your institution must approve any service that requires Feide login; if a service is not approved, you cannot access it with your Feide account. Unfortunately, this approval cannot be granted for services that are deployed dynamically and on demand, as in the NIRD Toolkit.

What can I do? 

If you are experiencing problems using the NIRD Toolkit, we advise you to email the Feide administrator at your institution with us in CC (sigma2@uninett.no).

If this takes time and you have an urgent need to use the NIRD Toolkit, there is a workaround (a little cumbersome, but only temporary) to mitigate the problem, described here:

Deploy a service through the NIRD Toolkit – Service not activated 

More information about the changes

You can read more about the changes Feide has made in this article on www.feide.no (in Norwegian).

We are currently working with Feide to resolve the issue. The solution should allow automatic approval of all services deployed through the NIRD Toolkit, provided the NIRD Toolkit service itself is approved. In the meantime, some organisations have already dealt with this problem by choosing the “Opt-in” option, thereby approving all Feide services. This is the temporary solution suggested by Feide, and we will contact your organization’s Feide administrator to inform them about this option.

Please note that Sigma2 was not notified of the changes, and therefore we could not inform you beforehand. Apologies for the inconvenience this may have caused!  

This post will be used to provide updates as we have more information available.


Betzy: Network problems

[2021-06-23 14:26] The issue is now solved, and jobs have started to run. Please report any further issues to support@metacenter.no.

[2021-06-23 09:20] We are again experiencing problems on Betzy. We will update here when we’ve solved the issue.

[2021-06-22 11:15] The problem has been located and fixed, and Betzy should work as normal again.

[2021-06-22 09:30] We are currently experiencing network problems on Betzy. We don’t know the full extent of it, but it is at least affecting the queue system, so all Slurm-related commands are hanging.

We are investigating, and will update when we know more.

[DONE] Saga Maintenance Stop 23–24 June

[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the output of the dusage command to show only one set of quotas (pool 1).

[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.

[2021-06-23 12:00] The maintenance has now started.

[UPDATE: The correct dates are June 23–24, not July]

There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.

During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.

All compute nodes and login nodes will be shut down during this time, and no jobs will run during this period. Submitted jobs whose requested walltime would run into the downtime reservation will be held in the queue.
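The job-holding behaviour described above follows from simple arithmetic: a job can only start before the maintenance if its requested walltime ends before the reservation begins; otherwise the scheduler holds it. A minimal sketch of that check (plain Python for illustration, not actual Slurm scheduler code; the function name and the example times are our own):

```python
from datetime import datetime, timedelta

def fits_before_downtime(now, walltime_hours, downtime_start):
    """Return True if a job started now, with the given requested
    walltime, would finish before the maintenance reservation begins.
    Jobs for which this is False stay queued until after the downtime."""
    return now + timedelta(hours=walltime_hours) <= downtime_start

# Maintenance reservation starts June 23 at 12:00 (from the announcement).
downtime = datetime(2021, 6, 23, 12, 0)
now = datetime(2021, 6, 22, 9, 0)  # hypothetical submission time

print(fits_before_downtime(now, 24, downtime))  # 27 h of slack -> True
print(fits_before_downtime(now, 48, downtime))  # would overlap -> False
```

In practice this means a job submitted shortly before the stop can still run if you request a short enough time limit; otherwise it simply waits until the maintenance is over.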

Slow file system on Saga

We’re experiencing very slow file system performance on Saga at the moment and are working on identifying the cause.

Update 13:09: The file system is much more responsive now, but logins are still hanging for ~30 seconds before getting access to the file system. This is being investigated further.

Updates will be provided once we have more information.

Sorry about the inconvenience.