[Finished] NIRD Service Platform Maintenance, 22-23 September

Update 2021-09-23: The maintenance is now finished on both sites. Services should be back in production.

Dear users,

We’ll have scheduled maintenance on the NIRD Service platform on 22 and 23 September in order to perform upgrades on the clusters.

In addition to project deployments running on the service platform, the following services are affected during the maintenance:

  • NIRD Toolkit
  • NIRD Archive
  • EasyDMP

The service platform consists of two sites, one in Tromsø and the other in Trondheim. This maintenance will be performed on one site at a time, planned as follows:

22 September: Tromsø
Services running on TOS-SP will be offline. NIRD will be accessible from login-trd.nird.sigma2.no.

23 September: Trondheim

Services running on TRD-SP will be offline. NIRD will be accessible from login-tos.nird.sigma2.no.

To check which site your project is running on, log in to the NIRD login nodes (ssh login.nird.sigma2.no) and run the following command:

readlink /projects/<project number>

Make sure to write the project number in all uppercase.
The command then outputs the full path to the volume, starting with either “trd” for Trondheim or “tos” for Tromsø.

Example:

[user@login0-nird-trd ~]$ readlink /projects/NS9999K
/tos-project3/NS9999K

The output indicates that this project has its primary site in Tromsø (tos-project).
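
If you check several projects, the lookup above can be wrapped in a small shell helper. This is only a sketch: the function names site_from_path and project_site are made up for illustration, and it assumes the volume path always begins with “tos” or “trd” (with or without a leading slash), as described above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any NIRD tooling.

# Map a volume path (as printed by readlink) to a site name.
# Assumes the path starts with "tos" or "trd", optionally after a leading slash.
site_from_path() {
    case "$1" in
        /tos*|tos*) echo "Tromsø" ;;
        /trd*|trd*) echo "Trondheim" ;;
        *)          echo "unknown"; return 1 ;;
    esac
}

# Resolve a project number to its primary site.
# The project number is converted to uppercase before the lookup.
project_site() {
    project=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    target=$(readlink "/projects/$project") || {
        echo "cannot resolve /projects/$project" >&2
        return 1
    }
    site_from_path "$target"
}
```

On a login node, running project_site ns9999k would look up /projects/NS9999K and print the site name based on the resolved path.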

If you have any questions, please do not hesitate to contact us.

[Resolved] login-1.fram crashed – VNC unavailable

One of the login nodes on Fram crashed unexpectedly this morning, causing some users to be disconnected from their sessions.

This also affects the VNC service on Fram. Any attempt to use this service will fail while we’re working on restoring the node.

Updates will be provided once we have more information to share.

Update 13:00 – The node, NIRD exports and VNC service are now back up and running and have been put back into production. Please let us know if you experience any issues.

We’re very sorry for any inconvenience this may have caused.

[Resolved] NIRD Archive unavailable

We regret to inform you that the NIRD Archive is currently unavailable due to issues with the deployment of the service. A possible cause has been identified and we’re working on resolving it to restore the service.

Updates will be provided as we have new information.

Update 19:47 – The archive is now back up and running.
Update 18:06 – The fix is now properly in place. We’re redeploying the archive web service next.
Update 16:33 – Unfortunately it takes longer than expected to apply the fix. Thank you for your patience!
Update 15:53 – We’re applying a fix right now and expect the archive to be available again shortly.

[Solved] Saga file system performance issue

We’re aware of ongoing issues with the file system performance on Saga and are investigating the cause. This also affects logging in to Saga, where the terminal will hang waiting for a prompt.

Updates will be provided in this post as soon as we have more information to share.

Sorry for the inconvenience.

Update 2021-07-15, 16:33: The issue was identified as a faulty connection between the storage server and the cluster. Performance should be back to normal, but we will monitor the system a bit more before declaring it healthy.
Update 2021-07-14, 15:00: We’ve discovered some faulty drives that are currently being swapped. We hope that the performance will improve once these are in production again.
Update 2021-07-13, 10:03: The file system is a bit more stable now, but we’re still looking into the cause for the degraded performance.

Service not activated in NIRD Service Platform

We regret to inform you that, due to a recent change made by Feide in response to the new national security directives in the sector, you might no longer be able to launch services on the NIRD Toolkit. The reason is that, from now on, your institution must approve any service that requires Feide login. If a service is not approved, you cannot access it with your Feide account. Unfortunately, this approval cannot be given when services are deployed dynamically and on demand, as in the NIRD Toolkit.

What can I do? 

If you are experiencing a problem with using the NIRD Toolkit, we advise you to email the Feide administrator at your institution with us in CC (sigma2@uninett.no).  

If this takes time and you have an urgent need to use the NIRD Toolkit, there is a workaround (a little cumbersome but only temporary) to mitigate the problem, described here: 

Deploy a service through the NIRD Toolkit – Service not activated 

More information about the changes 
 

You can read more about the changes Feide has made in this article on www.feide.no (in Norwegian).  

We are currently working with Feide to resolve the issue. The solution should allow automatic approval of all services deployed through the NIRD Toolkit if the NIRD Toolkit service itself is approved. In the meantime, some organisations have already dealt with this problem by choosing the “Opt-in” option, thereby approving all Feide services. This is the temporary solution suggested by Feide, and we will contact your organisation’s Feide administrator to inform them about this option.

Please note that Sigma2 was not notified of the changes, and therefore we could not inform you beforehand. Apologies for the inconvenience this may have caused!  

This post will be used to provide updates as we have more information available.


Slow file system on Saga

We’re experiencing a very slow file system on Saga at the moment and are working on identifying the cause.

Update 13:09: The file system is much more responsive now, but we’re still seeing that logins are hanging for ~30 seconds before getting access to the file system. This is being investigated further.

Updates will be provided once we have more information.

Sorry about the inconvenience.

Documentation pages unavailable

Our documentation is currently unavailable due to a major outage at an upstream provider that our solution relies on.

We’re sorry about the inconvenience. Please do not hesitate to contact support if you have any questions.

Update 12:58: The provider has implemented a fix and is currently monitoring the changes. Our documentation is available again.

[SOLVED] Problem with logins on Betzy

There is currently an issue with LDAP on Betzy, which means that logins will be rejected.

We’ve identified the cause and are working on resolving the problem.
This post will be updated when we have new information to share.

Sorry about the inconvenience!

Users that have logged in earlier can keep trying to log in, as it should eventually work.
Newly created user accounts unfortunately might not be able to log in before this issue is resolved.

Update 26.03, 12:15 – The problem has been solved now. It should now be possible to log in and run jobs as normal on Betzy.

Update 25.03, 13:45 – The vendor is working on the LDAP issue right now; regular logins might be disrupted.

Update 19.03, 13:39 – We’re still looking into this with the vendor, who has escalated the issue. It has been identified that this also affects newly created user accounts on the system, which might not be able to log in at all.
Update 17.03, 16:25 – Unfortunately the issue still exists. We have contacted the vendor to find a solution as soon as possible.
Update 17.03, 12:20 – No resolution on this just yet, though we have identified a potential cause for the problem and are working on getting a fix implemented.
Update 17.03, 09:51 – We’re seeing an increase in failed logins, though it appears to be a little inconsistent. If you’re experiencing this, trying again should work in most cases. We are investigating the cause of these issues.
Update 16.03, 10:26 – The problem is now solved and we’ll monitor the fix throughout the day.