[UPDATE] Saga: /cluster filesystem problem

Dear Saga cluster users,
We have discovered an issue with the /cluster filesystem on Saga that can lead to data corruption. To be able to examine the problem, we have decided to suspend all running jobs on Saga and reserve the entire cluster. No new jobs will be accepted until the problem is resolved. Users can still log in to the Saga login nodes.
We are sorry for any inconvenience this may cause.
We will keep you updated as we progress.

Update: We are trying to repair the file system without killing all jobs. This might not work, at least not for all jobs. In the meantime, we have closed access to the login nodes to avoid further damage to the file system.

Update 14:15: The problem is resolved and Saga is open again. Please check any jobs you had running, as some of them may have crashed.
The source of the problem is related to the underlying filesystem (XFS) and the kernel that we are currently running. We scanned the underlying filesystem on our OSS servers to eliminate any possible data corruption on the /cluster filesystem, and we also updated the kernel on the OSS servers.

Please don’t hesitate to contact us if you have any questions.

NIRD and NIRD Toolkit scheduled maintenance

Dear NIRD and NIRD Toolkit User,

We will have a three-day scheduled maintenance on NIRD and NIRD Toolkit, starting on the 20th of January at 09:00 AM.

During the maintenance we will:

  • carry out software and firmware updates,
  • change geo-locality for some of the projects,
  • replace synchronization mechanisms,
  • expand the storage and quotas, depending on part delivery times from the disk vendor.

Files stored on NIRD will be unavailable during the maintenance, and so will the services that depend on them. This will also affect the NIRD file systems available on Fram.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

Backup policy changes

Dear Fram and Saga User,

As you may remember, quotas were enforced on $HOME areas in October last year. This was carried out only for users with less than 20GiB in their Fram or Saga home folders.

Because of repeated space-related issues on the NIRD home folders, due to the backups taken from Fram and Saga, we have had to change our backup policy and exclude from backup those users who use more than 20GiB in their Fram or Saga $HOME areas.

If you manage to clean up your $HOME on Fram and/or Saga and decrease your $HOME usage below 20GiB, we can then enforce quotas on it and re-enable the backup.
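As a rough way to see where you stand against the 20GiB threshold, the sketch below adds up the sizes of regular files under a directory. It is only an illustration: the names `LIMIT_BYTES`, `directory_usage` and `is_under_limit` are made up for this example, and summing apparent file sizes is an assumption that may not match how usage is actually accounted on Fram or Saga.

```python
import os
import stat

# 20 GiB, the threshold named in this announcement
LIMIT_BYTES = 20 * 1024**3


def directory_usage(root):
    """Return the total apparent size, in bytes, of regular files under root."""
    total = 0
    # Unreadable directories are skipped silently rather than aborting the walk.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so link targets are not counted
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


def is_under_limit(usage_bytes, limit_bytes=LIMIT_BYTES):
    """True if the usage is strictly below the limit."""
    return usage_bytes < limit_bytes
```

Calling `directory_usage(os.path.expanduser("~"))` and passing the result to `is_under_limit` gives an approximate answer; on a large home folder the walk can take a while.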

Thank you for your understanding!

Metacenter Operations

Network outage

Update

  • 2020-01-13 14:54: The problems have now been sorted out and the network is functional again.
  • 2020-01-13 14:40: The problems are unfortunately back. Uninett’s network specialists are working on solving them as soon as possible.
  • 2020-01-13 14:22: The network is functional again. Apologies for the inconvenience this has caused.

We are currently experiencing a network outage on Saga and some parts of NIRD. The problem is under investigation.

Please check back here for an update on this matter.

Metacenter Operations

Saga login nodes will be rebooted tonight

To improve performance of the /cluster file system, we will reboot the Saga login nodes this evening. We apologize for the short notice, but expect the increased performance to make up for any inconvenience.

Jobs in the queue system will not be affected.

Fram power outage

Fram lost power on Saturday, January 4th 2020, at 21:00. The power company Tromskraft is working to get the power back.

Update 2020-01-04 23:30: The power is back and we will get the majority of the nodes up as soon as possible.