login-2 on Saga is down

login-2.saga.sigma2.no is currently down due to several faulty memory DIMMs. This also affects the use and functionality of the Slurm browser and desktop.saga.sigma2.no.

We hope to have the DIMMs replaced sometime during the week.

Maintenance on NIRD, NIRD Toolkit, Fram and Saga, 20th April - 24th April

Dear NIRD, NIRD Toolkit, Fram and Saga User,

We will have a four-day scheduled maintenance on NIRD, NIRD Toolkit, Fram and Saga starting on the 20th of April at 09:00.

During the maintenance we will:

  • carry out software and firmware updates on all systems

Files stored on NIRD will be unavailable during the maintenance, and therefore so will the services that depend on them. This will of course also affect the NIRD file systems mounted on Fram and Saga.

Login services to NIRD, NIRD Toolkit, Fram and Saga will be disabled during the maintenance.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

Saga – poor file system performance

The parallel file system on Saga is currently under a lot of stress caused by the running jobs.

We are working with the vendor on optimizing and speeding up the file system.
In the meantime, we kindly ask you to follow the guidelines listed on our documentation pages; a small job-script sketch illustrating them follows the list below.

As a general rule:

  • file system performance decreases as the file system usage grows
  • the number of I/O operations directly influences the responsiveness of the file system
  • disk operations are roughly a factor of a thousand more expensive than memory operations
  • the higher the number of files, the slower the I/O is
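As a minimal illustration of these points (not an official recipe; the account, file names and program below are placeholders), a job script can do its heavy I/O in a node-local temporary directory and touch the parallel /cluster file system only once at the start and once at the end:

    #!/bin/bash
    #SBATCH --account=nnXXXXk          # placeholder project account
    #SBATCH --time=01:00:00

    # Sketch: stage input to a node-local temporary directory (assumption:
    # $TMPDIR or /tmp points to node-local storage), compute there, and copy
    # the results back to the submit directory in a single operation.
    workdir=$(mktemp -d)
    cp "$SLURM_SUBMIT_DIR"/input.dat "$workdir"/
    cd "$workdir"
    ./my_program input.dat > output.dat    # placeholder program
    cp output.dat "$SLURM_SUBMIT_DIR"/
    rm -rf "$workdir"

This keeps the number of I/O operations against /cluster small, in line with the rules above.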

Thank you for your understanding!

Metacenter Operations

$USERWORK auto-cleanup on Saga

Dear Saga User,

The usage of the /cluster file system on Saga has now passed 60%. To keep the file system as responsive as possible, we have to periodically decrease the number of files, free up space and enforce automatic deletion of temporary files.

Starting on Wednesday, 19th of February, we will activate the automatic cleanup of the $USERWORK (/cluster/work) area, as documented here.

The retention period is:

  • 42 days below 70% file system usage
  • 21 days when file system usage reaches 70%.

Files older than the active retention period will be automatically deleted.
You can read more information about the storage areas on HPC clusters here and here.

Please copy all your important data from $USERWORK to your project area to avoid data loss.
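As a minimal sketch (the project path below is a placeholder; use your own project area), you can list files approaching the shorter retention period and copy what you want to keep:

    # Show files in $USERWORK older than 21 days, i.e. at risk under the
    # shorter retention period.
    find "$USERWORK" -type f -mtime +21

    # Copy a directory worth keeping to the project area (placeholder path).
    rsync -av "$USERWORK"/my_results/ /cluster/projects/nnXXXXk/my_results/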

Thank you for your understanding!

Metacenter Operations

NIRD project file systems mounted on Saga

Dear Saga User,

We are pleased to announce that all technical requirements are now in place and the NIRD project file systems have been mounted on the Saga login nodes.

You may find your projects in the

/nird/projects/nird

folder.

Please note that transferring a large number of files is sluggish and has a big impact on I/O performance. It is always better to transfer one large file than many small files.
As an example, transfer of a folder with 70k entries and about 872MB took 18 minutes, while the same files archived into a single 904MB file took 3 seconds.

You can read more about the tar archiving command in its manual pages. Type

man tar

in your Saga terminal.
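As an illustration (directory and project names below are placeholders), packing a directory into a single archive before copying it is typically much faster than copying the files one by one:

    # Create one compressed archive from a directory with many small files.
    tar -czf mydata.tar.gz mydata/

    # Copy the single archive to the NIRD project area (placeholder name).
    cp mydata.tar.gz /nird/projects/nird/NSxxxxK/

    # Unpack it again at the destination.
    cd /nird/projects/nird/NSxxxxK/ && tar -xzf mydata.tar.gz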

Metacenter Operations

[SOLVED] Saga: /cluster filesystem problem

The /cluster file system on Saga has crashed and we are working on it. Users should expect that their Slurm jobs may crash.

Update, 09:45: The file system is back online now. Only parts of /cluster were unavailable, but we recommend that you check your jobs; some of them have probably crashed.

UPDATED: Problem with Access to Projects

Quite a few users lost access to their project(s) on NIRD and all clusters during the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.

We will update this page when access has been restored.

Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at sigma@uninett.no.

Update: This applies to all systems, not only Fram and Saga.

[UPDATE] Saga: /cluster filesystem problem

Dear Saga cluster users:
We have discovered a /cluster file system issue on Saga which can lead to possible data corruption. To be able to examine the problem, we have decided to suspend all running jobs on Saga and reserve the entire cluster. No new jobs will be accepted until the problem is resolved. Users can still log in to the Saga login nodes.
We are sorry for any inconvenience this may have caused.
We will keep you updated as we progress.

Update: We are trying to repair the file system without killing all jobs. It might not work, at least not for all jobs. In the meantime, we have closed access to the login nodes to avoid further damage to the file system.

Update 14:15: Problem resolved; Saga is open again. Please check your jobs, as some of them may have crashed.
The source of the problem is related to the underlying file system (XFS) and the kernel version we are currently running. We scanned the underlying file system on our OSS servers to eliminate possible data corruption on the /cluster file system, and we also updated the kernel on the OSS servers.

Please don’t hesitate to contact us if you have any questions.

Network outage

Update

  • 2020-01-13 14:54: The problems have now been sorted out and the network is functional again.
  • 2020-01-13 14:40: The problems are unfortunately back again. Uninett’s network specialists are working on solving the problem as soon as possible.
  • 2020-01-13 14:22: The network is functional again. Apologies for the inconvenience this has caused.

We are currently experiencing a network outage on Saga and on some parts of NIRD. The problem is under investigation.

Please check back here for an update on this matter.

Metacenter Operations

Saga login nodes will be rebooted tonight

To improve performance of the /cluster file system, we will reboot the Saga login nodes this evening. We apologize for the short notice, but expect the increased performance to make up for any inconvenience.

Jobs in the queue system will not be affected.