Please clean up data on Saga!

Dear users on Saga,

Usage on Saga’s parallel file system (everything under /cluster) is currently at about 93%. Some of the file system servers are already not accepting new data. If usage increases further, the performance of the parallel file system may soon drop significantly; after that, some users may experience data loss, and finally the whole cluster may come to a complete halt.


Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all storage locations where you store data, that is, $HOME, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the line with $HOME in the example below) to reduce their usage as soon as possible. The quota for $HOME, if set, is 20 GiB.

[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system   User/Group   Usage     SoftLimit   HardLimit
saerda_g      $HOME        6.9 TiB   0 Bytes     0 Bytes
saerda        saerda (u)   2.8 GiB   0 Bytes     0 Bytes
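
To see which directories account for most of the usage reported by dusage, a sketch along the following lines may help. It uses only the standard du, sort and head tools; the function name largest_dirs is hypothetical and not part of dusage. Note that du itself walks the whole directory tree, so it is best run sparingly on an already loaded file system.

```shell
# largest_dirs DIR [N]: list the N largest entries directly under DIR
# (default 10), largest first, with sizes in KiB.
# Hypothetical helper built on standard tools; it is not part of dusage.
largest_dirs() {
    dir=$1
    n=${2:-10}
    # du -sk prints one summary line per entry; hidden entries
    # (names starting with a dot) are not matched by the glob.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}
```

For example, largest_dirs "$HOME" 5 would list the five largest visible entries in the home directory.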

In parallel, we are trying to help users to reduce their usage and to increase the capacity of the file system, but these measures usually take time.

Many thanks in advance!

Change to the “optimist” jobs

The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.

(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This was not discovered earlier because so few projects were using optimist jobs.)
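
Before resubmitting existing optimist job scripts, it may be worth checking that each one now carries a time limit. The following is a minimal sketch, assuming the limit is given as an #SBATCH directive inside the script (rather than on the sbatch command line); the function name has_time_directive is hypothetical.

```shell
# has_time_directive FILE: succeed if FILE contains an "#SBATCH --time ..."
# or "#SBATCH -t ..." line, fail otherwise.
# Hypothetical helper; it only inspects the script text and does not
# contact the scheduler.
has_time_directive() {
    grep -Eq '^#SBATCH[[:space:]]+(--time[=[:space:]]|-t[[:space:]]*[0-9])' "$1"
}
```

A script containing a line such as "#SBATCH --time=02:00:00" passes the check; a script with no such line does not, and would need one added before it is submitted as an optimist job.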

Saga: file system issues

We’re currently having some issues with the storage backend on Saga. Users will experience a hanging prompt on the login nodes and when attempting to connect to them. We’re actively working on resolving these issues and apologize for the inconvenience.

UPDATE 2020-07-19 13:00: The issues on Saga have been resolved and we are resuming normal operations.

UPDATE 2020-07-09 13:20: We needed to reboot a part of the storage system to mitigate the file system issues. For now, we’re monitoring the situation and will send an update tomorrow. Users are advised to check results/jobs that ran from about midnight to noon today. However, we do not recommend rescheduling or submitting new jobs for now. Login nodes should be functional.

tos-project3 on NIRD is read only

Due to underlying hardware issues, the tos-project3 file system is set to READ-ONLY while we investigate the issue.

These are the projects affected:

NN9999K
NS1002K
NS4704K
NS9001K
NS9012K
NS9014K
NS9033K
NS9054K
NS9063K
NS9066K
NS9114K
NS9191K
NS9320K
NS9404K
NS9518K
NS9602K
NS9615K
NS9641K
NS9672K
NS0000K
NS1004K
NS9000K
NS9003K
NS9013K
NS9021K
NS9035K
NS9060K
NS9064K
NS9081K
NS9133K
NS9305K
NS9357K
NS9478K
NS9560K
NS9603K
NS9616K
NS9655K
NS9999K

Maintenance on NIRD, NIRD Toolkit and Fram, 20th April to 24th April

23rd April, 18:50: NIRD and the NIRD Toolkit services are now back in production.

24th April: Fram is back in production.

WARNING: MAINTENANCE IS CURRENTLY ONGOING!

Dear NIRD, NIRD Toolkit, and Fram User,

We will have a four-day scheduled maintenance on NIRD, NIRD Toolkit and Fram starting on the 20th of April, 09:00 AM.

Running HPC jobs and logging in to Saga is NOT affected.
NIRD connectivity and backup of files from Saga ARE affected.

During the maintenance we will:

  • carry out software and firmware updates on all systems

Files stored on NIRD will be unavailable during the maintenance, and therefore so will the services. This will of course also affect the NIRD file systems available on Fram and Saga.

Login services to NIRD, NIRD Toolkit and Fram will be disabled during the maintenance.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

Saga – poor file system performance

The parallel file system on Saga is currently under a lot of stress caused by the running jobs.

We are working on optimizing and speeding up the file system together with the vendor.
In the meantime, we kindly ask you to follow the guidelines listed on our documentation pages.

As a general rule:

  • file system performance decreases as the file system usage grows
  • the number of I/O operations directly influences the responsiveness of the file system
  • disk operations are roughly a thousand times more expensive than memory operations
  • the higher the number of files, the slower the I/O is
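
One common way to act on the last point is to pack directories holding many small files into a single archive. The following is a minimal sketch using standard find and tar; the function names count_files and bundle_dir are hypothetical, and whether bundling suits a given workflow is for each user to judge.

```shell
# count_files DIR: print the number of regular files under DIR.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# bundle_dir DIR ARCHIVE: pack DIR into one gzip-compressed tar archive.
# The original directory is left in place; remove it only after
# verifying the archive (for example with "tar -tzf ARCHIVE").
bundle_dir() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}
```

For example, count_files results followed by bundle_dir results results.tar.gz shows how many files a directory named results holds and then replaces that many file system objects with one, once the original is removed.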

Thank you for your understanding!

Metacenter Operations

$USERWORK auto-cleanup on Saga

Dear Saga User,

The usage of the /cluster file system on Saga has now exceeded 60%. To keep the file system as responsive as possible, we have to periodically decrease the number of files, free up space and enforce automatic deletion of temporary files.

Starting on Wednesday, 19th of February, we will activate the automatic cleanup of the $USERWORK (/cluster/work) area as documented here.

The retention period is:

  • 42 days below 70% file system usage
  • 21 days when file system usage reaches 70%.

Files older than the active retention period will be automatically deleted.
You can read more information about the storage areas on HPC clusters here and here.

Please copy all your important data from $USERWORK to your project area to avoid data loss.
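
To find files that are likely to be affected, a sketch such as the following may help. It encodes the two retention periods listed above and lists files whose age exceeds a given number of days. It assumes file age is measured by modification time, which is not stated in this notice and may differ from what the cleanup actually uses; the function names retention_days and list_older_than are hypothetical.

```shell
# retention_days USAGE: print the retention period in days for a given
# file system usage in whole percent (42 below 70%, otherwise 21).
retention_days() {
    if [ "$1" -lt 70 ]; then
        echo 42
    else
        echo 21
    fi
}

# list_older_than DIR DAYS: list regular files under DIR that were last
# modified at least DAYS full days ago.
# Assumption: age is judged by modification time (mtime).
# find truncates ages to whole days, so "+(DAYS-1)" matches files that
# are DAYS or more full days old.
list_older_than() {
    find "$1" -type f -mtime +"$(( $2 - 1 ))"
}
```

For example, list_older_than "$USERWORK" "$(retention_days 65)" lists the files that would be at least 42 days old at 65% usage; those worth keeping can then be copied to the project area.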

Thank you for your understanding!

Metacenter Operations