[RESOLVED] Saga downtime. 7th December-11th December. Adding 4 PB of storage.

As previously announced, Saga will be down in the coming week, from 7th December 08:00 until 11th December 16:00.

The downtime is allocated for expanding the storage. When we come back, we will have approximately 4 PB in addition to the existing 1 PB.

Update: Saga is back online and running jobs again. The new storage is not online yet, but all the hardware has been mounted.

Please clean up data on Saga!

Dear users on Saga,

Currently, usage on Saga’s parallel file system (everything under /cluster) is at about 93%. Some of the file system servers are already refusing new data. If usage increases further, the performance of the parallel file system may drop significantly, then some users may experience data loss, and finally the whole cluster may come to a complete halt.


Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all storage locations where you store data, that is, $HOME, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the $HOME line in the example below) to reduce their usage as soon as possible. The quota for $HOME, when enforced, is 20 GiB.

[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system    User/Group    Usage      SoftLimit    HardLimit
saerda_g       $HOME         6.9 TiB    0 Bytes      0 Bytes
saerda         saerda (u)    2.8 GiB    0 Bytes      0 Bytes
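
To find out what is taking up space, standard tools are enough. For example (a sketch using GNU coreutils; $USERWORK is just one illustrative location):

[saerda@login-3.SAGA ~]$ dusage                                    # overview of your usage in all locations
[saerda@login-3.SAGA ~]$ du -sh $USERWORK/* | sort -rh | head -20  # the 20 largest items under $USERWORK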

In parallel, we are working to help users reduce their usage and to increase the capacity of the file system, but these measures take time.

Many thanks in advance!

Change to the “optimist” jobs

The requirements for specifying optimist jobs have changed: it is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.

(The reason for the change is that we discovered that optimist jobs often would not start properly without a --time specification. This was not noticed earlier because so few projects use optimist jobs.)
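
For illustration only, a minimal sketch of a job script with the new requirement. The account name, resources and executable are placeholders, and --qos=optimist is assumed to be how optimist jobs are requested; see the documentation for the authoritative recipe:

#!/bin/bash
#SBATCH --account=nn9999k      # placeholder project account
#SBATCH --qos=optimist         # assumed way of requesting an optimist job
#SBATCH --time=01:00:00        # now required: an upper limit on walltime
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=1

srun ./my_program              # placeholder executable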

Saga: file system issues

We’re currently experiencing issues with the storage backend on Saga. Users may see a hanging prompt on the login nodes, both in existing sessions and when attempting to connect. We’re actively working to resolve these issues and apologize for the inconvenience.

UPDATE 2020-07-19 13:00: The issues on Saga have been resolved and we are resuming normal operations.

UPDATE 2020-07-09 13:20: We needed to reboot part of the storage system to mitigate the file system issues. For now, we’re monitoring the situation and will send an update tomorrow. Users are advised to check the results of jobs that ran from about midnight to noon today; however, we do not recommend rescheduling or submitting new jobs for now. Login nodes should be functional.
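
To list your jobs from the affected window, Slurm’s accounting tool can be used. For example (a sketch; adjust the time window and format fields as needed):

$ sacct --starttime=2020-07-09T00:00 --endtime=2020-07-09T12:00 --format=JobID,JobName,State,ExitCode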

tos-project3 on NIRD is read-only

Due to underlying hardware issues, the tos-project3 file system is set to READ-ONLY while we investigate.

These are the projects affected:

NN9999K
NS1002K
NS4704K
NS9001K
NS9012K
NS9014K
NS9033K
NS9054K
NS9063K
NS9066K
NS9114K
NS9191K
NS9320K
NS9404K
NS9518K
NS9602K
NS9615K
NS9641K
NS9672K
NS0000K
NS1004K
NS9000K
NS9003K
NS9013K
NS9021K
NS9035K
NS9060K
NS9064K
NS9081K
NS9133K
NS9305K
NS9357K
NS9478K
NS9560K
NS9603K
NS9616K
NS9655K
NS9999K