As previously announced, Saga will be down in the coming week, from 7th December 08:00 until 11th December 16:00.
The downtime is allocated for expanding the storage. When we come back, we will have about 4 petabytes in addition to the existing 1 petabyte.
Update: Saga is back online and running jobs again. The new storage is not online yet, but all the hardware has been mounted.
We are going to expand the storage on Saga. This will happen during week 50, between 7th and 11th December. Hopefully this will give us a few petabytes extra and enough storage for the lifetime of the system.
All services and file systems are now back in operation, including NIRD-Services and NIRD mounts on HPC systems.
Please be aware that some projects on NIRD have changed home systems from NIRD-TOS to NIRD-TRD and vice versa.
Dear users on Saga,
Currently, usage on Saga’s parallel file system (everything under /cluster) is at about 93%. Some of the file system servers are already refusing new data. If usage increases further, the performance of the parallel file system may soon drop significantly, then some users may experience data loss, and finally the whole cluster may come to a complete halt.
Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all locations where you store data, that is, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the line with $HOME in the example below) to reduce their usage as soon as possible. The quota for $HOME, where set, is 20 GiB.
[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system    User/Group    Usage      SoftLimit    HardLimit
saerda_g       $HOME         6.9 TiB    0 Bytes      0 Bytes
saerda         saerda (u)    2.8 GiB    0 Bytes      0 Bytes
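To find out where the space reported by dusage is actually going, something like the following sketch may help. It is only an illustration: largest_dirs is a hypothetical helper name (not a Saga tool), and it relies solely on the standard du, sort and head utilities.

```shell
# Sketch: list the largest immediate subdirectories of a directory.
# largest_dirs is a hypothetical helper name, not a tool provided on Saga.
largest_dirs() {
    ld_dir="${1:-.}"      # directory to inspect (defaults to the current one)
    ld_count="${2:-10}"   # how many entries to print
    # du -sk prints one total per argument, in KiB; the trailing "/" in the
    # glob restricts the arguments to (non-hidden) subdirectories.
    du -sk -- "$ld_dir"/*/ 2>/dev/null | sort -rn | head -n "$ld_count"
}
```

For example, largest_dirs "$USERWORK" 5 would print the five largest subdirectories of $USERWORK, largest first. On a heavily loaded file system du itself generates metadata traffic, so it is probably best to run it on one directory at a time.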
In parallel, we are trying to help users to reduce their usage and to increase the capacity of the file system, but these measures usually take time.
Many thanks in advance!
The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.
(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This was not discovered earlier because so few projects were using optimist jobs.)
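A quick way to check existing job scripts against the new requirement might look like the sketch below. It is an assumption-laden illustration: has_time_directive is a hypothetical helper name, and it only looks for a time limit given as an #SBATCH directive (--time or its short form -t), not one passed on the sbatch command line.

```shell
# Sketch: succeed if a job script sets a time limit via an #SBATCH directive.
# has_time_directive is a hypothetical helper name, not part of Slurm.
has_time_directive() {
    # Matches "--time=..." or "--time ..." and the short form "-t ..."
    # anywhere on an #SBATCH line; "--time-min" is deliberately not matched.
    grep -Eq '^#SBATCH.*[[:space:]](--time[=[:space:]]|-t[[:space:]]*[0-9])' "$1"
}

# Example use: report scripts in the current directory that lack a time limit.
# for script in *.sh; do
#     has_time_directive "$script" || echo "missing --time: $script"
# done
```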
At about 08:00 this morning, parts of the /cluster file system on Saga became unavailable. A typical error message was “Communication error on send”. The problem was discovered and fixed at around 08:50.
Some jobs will probably have been affected, so please check your jobs.
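One way to look for affected jobs is to search job output files for the error message quoted above. The sketch below assumes the jobs used Slurm's default output file name (slurm-<jobid>.out) and that the message ended up in that file; affected_outputs is a hypothetical helper name.

```shell
# Sketch: list Slurm output files in a directory that contain the error.
# affected_outputs is a hypothetical helper name; the slurm-*.out pattern
# assumes the default sbatch output file name was not overridden.
affected_outputs() {
    # -l prints only the names of matching files, -F treats the pattern
    # as a fixed string rather than a regular expression.
    grep -lF 'Communication error on send' "${1:-.}"/slurm-*.out 2>/dev/null
}
```

Running affected_outputs in a job's submit directory prints the matching files, if any. A job can also have been affected without logging this message, so an empty result is not proof that everything ran correctly.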
We are sorry for the inconvenience.
July 30, 12:52: Issue resolved
July 30, 12:18: We are experiencing issues accessing the NIRD storage from SAGA. This is due to a mounting issue, and we do not have an estimate of when it will be resolved because most of the staff are still on holiday. NIRD is still accessible from FRAM if you have access there as well. Sorry for the inconvenience.
We’re currently having some issues with the storage backend on Saga. Users will experience a hanging prompt on the login nodes and when attempting to connect to them. We’re actively working on resolving these issues and apologize for the inconvenience.
UPDATE 2020-07-19 13:00: The issues on Saga have been resolved and we are resuming normal operations.
UPDATE 2020-07-09 13:20: We needed to reboot a part of the storage system to mitigate the file system issues. For now, we’re monitoring the situation and will send an update tomorrow. Users are advised to check results/jobs that ran from about midnight to noon today. However, we do not recommend rescheduling or submitting new jobs for now. Login nodes should be functional.
Due to underlying hardware issues, the tos-project3 file system has been set to READ-ONLY while we investigate the issue.
These are the projects affected:
login-2.saga.sigma2.no is currently down due to several faulty memory DIMMs. This also affects the use and functionality of the Slurm browser and desktop.saga.sigma2.no.
We hope to have the DIMMs replaced some time during the week.