Dear users on Saga,
currently, usage on Saga’s parallel file system (everything under /cluster) is at about 93%. Some of the file system servers are already not accepting new data. If usage increases further, the performance of the parallel file system may drop significantly, some users may experience data loss, and eventually the whole cluster may come to a complete halt.
Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all locations where you store data, that is, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the $HOME line in the example below) to reduce their usage as soon as possible. The quota for $HOME, once set, is 20 GiB.
[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system   User/Group    Usage     SoftLimit   HardLimit
saerda_g      $HOME         6.9 TiB   0 Bytes     0 Bytes
saerda        saerda (u)    2.8 GiB   0 Bytes     0 Bytes
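To see what is taking up the space, commands along these lines can help (a sketch only; adjust the path and the age threshold to your situation):

du -sh $USERWORK/* 2>/dev/null | sort -rh | head -20   # largest directories first
find $USERWORK -type f -mtime +60 -ls                  # files untouched for 60+ days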
In parallel, we are trying to help users reduce their usage and to increase the capacity of the file system, but these measures usually take time.
Many thanks in advance!
The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.
(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This had not been discovered earlier because so few projects were using optimist jobs.)
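As an illustration, the header of an optimist job script would now look something like this (a minimal sketch; the account nn1234k and the chosen time limit are placeholders, and the remaining options are only examples):

#!/bin/bash
#SBATCH --account=nn1234k     # placeholder project account
#SBATCH --qos=optimist        # marks the job as an optimist job
#SBATCH --time=01:00:00       # now required: estimated run time of the job
#SBATCH --mem-per-cpu=2G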
At about 08:00 this morning, parts of the /cluster file system on Saga became unavailable. A typical error message was “Communication error on send”. The problem was discovered and fixed at around 08:50.
Some jobs will probably have been affected, so please check your jobs.
We are sorry for the inconvenience.
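One way to check is to list your jobs from the affected window with sacct (a sketch; adjust the times if needed):

sacct --starttime 08:00 --endtime 09:00 --format=JobID,JobName,State,ExitCode

Jobs shown in states like FAILED or NODE_FAIL during that window are the most likely candidates for resubmission.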
July 30, 12:52: Issue resolved
July 30, 12:18: We are experiencing issues accessing the NIRD storage from Saga. This is due to a mounting issue, and we do not have an estimate for when it will be resolved, as most of the staff are still on holiday. NIRD is still accessible from Fram if you have access there as well. Sorry for the inconvenience.
We’re currently having some issues with the storage backend on Saga. Users will experience a hanging prompt on the login nodes and when attempting to connect to them. We’re actively working on resolving these issues and apologize for the inconvenience.
UPDATE 2020-07-19 13:00: The issues on Saga have been resolved and we are resuming normal operations.
UPDATE 2020-07-09 13:20: We needed to reboot a part of the storage system to mitigate the file system issues. For now, we are monitoring the situation and will send an update tomorrow. Users are advised to check results of jobs that ran from about midnight to noon today; however, we do not recommend rescheduling or submitting new jobs for now. Login nodes should be functional.
Due to underlying hardware issues, the tos-project3 file system has been set to READ-ONLY while we investigate.
These are the projects affected:
login-2.saga.sigma2.no is currently down due to several faulty memory DIMMs. This also affects the use and functionality of the Slurm browser and desktop.saga.sigma2.no.
We hope to have the DIMMs replaced some time during the week.
23 April, 18:50: NIRD and the NIRD Toolkit services are now back in production.
24th April: Fram is back in production.
WARNING: MAINTENANCE IS CURRENTLY ONGOING!
Dear NIRD, NIRD Toolkit, and Fram User,
We will have a four-day scheduled maintenance of NIRD, the NIRD Toolkit and Fram starting on the 20th of April at 09:00.
Running HPC jobs and logging in to Saga is NOT affected.
NIRD connectivity and backup of files from Saga ARE affected.
During the maintenance we will:
- carry out software and firmware updates on all systems
Files stored on NIRD will be unavailable during the maintenance, and therefore so will the services. This will of course also affect the NIRD file systems available on Fram and Saga.
Login services to NIRD, the NIRD Toolkit and Fram will be disabled during the maintenance.
Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.
Please accept our apologies for the inconvenience this downtime is causing.
The parallel file system on Saga is currently under a lot of stress caused by the running jobs.
Together with the vendor, we are working on optimizing and speeding up the file system.
In the meantime, we kindly ask you to follow the guidelines listed on our documentation pages.
As a general rule:
- file system performance decreases as file system usage grows
- the number of I/O operations directly influences the responsiveness of the file system
- disk operations are roughly a thousand times more expensive than memory operations
- the more files there are, the slower the I/O (see the example below)
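For example, packing a directory that contains many small files into a single archive replaces thousands of small reads and writes with one large sequential operation (a sketch; dataset/ is a placeholder for your own directory):

tar -czf dataset.tar.gz dataset/   # one large write instead of many small ones
tar -xzf dataset.tar.gz            # unpack again where the files are needed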
Thank you for your understanding!
Dear Saga User,
The usage of the /cluster file system on Saga has now passed 60%. To keep the file system as responsive as possible, we have to periodically decrease the number of files, free up space and enforce automatic deletion of temporary files.
Starting on Wednesday, the 19th of February, we are going to activate the automatic cleanup of the $USERWORK (/cluster/work) area, as documented here.
The retention period is:
- 42 days below 70% file system usage
- 21 days when file system usage reaches 70%.
Files older than the active retention period will be automatically deleted.
You can read more information about the storage areas on HPC clusters here and here.
Please copy all your important data from $USERWORK to your project area to avoid data loss.
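For example, to copy a result directory to your project area and to preview which files are already older than the shorter retention period (a sketch; nn1234k and my_results are placeholders for your own project and data):

rsync -av $USERWORK/my_results/ /cluster/projects/nn1234k/my_results/   # copy to the project area
find $USERWORK -type f -mtime +21                                       # files older than 21 days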
Thank you for your understanding!