Please clean up data on Saga!

Dear users on Saga,

Currently, usage on Saga’s parallel file system (everything under /cluster) is at about 93%, and some of the file system servers are already not accepting new data. If usage increases even further, the performance of the parallel file system may soon drop significantly, some users may experience data loss, and finally the whole cluster may come to a complete halt.


Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all storage locations where you are storing data, that is, $HOME, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the line with $HOME in the example below) to reduce their usage as soon as possible. The quota for $HOME, once enforced, is 20 GiB.

[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system   User/Group    Usage      SoftLimit   HardLimit
saerda_g      $HOME         6.9 TiB    0 Bytes     0 Bytes
saerda        saerda (u)    2.8 GiB    0 Bytes     0 Bytes
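
If you are unsure where your data accumulates, a command along the lines of the one below can help you find the largest directories in a given location. The path is just an example; replace it with $HOME, $USERWORK, your project folder or a shared folder as appropriate.

# Show the 20 largest directories directly under $USERWORK, largest first.
du -sh $USERWORK/* 2>/dev/null | sort -rh | head -n 20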

In parallel, we are trying to help users to reduce their usage and to increase the capacity of the file system, but these measures usually take time.

Many thanks in advance!

Betzy access closed, preparing for production

UPDATE:

  • 25.09.2020: We are temporarily reopening access over the weekend in order to allow further testing on the machine.
    Further work is expected to be done by the vendor sometime next week; as a consequence, jobs will be terminated again and access will be closed while the maintenance is ongoing.

Dear Betzy pilots,

We are pleased to announce that, despite the logistics challenges caused by Covid-19, most of the outstanding issues have been sorted out. This unusual situation required a more dynamic approach from everyone involved, while putting pressure on communication due to uncertainties and rapidly changing circumstances. Because of this, setting and advertising a production date proved to be difficult.

We are now aiming to put Betzy into production at the beginning of October. Before we can conclude and proceed with the preparations, we need to re-run several comprehensive tests.

Therefore, we will have to stop all jobs and access to Betzy starting tomorrow, 17 September 2020, at 10:00. Access to Betzy will be re-established as soon as all the tests are completed. Please be prepared for a more extensive maintenance this time, which might require up to two and a half weeks.

The file system on Betzy is not going to be reformatted; that is, your data will not be removed intentionally. However, we cannot guarantee data integrity until backups are taken and the machine is placed into production. Therefore, we strongly advise you to back up your important data as a precaution.
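
For example, a simple way to copy important data off the cluster is rsync, run from the receiving machine. The user name, login host and paths below are placeholders only; adjust them to your own setup.

# Copy a directory from Betzy to a local machine (run on the local machine).
rsync -avz myuser@login.betzy.sigma2.no:/cluster/projects/nnXXXXk/important_data/ /local/backup/important_data/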

Apologies for the short notice and the inconvenience this is causing you.

Best regards,

Lorand Szentannai, on behalf of the preparations team

Fram file system issues

11:30 15-09-2020 [Update 7]: Quick heads-up: We are trying to put one of the storage servers back into production. This could result in some users/jobs experiencing short hangs. If you are in doubt about the behaviour of your jobs, please do not hesitate to contact us at support@metacenter.no.

14:30 14-09-2020 [Update 6]: Most compute nodes are now running the old Lustre client again, so as regards the most recent issues it should be safe to submit jobs. Unfortunately, this also means that the «hung io-wait issue» may happen again. Just contact us via support@metacenter.no in case you continue to have file system issues.

12:15 14-09-2020 [Update 5]: We have found the reason for the behaviour many users have reported (problems with the module system, crashes, etc.): it appears to be caused by the new file system client. The only immediate “solution” is therefore to go back to the old version of the client. This may cause other issues; however, they are less severe than what we see now. We will announce here when it is safe to submit jobs.

10:30 14-09-2020 [Update 4]: Over the weekend, the Lustre client for the parallel file system was updated on the majority of compute nodes. However, users are still reporting issues, particularly when loading modules. It seems that the module system is not configured correctly on the updated nodes. We are looking into fixing the issue and will keep you up to date here.

Sorry for the inconvenience!

15:00 11-09-2020 [Update 3]: We are currently upgrading the Lustre file system clients to mitigate a «hung io-wait issue». We are also at reduced capacity performance-wise, as one of the eight IO servers is down. Full production is to be expected from Monday morning. A small hang is expected when the IO server is phased in. We expect the hung io-wait issue to go away during the next two weeks as clients are upgraded.

20:50 10-09-2020 [Update 2]: We are sorry to inform you that we are still having some issues; the vendor has been contacted.

13:15 10-09-2020 [Update 1]: The file system is partially back in operation, which means you may use Fram, but performance will be sub-optimal. Some jobs may be affected when we try to bring object storage back later today.

08:15 10-09-2020: We are experiencing some issues with the Fram file system and are working on a fix. Sorry for the inconvenience.

Change to the “optimist” jobs

The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.

(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This was not discovered earlier because so few projects were using optimist jobs.)
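
As an illustration, a minimal optimist job script under the new requirement could look like the sketch below. The account name, time limit, resources and executable are placeholders, and the sketch assumes optimist jobs are requested via --qos=optimist; the point is simply that --time must now be given as well.

#!/bin/bash
#SBATCH --account=nnXXXXk        # placeholder project account
#SBATCH --qos=optimist           # request an "optimist" job (assumed mechanism)
#SBATCH --time=01:00:00          # now required also for optimist jobs
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G

srun ./my_program                # placeholder executable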

Saga: file system issues

We’re currently having some issues with the storage backend on Saga. Users will experience a hanging prompt on the login nodes and when attempting to connect to them. We’re actively working on resolving these issues and apologize for the inconvenience.

UPDATE 2020-07-19 13:00: The issues on Saga have been resolved and we are resuming normal operations.

UPDATE 2020-07-09 13:20: We needed to reboot a part of the storage system to mitigate the file system issues. For now, we are monitoring the situation and will send an update tomorrow. Users are advised to check the results of jobs that ran from about midnight to noon today; however, we do not recommend rescheduling or submitting new jobs for now. Login nodes should be functional.
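
If you are unsure which of your jobs fall into that window, a query along the following lines can help (the times correspond to this update; adjust them as needed):

# List your jobs that ran between midnight and noon on 2020-07-09, with their final state and exit code.
sacct -S 2020-07-09T00:00 -E 2020-07-09T12:00 --format=JobID,JobName,State,ExitCode,Start,End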

Fram off-line: File system issues

Dear Fram Users,

The ongoing problems on FRAM, reported July 1st, cause the error message “No space left on device” for various file operations.

The problems are being investigated, and we will keep you updated on the progress.

UPDATE 2020-07-08 14:50: hugemem on Fram is now operating as normal.

UPDATE 2020-07-08 10:35: The file system issues have been resolved and we are operating as normal with the exception of hugemem, which is still unavailable. Please let us know if you’re still experiencing problems. Again we apologize for the inconvenience.

UPDATE 2020-07-08 09:00: Our vendor has corrected the filesystem bug and we should be operating as normal soon. At the moment we’re running some tests which will slow down current jobs running on Fram.

UPDATE 2020-07-07 15:35: The problem on Fram is caused by a bug in the Lustre filesystem. Our vendor is taking over the case to fix the issue. Thank you for your patience, we apologize for the inconvenience.

UPDATE 2020-07-07 09:50: We are still experiencing file system errors on FRAM, and are working to resolve the issue as soon as possible. Watch this space for updates.

UPDATE 2020-07-06 12:30: FRAM has been opened again.

UPDATE 2020-07-06 09:50: The file system is up and running; it seems to be stable, and this has also been verified by the vendor. It should be possible to use FRAM within a couple of hours.

UPDATE 2020-07-03 17:10: The file system is up and running, but we have decided to keep the machine closed during the weekend so that we are sure everything works as it should on Monday. Many of the recent FRAM downtimes have been caused by storage hardware faults. We are investigating the issue together with the storage vendor.

UPDATE 2020-07-02 13:20: FRAM is off-line; we are investigating the issues. The machine will probably stay off-line until tomorrow.

UPDATE 2020-07-02 12:10: The whole file system is still very unstable. We will most likely have to take FRAM down; a Slurm reservation has been created, and all users might be kicked out soon.

UPDATE 2020-07-02 11:15: The whole file system is still very unstable, and we are trying to fix the problem.
