Saga: file system issues

We’re currently having some issues with the storage backend on Saga. Users will experience a hanging prompt on the login nodes and when attempting to connect to them. We’re actively working on resolving these issues and apologize for the inconvenience.

UPDATE 2020-07-09 13:20: We needed to reboot part of the storage system to mitigate the file system issues. For now, we’re monitoring the situation and will send an update tomorrow. Users are advised to check the results of jobs that ran from about midnight to noon today. However, we do not recommend rescheduling or submitting new jobs for now. Login nodes should be functional.

Fram off-line: File system issues

Dear Fram Users,

The ongoing problems on FRAM, reported on July 1st, cause the error message “No space left on device” for various file operations.

The problems are being investigated, and we will keep you updated on the progress.

UPDATE 2020-07-08 14:50: hugemem on Fram is now operating as normal.

UPDATE 2020-07-08 10:35: The file system issues have been resolved, and we are operating as normal with the exception of hugemem, which is still unavailable. Please let us know if you’re still experiencing problems. Again, we apologize for the inconvenience.

UPDATE 2020-07-08 09:00: Our vendor has corrected the filesystem bug and we should be operating as normal soon. At the moment we’re running some tests which will slow down current jobs running on Fram.

UPDATE 2020-07-07 15:35: The problem on Fram is caused by a bug in the Lustre filesystem. Our vendor is taking over the case to fix the issue. Thank you for your patience, and we apologize for the inconvenience.

UPDATE 2020-07-07 09:50: We are still experiencing file system errors on FRAM, and are working to resolve the issue as soon as possible. Watch this space for updates.

UPDATE 2020-07-06 12:30: FRAM has been opened again.

UPDATE 2020-07-06 09:50: The file system is up and running. It seems to be stable, and this has also been verified by the vendor. It should be possible to use FRAM within a couple of hours.

UPDATE 2020-07-03 17:10: The file system is up and running, but we have decided to keep the machine closed during the weekend so that we can be sure everything works as it should on Monday. Many of the recent FRAM downtimes have been caused by storage hardware faults. We are investigating the issue together with the storage vendor.

UPDATE 2020-07-02 13:20: FRAM is off-line and we are investigating the issues. The machine will probably stay off-line until tomorrow.

UPDATE 2020-07-02 12:10: The whole file system is still very unstable, and we will most likely have to take FRAM down. A Slurm reservation has been created, and all users might be kicked out soon.

UPDATE 2020-07-02 11:15: The whole file system is still very unstable, and we are trying to fix the problem.

Metacenter Operations

Stallo up and running

Stallo is now up and running again. Unfortunately, the old lad lost two racks during hibernation. We are looking into it and will report when those racks are up and OK again. Please check any jobs that have been restarted, and report back if they produce no output.