There will be a scheduled downtime for Betzy lasting three days starting on Monday 6th December at 08:00. Downtime will last until Thursday 9th, 08:00.
During the downtime we will conduct:
- Full upgrade of the Lustre filesystem (both servers and clients)
- Full upgrade of the infiniband firmware
- Full upgrade of the Mellanox infiniband drivers
- minor updates to other parts of the system (Slurm, configs, etc)
Please be aware that this does also affect the storage services recently moved from NIRD to Betzy.
We apologize for the inconvenience
We’re aware of ongoing issues with the file system performance on Saga and are investigating the cause. This also affects logging in to Saga, where the terminal will hang waiting for a prompt.
Updates will be provided in this post as soon as we have more information to share.
Sorry for the inconvenience.
Update 2021-07-15, 16:33: The issue was identified as a faulty connection between the storage server and the cluster. Performance should be back to normal, but we will monitor the system a bit more before declaring it healthy.
Update 2021-07-14, 15:00: We’ve discovered some faulty drives that are currently being swapped. We hope that the performance will improve once these are in production again.
Update 2021-07-13, 10:03: The file system is a bit more stable now, but we’re still looking into the cause for the degraded performance.
We are currently experiencing degraded performance for the Betzy file system. We are investigating together with vendor to find the culprit for this issue
[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the dusage command output to only show one set of quotas (pool 1).
[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.
[2021-06-23 12:00] The maintenance has now started.
[UPDATE: The correct dates are June 23–24, not July]
There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.
During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.
All compute nodes and login nodes will be shut down during this time, and no jobs will be running during this period. Submitted jobs estimated to run into the downtime reservation will be held in queue.
We’re experiencing very slow file system on Saga at the moment and are working on identifying the cause.
Update 13:09: The file system is much more responsive now, but we’re still seeing that logins are hanging for ~30 seconds before getting access to the file system. This is being investigated further.
Updates will be provided once we have more information.
Sorry about the inconvenience.
Update: The file system servers have now been fixed, and we are back online again. Thank you for your patience.
We have an ongoing performance issue with Fram filesystem. We need to shut down file servers to get this fixed, and therefore need to have three hours downtime:
Wednesday 20th January between 12:00 and 15:00, Fram will be unavailable
As previously announced, Saga will be down in the coming week, from 7th December 08:00 until 11th December 16:00.
The downtime is allocated for expanding the storage. When we come back we will have ca 4 Petabyte in addition to the already existing 1 PetaByte.
Update: Saga is back online and running jobs again. The new storage is not online yet, but all the hardware has been mounted.
We are going to expand the storage on Saga. This will happen during week 50, between 7th and 11th December. Hopefully this will give oss a few Petabytes extra and enough storage for the lifetime of the system.
As of today, Wednesday 4th, November at 08:00, Fram is down for maintenance. We will do the same exercise as on NIRD-TOS, namely change all internal cables on the storage system.
17:20 NIRD-TOS and services are now up.
Dear users on Saga,
currently, usage on Saga’s parallel file system (everything under
/cluster) is at about 93%. Already, some of the file system servers are not accepting new data. If usage increases even further, soon the performance of the parallel file system may drop significantly, then some users may experience data loss and finally the whole cluster may come to a complete halt.
Therefore, we are kindly asking all users with large usage (check with the command
dusage) to cleanup unneeded data. Please, check all storage locations you’re storing data, that is,
$USERWORK, project folders (
/cluster/projects/...) and shared folders (
/cluster/shared/...). Particularly, we’re asking users whose
$HOME quota is not (yet) enforced (see line with
$HOME in example below) to reduce their usage as soon as possible. Quota for
$HOME if set is 20 GiB.
[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system User/Group Usage SoftLimit HardLimit
saerda_g $HOME 6.9 TiB 0 Bytes 0 Bytes
saerda saerda (u) 2.8 GiB 0 Bytes 0 Bytes
In parallel, we are trying to help users to reduce their usage and to increase the capacity of the file system, but these measures usually take time.
Many thanks in advance!