[2021-06-23 14:26] The issue is now solved, and jobs have started to run again. Please report any further issues to email@example.com
[2021-06-23 09:20] We are again experiencing problems on Betzy. We will update here when we’ve solved the issue.
[2021-06-22 11:15] The problem has been located and fixed, and Betzy should work as normal again.
[2021-06-22 09:30] We are currently experiencing network problems on Betzy. We don’t know the full extent of it, but it is at least affecting the queue system, so all Slurm-related commands are hanging.
We are investigating, and will update when we know more.
[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the dusage command output to only show one set of quotas (pool 1).
[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.
[2021-06-23 12:00] The maintenance has now started.
[UPDATE: The correct dates are June 23–24, not July]
There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.
During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.
All compute nodes and login nodes will be shut down, so no jobs will run during this period. Submitted jobs whose requested run time would extend into the downtime reservation will be held in the queue.
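As a sketch of how to see whether a pending job is being held by the maintenance reservation, the pending reason can be inspected with squeue (the exact reason string depends on the site's Slurm configuration):

```shell
# Show pending jobs for the current user together with their pending reason.
# Jobs held by a maintenance reservation typically show a reason such as
# "ReqNodeNotAvail, Reserved for maintenance" (exact wording varies by site).
squeue -u "$USER" -t PENDING -o "%.10i %.20j %.30r"
```

The `%i`, `%j` and `%r` format specifiers print the job ID, job name and pending reason, respectively.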
The 120 new nodes installed on Saga last week were unavailable between 03:15 and 08:30 this morning due to a configuration error. The error has been fixed, and the nodes are back in production.
60 jobs that were running on the nodes at the time of the incident were requeued and have since restarted.
We are sorry for the inconvenience!
Today, Saga has been extended with 120 new compute nodes, increasing the total number of CPUs on the cluster from 9824 to 16064.
The new nodes have been added to the normal partition. They are identical to the old compute nodes in the partition, except that they have 52 CPU cores instead of 40.
We hope this extension will reduce the wait time for normal jobs on Saga.
login-1-3 on Fram had runaway processes that consumed all memory and swap, so we unfortunately had to reboot it.
The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.
(The reason for the change is that we discovered that optimist jobs often would not start properly without a --time specification. This had not been noticed earlier because so few projects use optimist jobs.)
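A minimal sketch of an optimist job script under the new requirement follows; the account name, time limit, memory request and program name are placeholders, not values from this announcement:

```shell
#!/bin/bash
# Sketch of an optimist job script; account, time, memory and program
# are placeholder values to be replaced with your own.
#SBATCH --account=nnXXXXk
#SBATCH --qos=optimist
#SBATCH --time=01:00:00      # now required for optimist jobs
#SBATCH --mem-per-cpu=2G

srun ./my_program
```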
At about 08:00 this morning, parts of the /cluster file system on Saga became unavailable. Typical errors will have been of the form “Communication error on send”. The problem was discovered and fixed at around 08:50.
Some jobs will probably have been affected, so please check your jobs.
We are sorry for the inconvenience.
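One way to check whether your jobs were affected is to list jobs active during the incident window with sacct; the time window below matches the incident described above, but adjust it as needed:

```shell
# List your jobs active between 08:00 and 09:00 today,
# with their final state and exit code.
sacct -u "$USER" -S 08:00 -E 09:00 \
      --format=JobID,JobName,State,ExitCode,Elapsed
```

Jobs in a FAILED or NODE_FAIL state during that window are the most likely candidates for resubmission.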
Quite a few users have lost access to their project(s) on Nird and all clusters during the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.
We will update this page when access has been restored.
Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at firstname.lastname@example.org
Update: This applies to all systems, not only Fram and Saga.
Slurm was upgraded to the latest version (19.05.3-2) on Saga today. This includes a fix for the problem with using “srun” for running interactive jobs.
Please let us know if you notice anything that has gone wrong after the upgrade.
We are currently experiencing problems with the /cluster file system on Saga. This prevents users from logging in.
We are investigating, and will update here when we know more.
Update 11:30: We have identified and solved the problem, and the /cluster file system is back online.