The 120 new nodes installed on Saga last week were unavailable between 03:15 and 08:30 this morning due to a configuration error. The configuration has been fixed and the nodes are back in production.
The 60 jobs that were running on the nodes at the time of the incident were requeued and have since restarted.
We are sorry for the inconvenience!
Today, Saga has been extended with 120 new compute nodes, increasing the total number of CPUs on the cluster from 9824 to 16064.
The new nodes have been added to the normal partition. They are identical to the old compute nodes in the partition, except that they have 52 CPU cores instead of 40.
We hope this extension will reduce the wait time for normal jobs on Saga.
login-1-3 on Fram had runaway processes that ended up consuming all memory and swap, so we unfortunately had to reboot it.
The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.
(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This was not discovered earlier because so few projects were using optimist jobs.)
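For illustration, a minimal optimist job script under the new requirement might look like the sketch below. The account name, resource sizes, and program name are placeholders, and we assume the usual way of marking an optimist job via its QOS:

```shell
#!/bin/bash
# Sketch of an optimist job script (account and sizes are placeholders).
#SBATCH --account=nn9999k
#SBATCH --qos=optimist          # marks the job as an optimist job
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00         # now required also for optimist jobs

srun ./my_program
```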
At about 08:00 this morning, parts of the /cluster filesystem on Saga became unavailable. A typical error message was "Communication error on send". The problem was discovered and fixed at around 08:50.
Some jobs will probably have been affected, so please check your jobs.
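One way to check whether your jobs were affected is to list jobs that failed or were cancelled during the incident window with sacct (the times below are examples matching this incident; adjust as needed):

```shell
# List your jobs that failed or were cancelled between 08:00 and 09:00 today.
sacct --starttime=08:00 --endtime=09:00 \
      --state=FAILED,CANCELLED,NODE_FAIL \
      --format=JobID,JobName,State,ExitCode
```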
We are sorry for the inconvenience.
Quite a few users have lost access to their project(s) on Nird and all clusters during the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.
We will update this page when access has been restored.
Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at email@example.com
Update: This applies to all systems, not only Fram and Saga.
Slurm was upgraded to the latest version (19.05.3-2) on Saga today. This includes a fix for the problem with using “srun” for running interactive jobs.
Please let us know if you notice anything that has gone wrong after the upgrade.
We are currently experiencing problems with the /cluster file system on Saga. This prevents users from logging in.
We are investigating, and will update here when we know more.
Update 11:30: We have identified and fixed the problem; the /cluster filesystem is back online.
As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.
This has enabled us to replace the limit of how many jobs per user the system will try to backfill with a more flexible setting.
Jobs are started and backfilled in priority order, and the priority of a pending job increases over time. With the new setting, only a fixed number of each user’s jobs within a project will increase in priority. The number is currently 10, but will probably be adjusted over time. This setting makes it easier for all users and projects to get jobs started when one user has submitted a large number of jobs in a short time, while not preventing jobs from starting when there are free resources.
Note that the setting is per user and project, so if more than one user submits jobs in the same project, each of them will get 10 jobs with increasing priority; similarly, if one user submits jobs to several projects, the user will get 10 jobs with increasing priority in each project.
The setting is documented here.
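To see how this affects your own jobs, you can inspect the current priority of your pending jobs with squeue. A sketch, assuming a recent Slurm where %Q prints the job priority:

```shell
# List your pending jobs, highest priority first.
# %i = job id, %j = job name, %Q = current priority.
squeue -u $USER --state=PENDING --sort=-Q -o "%i %j %Q"
```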
Update: The upgrade is now done. All seems to have gone well, but it can be a good idea to check your jobs.
We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.