login-1-3 on Fram had runaway processes that used up all memory and swap, so we unfortunately had to reboot it.
The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.
(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This was not discovered earlier because so few projects were using optimist jobs.)
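As a rough illustration of the new requirement, the sketch below checks a job script before submission. It assumes that optimist jobs are requested with a `#SBATCH --qos=optimist` line, which this post does not state, and the function name `missing_time_for_optimist` is hypothetical. It only looks at `#SBATCH` lines, not at options given on the sbatch command line.

```python
import re


def missing_time_for_optimist(script_text):
    """Return True if the script asks for optimist but sets no time limit.

    Assumption: optimist is requested via '--qos=optimist' (not stated in
    the announcement). The time limit is recognised as '--time' or '-t'.
    """
    directives = [
        line.strip()
        for line in script_text.splitlines()
        if line.strip().startswith("#SBATCH")
    ]
    wants_optimist = any(
        re.search(r"--qos[= ]optimist\b", d) for d in directives
    )
    has_time = any(
        re.search(r"--time[= ]\S+|(?<!\S)-t\s*\S+", d) for d in directives
    )
    return wants_optimist and not has_time


script = """#!/bin/bash
#SBATCH --qos=optimist
#SBATCH --nodes=1
"""
if missing_time_for_optimist(script):
    print("optimist job without --time: add a time limit before submitting")
```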
At about 08:00 this morning, parts of the /cluster filesystem on Saga became unavailable. The typical error message was “Communication error on send”. The problem was discovered and fixed at around 08:50.
Some jobs will probably have been affected, so please check your jobs.
We are sorry for the inconvenience.
Quite a few users lost access to their project(s) on NIRD and all clusters during the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.
We will update this page when access has been restored.
Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at firstname.lastname@example.org
Update: This applies to all systems, not only Fram and Saga.
Slurm was upgraded to the latest version (19.05.3-2) on Saga today. This includes a fix for the problem with using “srun” for running interactive jobs.
Please let us know if you notice anything that has gone wrong after the upgrade.
We are currently experiencing problems with the /cluster file system on Saga. This prevents users from logging in.
We are investigating, and will update here when we know more.
Update 11:30: We have identified and solved the problem, and the /cluster filesystem is back online.
As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.
This has enabled us to replace the limit of how many jobs per user the system will try to backfill with a more flexible setting.
Jobs are started and backfilled in priority order, and the priority of a job increases with time while it is pending. With the new setting, only a fixed number of each user’s jobs within a project will increase in priority. The number is currently 10, but will probably be adjusted over time. This setting will make it easier for all users and projects to get jobs started when one user has submitted very many jobs over a short time, while at the same time not preventing jobs from starting when there are free resources.
Note that the setting is per user and project, so if more than one user submits jobs in the same project, each of them will get 10 jobs with priority increasing, and similarly, if one user submits jobs to several projects, the user will get 10 jobs with priority increasing in each project.
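The per-user, per-project counting can be sketched in a few lines of Python. This is only an illustration of the rule described above, not how Slurm implements it. It assumes that the jobs submitted first are the ones whose priority increases, which this post does not state. The user names and project names are made up.

```python
from collections import defaultdict

AGING_LIMIT = 10  # current value per this post; may be adjusted over time


def jobs_accruing_priority(pending_jobs, limit=AGING_LIMIT):
    """Return the ids of pending jobs whose priority would keep increasing.

    pending_jobs: iterable of (job_id, user, project) in submission order.
    Assumption: the earliest-submitted jobs of each (user, project) pair
    are the ones that accrue priority.
    """
    counts = defaultdict(int)
    accruing = []
    for job_id, user, project in pending_jobs:
        key = (user, project)
        if counts[key] < limit:
            counts[key] += 1
            accruing.append(job_id)
    return accruing


# Hypothetical queue: 12 + 3 jobs from one user in two projects,
# and 11 jobs from a second user in the first project.
pending = [(i, "alice", "nn0001k") for i in range(12)]
pending += [(100 + i, "alice", "nn0002k") for i in range(3)]
pending += [(200 + i, "bob", "nn0001k") for i in range(11)]

accruing = jobs_accruing_priority(pending)
print(len(accruing))  # 23: 10 + 3 for alice, 10 for bob
```

The jobs beyond the limit are still pending and can still start when there are free resources. They just do not gain priority from waiting.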
The setting is documented here.
Update: The upgrade is now done. All seems to have gone well, but it can be a good idea to check your jobs.
We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.
- 2018-11-07 16:36: We will have to upgrade firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
- 2018-11-07 14:53: User home directories were migrated over to /cluster/home and Fram is starting back again. We will soon re-open access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.
Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data. This means stopping all jobs and user processes, and logging users out of the systems.
We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today. (Note that the NIRD project areas will _not_ be available until NIRD is up again.)
We will update this post with more information when we know more.
Update: There was a second hang at around 09:00. The reason for the hangs has been found and fixed.
There was a short hang on the /nird/home ($HOME) file system on Fram from 08:00 to 08:55 today. The file system is back to normal now. We are investigating the reason for it.
Jobs running in $SCRATCH, $USERWORK or in a project directory have most likely not been affected, but it is probably a good idea to check the status of your jobs.