[UPDATE, 2021-06-08 08:00] Betzy is now up and in production again.
[UPDATE] Unfortunately, the downtime is taking longer than anticipated, and will not be finished tonight. We plan on getting Betzy up again at around 08:00 tomorrow morning.
Campusservice at NTNU will conduct maintenance on the High Voltage circuits for Non-redundant power on 7th of June 2021, between 15:00 and 20:00. All compute nodes and login nodes will be shut down during this time, and no jobs will be running during this period. Submitted jobs estimated to run into the downtime reservation will be held in queue.
The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.
(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This had not been noticed earlier because so few projects were using optimist jobs.)
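For illustration, an optimist job script might now look like the following sketch. The account name, QOS name, and resource values are placeholders, not a prescription; the key point is the added --time line:

```shell
#!/bin/bash
# Hypothetical optimist job script; account and resource values are examples only.
#SBATCH --account=nn9999k        # placeholder project account
#SBATCH --qos=optimist           # assumed QOS name for optimist jobs
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00          # now required for optimist jobs

srun ./my_program                # placeholder executable
```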
The queueing system on Vilje has crashed. We are working on a fix.
Quite a few users have lost access to their project(s) on Nird and all clusters during the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.
We will update this page when access has been restored.
Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at firstname.lastname@example.org
Update: This applies to all systems, not only Fram and Saga.
Dear Saga cluster Users:
We have discovered a /cluster filesystem issue on Saga which can lead to possible data corruption. To be able to examine the problem, we have decided to suspend all running jobs on Saga and reserve the entire cluster. No new jobs will be accepted until the problem is resolved.
Users can still login to Saga login nodes.
We are sorry for any inconvenience this may have caused.
We will keep you updated as we progress.
Update: We are trying to repair the file system without killing all jobs. It might not work, at least not for all jobs. In the meantime, we have closed access to the login nodes to avoid further damage to the file system.
Update 14:15: Problem resolved, and Saga is open again. Please check your running jobs; some jobs may have crashed.
The source of the problem is related to the underlying filesystem (XFS) and the kernel we are currently running. We scanned the underlying filesystem on our OSS servers to eliminate possible data corruption on the /cluster filesystem, and we also updated the kernel on the OSS servers.
Please don’t hesitate to contact us if you have any questions.
Slurm was upgraded to the latest version (19.05.3-2) on Saga today. This includes a fix for the problem with using “srun” for running interactive jobs.
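As a quick way to verify the fix, an interactive job can be requested with srun along these lines (the account name and resource values are placeholders; substitute your own project):

```shell
# Hypothetical interactive-job request; replace the account with your own project.
srun --account=nn9999k --ntasks=1 --mem-per-cpu=4G --time=00:30:00 --pty bash -i
```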
Please let us know if you notice anything that has gone wrong after the upgrade.
Dear Fram User,
As of today, we have adjusted the queue system policies to facilitate code development and testing on Fram, while limiting possible misuse of the devel queue.
The devel queue is now adjusted to allow:
- max 4 nodes per job
- max 30 minutes wall time
- max 1 job per user
We have additionally introduced a short queue with the following settings:
- max 10 nodes per job
- max 120 minutes wall time
- max 2 jobs per user
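For example, a job within the limits above could be submitted to the new short queue roughly like this (the QOS name, script name, and resource values are assumptions for illustration):

```shell
# Hypothetical submission to the short queue; values are examples only.
sbatch --qos=short --nodes=2 --time=01:00:00 job_script.sh
```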
We will continue to monitor and improve the queue system. Please stay tuned.
You may find more information here.
As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.
This has enabled us to replace the limit of how many jobs per user the system will try to backfill with a more flexible setting.
Jobs are started and backfilled in priority order, and the priority of a job increases over time while it is pending. With the new setting, only a fixed number of each user’s jobs within a project will increase in priority. The number is currently 10, but will probably be adjusted over time. This setting will make it easier for all users and projects to get jobs started when one user has submitted a large number of jobs over a short time, while at the same time not preventing jobs from starting when there are free resources.
Note that the setting is per user and project, so if more than one user submits jobs in the same project, each of them will get 10 jobs with increasing priority. Similarly, if one user submits jobs to several projects, that user will get 10 jobs with increasing priority in each project.
The setting is documented here.
Update: The upgrade is now done. All seems to have gone well, but it can be a good idea to check your jobs.
We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.
Dear Fram User,
We are working on improving the queue system on Fram for best resource usage and user experience.
There is ongoing work to test new features in the latest versions of our queue system and to apply them in production as soon as we are sure they will not have a negative impact on jobs.
To give all users a more even chance to get their jobs started, we have now limited the number of jobs per user that the system will try to backfill.
We will keep you updated with new features as we implement them.