Dear Saga Cluster Users:
We have discovered a filesystem issue affecting /cluster on Saga which can lead to data corruption. To examine the problem, we have decided to suspend all running jobs on Saga and reserve the entire cluster. No new jobs will be accepted until the problem is resolved.
Users can still log in to the Saga login nodes.
We are sorry for any inconvenience this may have caused.
We will keep you updated as we progress.
Update: We are trying to repair the filesystem without killing all running jobs. This might not succeed, at least not for all jobs. In the meantime, we have closed access to the login nodes to avoid further damage to the filesystem.
Update 14:15: The problem has been resolved and Saga is open again. Please check your running jobs, as some of them may have crashed.
The source of the problem is related to the underlying filesystem (XFS) and the kernel we are currently running. We scanned the underlying filesystem on our OSS servers to eliminate possible data corruption on the /cluster filesystem, and we also updated the kernel on the OSSes.
Please don’t hesitate to contact us if you have any questions.
Slurm was upgraded to the latest version (19.05.3-2) on Saga today. This includes a fix for the problem with using “srun” for running interactive jobs.
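For reference, an interactive job with srun can be started along these lines (a minimal sketch; the account name and resource values below are placeholders):

    # Start an interactive shell on a compute node (placeholder account and resources)
    srun --account=nnXXXXk --ntasks=1 --mem-per-cpu=4G --time=00:30:00 --pty bash -i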
Please let us know if you notice anything that has gone wrong after the upgrade.
Dear Fram User,
As of today, we have adjusted the queue system policies to facilitate code development and testing on Fram, while limiting possible misuse of the devel queue.
devel is now adjusted to allow (see the example job script after the lists below):
- max 4 nodes per job
- max 30 minutes wall time
- max 1 job per user
We have additionally introduced a short queue with the following settings:
- max 10 nodes per job
- max 120 minutes wall time
- max 2 jobs per user
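As a sketch, a job script targeting the devel queue within these limits could look like the following (assuming devel is selected with --qos; the account name and program are placeholders):

    #!/bin/bash
    #SBATCH --account=nnXXXXk   # placeholder project account
    #SBATCH --qos=devel         # assumption: the queue is selected as a QOS
    #SBATCH --nodes=2           # within the max 4 nodes per job
    #SBATCH --time=00:20:00     # within the max 30 minutes wall time

    srun ./my_test_program      # placeholder executable

For the short queue, the same script would use --qos=short with at most 10 nodes and 120 minutes of wall time.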
We will continue to monitor and improve the queue system. Please stay tuned.
You may find more information here.
As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.
This has enabled us to replace the per-user limit on how many jobs the system will try to backfill with a more flexible setting.
Jobs are started and backfilled in priority order, and the priority of a pending job increases over time. With the new setting, only a fixed number of each user’s jobs within a project will accrue priority. The number is currently 10, but will probably be adjusted over time. This setting makes it easier for all users and projects to get jobs started when one user has submitted a large number of jobs over a short time, while at the same time not preventing jobs from starting when there are free resources.
Note that the setting is per user and per project, so if more than one user submits jobs in the same project, each of them will have 10 jobs accruing priority, and similarly, if one user submits jobs to several projects, that user will have 10 jobs accruing priority in each project.
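One way to inspect how Slurm weights your pending jobs is the sprio command, which lists the computed priority factors per job:

    # Show the priority factors of your own pending jobs
    sprio -u $USER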
The setting is documented here.
Update: The upgrade is now done. All seems to have gone well, but it can be a good idea to check your jobs.
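For example, to get an overview of your queued, running and recently finished jobs:

    # List your queued and running jobs
    squeue -u $USER
    # List today's jobs and their states (adjust the start time as needed)
    sacct -X -u $USER -S today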
We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.
Dear Fram User,
We are working on improving the queue system on Fram for best resource usage and user experience.
There is ongoing work to test new features in the latest versions of our queue system and to apply them in production as soon as we are sure they will not have a negative impact on jobs.
To give all users a more even chance to get their jobs started, we have now limited the number of jobs per user that the system will try to backfill.
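If a job waits longer than expected, the reason Slurm reports for it being pending (e.g. Priority or Resources) can be checked, for instance:

    # Show pending jobs together with the reason they are waiting
    squeue -u $USER -t PENDING -o "%.10i %.9P %.20j %.8T %R"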
We will keep you updated with new features as we implement them.
A sudden pressure spike in the cooling system around 13:00 took one of the cooling units down, and restarting it affected the other unit as well.
This triggered a safety stop for some of the compute nodes, causing some running jobs to crash prematurely.
Affected jobs have been re-queued.
Apologies for the inconvenience this has caused.
Fram has been in production for half a year now, and we’ve gathered enough data to see possible improvements to the defaults. One such improvement concerns how jobs are placed with regard to the island topology on Fram. The way Fram is built, the network bandwidth within an island is far better than between islands. For certain types of jobs spanning many compute nodes, being spread over multiple islands can have a negative impact on performance.
To limit this effect, we have now changed the default setup so that each job will run within one island, if that does not delay the job too much, as described here:
Note that this may lead to longer wait times in the queue, in particular for larger jobs. If your job does not depend on high network throughput, the above-mentioned document also describes how to override the new default.
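Purely as an illustration: if the placement constraint is implemented via Slurm’s network topology support, a job that tolerates spanning islands could relax it with the --switches option, roughly like this (the value is a placeholder; see the linked documentation for the recommended override on Fram):

    # Allow the job to span up to two top-level switches (islands)
    # instead of requiring a single island
    #SBATCH --switches=2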
The Slurm queue system hung on Fram.
The problem has been remediated and the queue system is functional again since approximately 09:55.