One of the core InfiniBand switches on Betzy shut down. This caused connectivity issues for the queueing system, which prevented jobs from being scheduled and run. We are still investigating to find the exact cause.
The Slurm controller on Fram has crashed; we are investigating.
Update 08.07.2022 14:00: The Fram workload manager node (Slurm) crashed again and all running jobs died.
A hardware issue has been discovered; we are in contact with the vendor and running tests. We will keep you updated.
Access to the login nodes will remain open until the planned Fram downtime.
We have recently discovered that using '--gpus-per-task' on Saga leads to wrong accounting within the Slurm system. This has two effects: first, the job will not be scheduled as quickly as it should, because Slurm thinks the job requires more resources than it asked for; second, the job will be deducted more project hours than it should.
This is a bug in the Slurm batch system which we are trying to fix as quickly as possible.
For now, we recommend that all GPU users switch to '--gpus' or '--gpus-per-node', which we have verified behave as they should.
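As a sketch of the workaround, a job script might look like the one below. The account name, partition, and executable are placeholders; adjust them to your own project.

```shell
#!/bin/bash
# Hypothetical Saga job script illustrating the workaround: request GPUs
# per node (or per job) instead of per task.
#SBATCH --account=nnXXXXk        # placeholder project account
#SBATCH --partition=accel
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00

# Affected by the accounting bug (avoid for now):
#   #SBATCH --gpus-per-task=1

# Recommended alternatives:
#SBATCH --gpus-per-node=2        # two GPUs on the node, shared by the tasks
# or: #SBATCH --gpus=2           # total GPU count for the whole job

srun ./my_gpu_program            # placeholder executable
```

Both alternatives request the same total number of GPUs as one GPU per task would for this two-task job, so the job's resource footprint is unchanged.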
There is currently an issue on Betzy with the batch system which results in jobs not completing and new jobs not being started.
We are currently investigating the issue and will update once we know what caused it and how it can be resolved.
[Update 14:22]: Job submission is working again. The affected users were unfortunately victims of a batch system restart that happened at the same time as their jobs were submitted.