We are diagnosing reports of filesystem errors on Betzy. Subsequent diagnostics might make the filesystem unstable for a couple of hours. Hopefully users will not notice.
Update 11.08 @ 13:55:
We are running fsck as well as rebuilding some internal metadata. This process can degrade the performance of filesystem operations in submitted jobs. Due to the nature of these checks, the work is expected to last until tomorrow.
Update 12.08 @ 10:45:
Work on the filesystem is still ongoing, and disruptions may still occur during the day.
Update 15.08 @ 10:55:
The filesystem issue is resolved and the Betzy filesystem is working. If you experience any problems, please contact support.
We have recently discovered that using ‘--gpus-per-task’ on Saga leads to wrong accounting within the Slurm system. This has two effects. First, the job will not be scheduled as quickly as it should, because Slurm thinks the job requires more resources than it asks for. Second, the job will be charged more project hours than it should.
This is a bug in the Slurm batch system which we are trying to fix as quickly as possible.
For now, we recommend that all GPU users revert to ‘--gpus’ or ‘--gpus-per-node’, which we have verified behave as they should.
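As an illustration of the recommendation above, a job script header might look like the minimal sketch below. The account name, partition, GPU count, and time limit are placeholders chosen for this example and are not taken from this announcement; adjust them to your own project. The only point is that the GPU request uses the job-level option rather than the per-task one.

```shell
#!/bin/bash
# Hypothetical example values: replace account, partition, and limits with your own.
#SBATCH --account=nnXXXXk
#SBATCH --partition=accel
#SBATCH --time=00:10:00
#SBATCH --ntasks=2
# Request GPUs for the job as a whole instead of per task
# (avoid --gpus-per-task until the accounting bug is fixed).
#SBATCH --gpus=2

# Print the GPUs exposed to the job, if the variable is set by the batch system.
echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-unset}"
```

If your job needs a fixed number of GPUs on every node, the same line can instead be written with ‘--gpus-per-node’.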
There is currently an issue on Betzy with the batch system which results in jobs not completing and new jobs not being started.
We are currently investigating the issue and will update once we know what caused it and how it can be resolved.
[Update 14:22]: Job submission is working again. The affected users were unfortunately victims of a batch system restart that happened at the same time as their jobs were submitted.