Slurm controller issues on Betzy affecting new jobs
Minor incident
High Performance Computing
Betzy
2024-05-29 10:04 CEST
· 1 month, 4 days, 41 minutes
Updates
Update
May 29, 2024 · 16:38 CEST
We’ve now set the service to restart automatically, so the impact should be smaller.
Issue
May 29, 2024 · 10:04 CEST
The Slurm controller service on Betzy is crashing.
When this happens, new jobs cannot be started and jobs coming to the end of a step (srun/mpirun) will crash when trying to communicate with the Slurm controller.
We are investigating the cause so that we can fix the issue as soon as possible. For now, we restart the service manually as soon as we are notified of a crash.
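Since new jobs cannot be started while the controller is down, a submission script could check that the controller responds before it submits anything. The following is a minimal sketch of that idea, not an official workaround. It assumes that `scontrol ping` is available on the login node and that its output contains the word "UP" when the controller is reachable. The function names, the retry count and the delay are hypothetical choices made for illustration.

```python
import subprocess
import time


def controller_is_up(ping_cmd=("scontrol", "ping")):
    """Return True if the ping command succeeds and reports the controller as UP.

    Assumption: `scontrol ping` prints a status line containing "UP" when the
    Slurm controller answers. Adjust the check if your output differs.
    """
    try:
        result = subprocess.run(
            ping_cmd, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hanging: treat as unreachable.
        return False
    return result.returncode == 0 and "UP" in result.stdout


def wait_for_controller(attempts=10, delay_seconds=30.0, check=controller_is_up):
    """Poll until the controller responds, or give up after `attempts` tries.

    `check` is injectable so the retry logic can be exercised without Slurm.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

A wrapper script might call `wait_for_controller()` and only run `sbatch` when it returns True. This only helps with new submissions. It does not protect a running job whose step ends while the controller is down.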