Slurm controller issues on Betzy affecting new jobs
Minor incident
High Performance Computing
Betzy
2024-05-29 08:04
· 1 month, 4 days, 41 minutes
Updates
Times are shown in Africa/Abidjan timezone
Update
May 29, 2024 · 14:38
We’ve now set the service to restart automatically, so the impact should be smaller.
Issue
May 29, 2024 · 08:04
The Slurm controller service on Betzy is crashing.
When this happens, new jobs cannot be started, and jobs reaching the end of a step (srun/mpirun) will crash when they try to communicate with the Slurm controller.
We are investigating the cause so that we can fix the issue as soon as possible. For now, we are manually restarting the service as soon as we are notified.