Slurm controller issues on Betzy affecting new jobs

Minor incident High Performance Computing Betzy
2024-05-29 10:04 CEST · 2 weeks, 4 days, 21 minutes

Updates

Update

We’ve now set the service to restart automatically, so the impact should be smaller.

May 29, 2024 · 16:38 CEST

Issue

The Slurm controller service on Betzy is crashing.

When this happens, new jobs cannot be started and jobs coming to the end of a step (srun/mpirun) will crash when trying to communicate with the Slurm controller.

We are investigating the cause so that we can fix the issue as soon as possible. For now, we are manually restarting the service as soon as we are notified of a crash.

May 29, 2024 · 10:04 CEST
