Problems with MPI jobs after Slurm upgrade

Minor incident High Performance Computing Betzy
2024-04-29 15:00 CEST · 2 days, 22 hours, 11 minutes

Updates

Resolved

All compute nodes have now been rebooted, and the problem should be resolved.

May 2, 2024 · 13:11 CEST
Update

Rebooting fixes the problem, so we are rebooting all idle nodes now; occupied nodes will be rebooted before they start new jobs.

April 29, 2024 · 15:43 CEST
Issue

There is a problem running MPI jobs on Betzy since the Slurm upgrade this morning.
So far it appears that OpenMPI jobs launched with mpirun are affected. A workaround that seems to work is launching with srun instead.
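
For jobs that currently call mpirun inside a batch script, a minimal sketch of the workaround looks like the following. The account, resource numbers, and program name (./my_mpi_app) are placeholders for illustration, not values from this incident.

    #!/bin/bash
    #SBATCH --account=nnXXXXk
    #SBATCH --job-name=mpi-test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=128
    #SBATCH --time=00:10:00

    # Previously: mpirun ./my_mpi_app
    # Workaround until the nodes are rebooted: let Slurm launch the MPI ranks directly.
    srun ./my_mpi_app
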

We believe the reason for the problem is that the compute nodes must be rebooted before they pick up the new Slurm version. We are testing this, and if it is confirmed, we will reboot nodes as soon as they become idle.

April 29, 2024 · 15:00 CEST
