Problems with MPI jobs after Slurm upgrade
Minor incident
High Performance Computing
Betzy
2024-04-29 15:00 CEST
· 2 days, 22 hours, 11 minutes
Updates
Update
April 29, 2024 · 15:43 CEST
Rebooting helps, so we are rebooting all idle nodes now, and occupied nodes will be rebooted before they run new jobs.
Issue
April 29, 2024 · 15:00 CEST
There seems to be a problem running MPI jobs on Betzy since the Slurm upgrade this morning.
So far it looks like OpenMPI jobs that use mpirun are affected. A workaround that seems to work is using srun instead.
We believe the cause is that the compute nodes must be rebooted before they pick up the new Slurm version. We are testing this, and if it is confirmed, we will reboot nodes as soon as they are idle.
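As an illustration of the srun workaround described above, here is a minimal sketch of a job script. The job name, account, node and task counts, time limit, and program name are all placeholders rather than values from this notice, so adjust them to your own project and to Betzy's job limits.

```shell
#!/bin/bash
# Minimal sketch of a Slurm batch script. All values below are placeholders.
#SBATCH --job-name=mpi_test
#SBATCH --account=YOUR_ACCOUNT     # placeholder: replace with your project account
#SBATCH --nodes=2                  # placeholder node count
#SBATCH --ntasks-per-node=4        # placeholder task count per node
#SBATCH --time=00:10:00            # placeholder wall-time limit

# Load the same modules your program was built with here (site-specific, not shown).

# Affected launch line (Open MPI's own launcher):
#   mpirun ./my_mpi_program

# Workaround: let Slurm start the MPI ranks directly.
# ./my_mpi_program is a hypothetical executable name.
srun ./my_mpi_program
```

With srun, the number of ranks is taken from the job's allocation, so the rank count usually does not need to be repeated on the launch line.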