srun in large MPI jobs on Betzy fails

Minor incident High Performance Computing Betzy
2024-03-12 10:22 CET · 4 weeks, 1 day, 23 hours, 20 minutes

Updates

Resolved

The problem has been fixed now.

April 11, 2024 · 10:42 CEST
Investigating

We have discovered that on Betzy, in MPI jobs with more than 37 nodes, starting the executable with srun fails with the error message “slurmstepd: error: Attempting to create node record past MaxNodeCount:0”. The failure occurs with all MPI modules we have tested. The workaround is to use srun --mpi=pmix instead.

We do not yet know the cause, but suspect that the problem started quite recently. We are currently investigating the issue.

Three notes:

  • So far, we have only seen this on Betzy.
  • It does not happen with mpirun.
  • IntelMPI does not work with pmix, so for now larger IntelMPI jobs need to use mpirun.
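
The workaround above could be captured in a job script along the following lines. This is only a sketch: the helper name choose_launcher, the flavour labels "intel" and "openmpi", and the program name in the usage comment are hypothetical, and the 37-node threshold is simply the value we have observed so far.

```shell
# Hypothetical helper (not part of Slurm or any MPI module): print the
# launch command suggested by the workaround described above.
#   $1 = MPI flavour, e.g. "intel" or "openmpi" (labels are assumptions)
#   $2 = number of nodes in the job
choose_launcher() {
    flavour="$1"
    nodes="$2"
    if [ "$nodes" -le 37 ]; then
        # Jobs with at most 37 nodes have not shown the failure.
        echo "srun"
    elif [ "$flavour" = "intel" ]; then
        # IntelMPI does not work with pmix, so fall back to mpirun.
        echo "mpirun"
    else
        # Larger jobs with other MPI modules: use the pmix plugin.
        echo "srun --mpi=pmix"
    fi
}

# Possible use inside a job script (my_mpi_program is a placeholder):
#   $(choose_launcher openmpi "$SLURM_JOB_NUM_NODES") ./my_mpi_program
```

In most cases it is simpler to write srun --mpi=pmix or mpirun directly in the job script; the helper only makes the decision explicit.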
March 12, 2024 · 10:22 CET
