srun in large MPI jobs on Betzy fails
Minor incident
High Performance Computing
Betzy
2024-03-12 10:22 CET · 4 weeks, 1 day, 23 hours, 20 minutes
Updates
Investigating
March 12, 2024 · 10:22 CET
We have discovered that on Betzy, in MPI jobs with more than 37 nodes, starting the executable with srun
fails with the error message “slurmstepd: error: Attempting to create node record past MaxNodeCount:0”. This happens with all MPI modules we have tested.
The workaround is to use srun --mpi=pmix instead.
We do not yet know the cause, but suspect that the problem started quite recently. We are currently investigating the issue.
Three notes:
- So far, we have only seen this on Betzy.
- It does not happen with mpirun.
- IntelMPI does not work with pmix, so for now one needs to use mpirun for larger IntelMPI jobs.
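The workaround and notes above can be summarised as a small shell helper for job scripts. This is only a sketch: the function name choose_launcher, its "intel" argument value and the NODE_LIMIT variable are hypothetical names introduced for illustration, and the 37-node threshold is simply the value observed in this incident, not a documented limit.

```shell
#!/bin/bash
# Hypothetical helper: pick a launch command following the workaround in
# this notice. The threshold is the value observed in the incident.
NODE_LIMIT=37

choose_launcher() {
    local nodes="$1"        # number of nodes in the job
    local mpi_flavour="$2"  # "intel" for IntelMPI, anything else otherwise

    if [ "$nodes" -le "$NODE_LIMIT" ]; then
        # Jobs at or below the threshold were not reported to be affected.
        echo "srun"
    elif [ "$mpi_flavour" = "intel" ]; then
        # IntelMPI does not work with pmix, so fall back to mpirun.
        echo "mpirun"
    else
        # Larger jobs: request the pmix plugin explicitly.
        echo "srun --mpi=pmix"
    fi
}
```

In a job script this could be used as, for example, `$(choose_launcher "$SLURM_JOB_NUM_NODES" openmpi) ./my_mpi_program`, where ./my_mpi_program is a placeholder for the actual executable.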