Current incidents
We have fixed the leak and the system is back up. It is currently running at 60% capacity over the weekend while we monitor the cooling solution and pumps. We are working with Uniq, the service provider for the cooling solution, and will increase to full load capacity as soon as possible.
Sorry for the inconvenience this has caused.
Best Regards,
Infra team.
We have discovered that on Betzy, in MPI jobs with more than 37 nodes, starting the executable with srun fails with the error message “slurmstepd: error: Attempting to create node record past MaxNodeCount:0”. The fix is to use srun --mpi=pmix instead. This happens with all MPI modules we have tested.
We don’t know the reason for this, but we suspect it started quite recently. We are currently investigating the issue.
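As an illustration of the fix described above, a job script might look roughly like the sketch below. The job name, node count, time limit and executable name are placeholders we made up for the example, and account and partition options are left out, so adapt it to your own project. The only part taken from this notice is the --mpi=pmix option to srun.

```shell
#!/bin/bash
#SBATCH --job-name=pmix-example   # placeholder job name (assumption)
#SBATCH --nodes=40                # example value above the 37-node threshold
#SBATCH --time=00:10:00           # example time limit (assumption)

# Placeholder executable; replace with your own MPI program.
EXE="${EXE:-./my_mpi_program}"

# Select the PMIx plugin explicitly instead of relying on srun's default.
MPI_FLAG="--mpi=pmix"

if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Inside a Slurm allocation: start the MPI program.
    srun "$MPI_FLAG" "$EXE"
else
    # Outside Slurm: only print the command that would be run.
    echo "Would run: srun $MPI_FLAG $EXE"
fi
```

Running srun --mpi=list on a login node shows which MPI plugin types the local Slurm installation offers, which can help confirm that pmix is available before submitting.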
Three notes:
- So far, we have only seen this on Betzy.
- It does not...