Betzy: Ongoing problems with srun and InfiniBand

Betzy is experiencing issues with srun and InfiniBand that intermittently affect users. The srun problem first appeared last week; the InfiniBand problem has been present longer.

The srun problem shows up as error messages related to send/recv operations. If you see such error messages, you are probably affected. A possible workaround is to rerun the srun job until it succeeds.
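The rerun workaround can be automated with a small wrapper. The following is only a minimal sketch, not an official tool: the function name `run_with_retries`, its parameters, and the example command are hypothetical, and it assumes that a failed srun step exits with a non-zero status.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=5, delay_seconds=30.0):
    """Run `command` until it exits with status 0 or attempts run out.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails. Assumes a failed run reports a non-zero exit status.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"Attempt {attempt} failed with exit code {result.returncode}")
        # Wait before the next try, but not after the final attempt.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"Command failed after {max_attempts} attempts: {command}")


# Hypothetical usage; replace ./my_program with your own job step:
# run_with_retries(["srun", "./my_program"])
```

Note that retries inside a batch job count against the job's time limit, so keep the attempt count and delay small.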

The InfiniBand problem shows up as messages of the type "Transport retry count exceeded". At the moment there is no workaround for this problem.

The expert team is working with the hardware vendors to solve the problems. We are very sorry for the inconvenience and can assure you that the team is working hard on this.

Update 15:00 05.02: We have eliminated the srun issue, and srun should work as expected. Please report to us if you see any problem related to srun. The InfiniBand problem is most probably due to bad nodes. We are still working on it, and we are seeing fewer InfiniBand-related problems now that we have isolated the bad nodes.

Update 14:00 19.02: The InfiniBand issues have been resolved as well. Please report to us if you see any further problems.