Summary: We’ve identified the root cause of the recent issues on Saga, and corrected it. We see significant improvements in the responsiveness of the file system, even though the load is still high.
Details for the curious: When the latest expansion of Saga was installed, a number of file system servers were incorrectly connected to the spine level of the Infiniband network. This didn’t cause immediate problems, but as the load on the new servers grew, they eventually consumed all of the available links, causing intermittent connection losses.
This was misinterpreted as a sign that something was wrong with the BeeGFS file system, since its slowness was the most noticeable symptom. I started investigating and quite quickly found and corrected a number of tuning parameters that had not been applied correctly. While that did improve performance somewhat, the Infiniband topology was still broken, and inevitably the severe sluggishness of /cluster returned.
After checking and double-checking everything I could think of on the file system side, I eventually got frustrated and turned to an old sysadmin trick: drop all your assumptions and start looking at the problem from a different angle. This led me to inspect the Infiniband fabric to look for problems there – and in an instant the real problem presented itself.
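For readers who want to see what such a fabric check amounts to, here is a minimal sketch in Python. It is an illustration only, not the tooling that was used on Saga. It assumes a conventional leaf-spine layout in which servers attach only to leaf switches, and all node names, roles and links below are hypothetical.

```python
from typing import Dict, List, Tuple


def hosts_on_spine(roles: Dict[str, str], links: List[Tuple[str, str]]) -> List[str]:
    """Return the hosts that have a direct link to a spine switch.

    roles maps a node name to "spine", "leaf" or "host" (hypothetical labels).
    links is a list of undirected (node, node) cable connections.
    """
    misplaced = set()
    for a, b in links:
        # A link is undirected, so check both orientations.
        for host, switch in ((a, b), (b, a)):
            if roles.get(host) == "host" and roles.get(switch) == "spine":
                misplaced.add(host)
    return sorted(misplaced)


# Hypothetical example fabric: one server is cabled to the spine by mistake.
roles = {
    "spine-1": "spine",
    "leaf-1": "leaf",
    "storage-a": "host",
    "storage-b": "host",
}
links = [
    ("leaf-1", "spine-1"),     # normal leaf-to-spine uplink
    ("storage-a", "leaf-1"),   # server on a leaf switch, as intended
    ("storage-b", "spine-1"),  # server on a spine switch, the kind of miswiring described above
]

print(hosts_on_spine(roles, links))  # ['storage-b']
```

On a real fabric the role and link data would have to come from a topology dump of the network; how that is collected is outside the scope of this sketch.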
In situations like this I always get mixed feelings. Relief that I now know what was going on, and frustration that I didn’t find it before – it was so obvious once I looked in the right place. I feel bad for you, the users, because I did not catch this sooner. I also feel good because I managed to identify the problem, which had eluded both the vendor’s engineers and our own technical staff for so long.
We still have some minor issues to resolve, but these should not have a major impact on performance or cause downtime or disruptions.
On behalf of the NRIS staff, I apologize for the inconvenience this has caused and thank you for your patience and understanding. I also hope you found this longer format opslog an interesting look into what goes on behind the scenes.
Wishing you all happy computing on Saga in the future!