Saga sluggishness issues resolved

Summary: We’ve identified the root cause of the recent issues on Saga, and corrected it. We see significant improvements in the responsiveness of the file system, even though the load is still high.

Details for the curious: When the latest expansion of Saga was installed, a number of file system servers were incorrectly connected to the spine level of the InfiniBand network. This didn’t cause immediate problems, but as the new servers became more and more loaded they eventually consumed all of the available links, causing intermittent connection losses.

This was misinterpreted as a sign that something was wrong with the BeeGFS file system, since its slowness was the most noticeable symptom. I started investigating and quite quickly found and corrected a number of tuning parameters that had not been applied correctly. While that did improve performance somewhat, the InfiniBand topology was still broken, and inevitably the severe sluggishness of /cluster returned.

After checking and double-checking everything I could think of on the file system side, I eventually got frustrated and turned to an old sysadmin trick: drop all your assumptions and start looking at the problem from a different angle. This led me to inspect the InfiniBand fabric for problems – and in an instant the real problem presented itself.

In situations like this I always get mixed feelings: relief that I now know what was going on, and frustration that I didn’t find it sooner – it was so obvious once I looked in the right place. I feel bad for you, the users, for not having caught this earlier. I also feel good because I managed to identify the problem, which had eluded both the vendor’s engineers and our own technical staff for so long.

We still have some minor issues to resolve, but these should neither have a major impact on performance nor cause downtime or disruptions.

On behalf of the NRIS staff, I apologize for the inconvenience this has caused and thank you for your patience and understanding. I also hope you found this longer-format opslog entry an interesting look into what goes on behind the scenes.

Wishing you all happy computing on Saga in the future!

Best regards,
Andreas Skau

Slow filesystem on Saga

Several users have reported slowness and poor performance on the Saga filesystem after the recent maintenance stop. We’ve been investigating, and found a likely cause: tuning parameters that were set improperly.

We are working on correcting this as soon as possible. Thank you for your reports, and sorry for the inconvenience.

UPDATE Nov. 9 15:26 CET: We have identified a likely root cause of the issue, and corrected it. The performance should now be back to what it was before the scheduled maintenance. We will continue our investigations to improve the service even further.

Home directory file permissions

In accordance with the Data handling and Storage policy we will shortly enable automatic enforcement of file permissions on your home directories. We expect this to take place after the next maintenance stop.

This means that you may no longer grant other users/groups read or write access to your home directory. Any sharing of data between users must be done through project or work directories.
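
If you want to check ahead of time whether your home directory currently grants access to anyone else, something like the following Python sketch may help. It rests on an assumption that is not confirmed by this notice: that "private" means no group or other permission bits are set on the home directory itself. The helper names are made up for illustration, and the sketch looks only at classic Unix mode bits, not at ACLs or at files further down the tree.

```python
import stat
from pathlib import Path

# Assumption: "private" = no group/other permission bits on the directory.
SHARED_MASK = stat.S_IRWXG | stat.S_IRWXO


def is_private(mode: int) -> bool:
    """Return True if the mode grants nothing to group or others."""
    return mode & SHARED_MASK == 0


def private_mode(mode: int) -> int:
    """Return the permission bits with group/other access removed."""
    return stat.S_IMODE(mode) & ~SHARED_MASK


def describe(path: Path) -> str:
    """Summarise whether a directory is shared with group/others."""
    mode = path.stat().st_mode
    if is_private(mode):
        return f"{path}: private ({stat.filemode(mode)})"
    return (
        f"{path}: shared ({stat.filemode(mode)}), "
        f"owner-only mode would be {oct(private_mode(mode))}"
    )
```

Calling `print(describe(Path.home()))` reports the current state without changing anything; whether and how to tighten the permissions yourself is left to you.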

We take this opportunity to remind you that your home directory contents are treated as private data by the Metacenter staff and will not be shared with other users, not even with your supervisor or project leader, without your prior written consent. Should you be unable to give consent, requests will be handled in accordance with applicable laws and regulations.

Please remember to share necessary data as required before changing jobs, taking a leave of absence, and so on.

Best regards,

the Metacenter security team