Updated job statistics in slurm-NNN.out

[2023-01-10 15:40]

We have now updated the job statistics that are printed at the end of the slurm-NNN.out files. The output is updated on Saga and Fram, and will be updated on Betzy shortly.

We hope the new output is more readable, understandable and usable.

It is possible to get the same output in the terminal with the command jobstats -j <jobid> (note: this only works for jobs that have finished). The jobstats command also has a --verbose switch that produces more detailed output, hints, and comments (this will be expanded over time).
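For example, to inspect a finished job from the command line (the job ID below is a placeholder):

    $ jobstats -j 1234567            # the same summary that appears in slurm-1234567.out
    $ jobstats --verbose -j 1234567  # more detailed output, hints and comments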

We have tested the changes on all clusters, but mistakes can slip through, so if you spot any errors or missing output in your jobs, please let us know.

Saga sluggishness issues resolved

Summary: We’ve identified the root cause of the recent issues on Saga, and corrected it. We see significant improvements in the responsiveness of the file system, even though the load is still high.

Details for the curious: When the latest expansion of Saga was installed, a number of file system servers were incorrectly connected to the spine level of the Infiniband network. This didn’t cause immediate problems, but as the new servers became more and more loaded, they eventually saturated all of the available links, causing intermittent connection losses.

This was misinterpreted as a sign that something was wrong with the BeeGFS file system, since its slowness was the most noticeable symptom. I started investigating and quite quickly found and corrected a number of tuning parameters that had not been applied correctly. While that did improve performance somewhat, the Infiniband topology was still broken, and inevitably the severe sluggishness of /cluster returned.
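For context, "tuning parameters" here means the kernel and I/O settings that BeeGFS recommends for its storage servers. A minimal sketch of the kind of check involved, with illustrative device names and values rather than Saga’s actual configuration:

    # On a storage server: compare the live settings with the intended tuning
    $ cat /sys/block/sdb/queue/scheduler       # e.g. expected: deadline
    $ cat /sys/block/sdb/queue/read_ahead_kb   # e.g. expected: 4096
    $ sysctl vm.dirty_background_ratio vm.dirty_ratio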

After checking and double-checking everything I could think of on the file system side, I eventually got frustrated and turned to an old sysadmin trick: drop all your assumptions and start looking at the problem from a different angle. This led me to inspect the Infiniband fabric for problems, and in an instant the real problem presented itself.
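For those curious what inspecting an Infiniband fabric looks like in practice: the standard infiniband-diags tools can dump the topology and per-port status, which is one way miscabled spine connections show up. These commands are generic examples, not a transcript of what was run on Saga:

    $ ibnetdiscover > topology.txt   # dump the full fabric topology: switches, hosts and cabling
    $ iblinkinfo                     # per-port link state, width and speed
    $ ibqueryerrors                  # port error counters, e.g. symbol errors and link downs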

In situations like this I always get mixed feelings: relief that I finally knew what was going on, and frustration that I hadn’t found it earlier, because it was so obvious once I looked in the right place. I feel bad for you, the users, that we did not catch this sooner. At the same time, I feel good about having identified a problem that had eluded both the vendor’s engineers and our own technical staff for so long.

We still have some minor issues to resolve, but these should not have a major impact on performance or cause downtime or disruptions.

On behalf of the NRIS staff, I apologize for the inconvenience this has caused and thank you for your patience and understanding. I also hope you found this longer-format opslog an interesting look into what goes on behind the scenes.

Wishing you all happy computing on Saga in the future!

Best regards,
Andreas Skau

Slow filesystem on Saga

Several users have reported slowness and poor performance on the Saga file system after the recent maintenance stop. We’ve been investigating, and have found a likely cause: tuning parameters that were set incorrectly.

We are working on correcting this as soon as possible. Thank you for your reports, and sorry for the inconvenience.

UPDATE Nov. 9 15:26 CET: We have identified a likely root cause of the issue, and corrected it. The performance should now be back to what it was before the scheduled maintenance. We will continue our investigations to improve the service even further.

Saga maintenance stop 2022-10-24

[UPDATE, 2022-10-26 19:30: The maintenance is now over. The login nodes are open, and jobs are running again.]

[UPDATE, 2022-10-24 08:05: The maintenance has now started]

There will be a maintenance stop on Saga starting Monday 2022-10-24 at 08:00. We expect the stop to last three days.

We have set up maintenance reservations on all nodes on the cluster, so jobs that would run into the reservation will be left pending in the queue until after the maintenance stop.
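If you want to check this yourself, Slurm can show the reservation and why a job is pending. A couple of standard commands (the exact reason text may vary):

    $ scontrol show reservation   # lists the maintenance reservation with its start time and node list
    $ squeue -u $USER             # pending jobs typically show a reason like (ReqNodeNotAvail, Reserved for maintenance)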