--gpus-per-task not working correctly on Saga

We have recently discovered that using ‘--gpus-per-task’ on Saga leads to wrong accounting within the Slurm system. This has two effects. First, the job will not be scheduled as quickly as it should, because Slurm thinks the job requires more resources than it asks for. Second, the job will be charged more project hours than it should be.

This is a bug in the Slurm batch system which we are trying to fix as quickly as possible.

For now, we recommend that all GPU users revert to ‘--gpus’ or ‘--gpus-per-node’, both of which we have verified behave as they should.
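To help find job scripts that still use the affected option, here is a minimal sketch in Python. It only scans `#SBATCH` directive lines and does not look at options passed on the `sbatch` command line. The function name `find_gpus_per_task`, the account `nnXXXXk` and the program `./my_program` are hypothetical placeholders, not part of any Saga tooling.

```python
import re

# Matches the affected option in both "--gpus-per-task=2" and "--gpus-per-task 2" forms.
AFFECTED = re.compile(r"--gpus-per-task(?=[=\s]|$)")


def find_gpus_per_task(script_text):
    """Return (line_number, line) pairs for #SBATCH lines that use --gpus-per-task."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#SBATCH") and AFFECTED.search(stripped):
            hits.append((number, stripped))
    return hits


# Hypothetical job script, used only to illustrate the check.
example_script = """#!/bin/bash
#SBATCH --account=nnXXXXk
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1
#SBATCH --gpus-per-node=2
srun ./my_program
"""

for number, line in find_gpus_per_task(example_script):
    print(f"line {number}: {line}")
```

Any line the sketch reports should be changed by hand to use ‘--gpus’ or ‘--gpus-per-node’; the right replacement value depends on how many tasks and nodes the job requests.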

[FINISHED] Saga downtime 17th November 12:00 - 19th November 15:00

[Update: 2021-11-19 09:50] The maintenance work is now done, and Saga is back in full production and running jobs as normal.

[Update, 2021-11-18 12:40] The login nodes are ready for users, who can now access their data and work with it. Compute nodes are still under maintenance, so running jobs is not yet possible.

[Update, 2021-11-17 12:00]: The maintenance has now started.

We will conduct firmware updates and maintenance on all of Saga next week, starting on Wednesday 17th at 12:00.

Downtime will last until 15:00 on Friday 19th, but we will bring back access to the login nodes and file system as soon as the upgrade is done on vital parts of the system. Compute nodes will be brought back sequentially as they are updated.

[Solved] Saga file system performance issue

We’re aware of ongoing issues with the file system performance on Saga and are investigating the cause. This also affects logging in to Saga, where the terminal will hang waiting for a prompt.

Updates will be provided in this post as soon as we have more information to share.

Sorry for the inconvenience.

Update 2021-07-15, 16:33: The issue was identified as a faulty connection between the storage server and the cluster. Performance should be back to normal, but we will monitor the system a bit more before declaring it healthy.
Update 2021-07-14, 15:00: We’ve discovered some faulty drives that are currently being swapped. We hope that the performance will improve once these are in production again.
Update 2021-07-13, 10:03: The file system is a bit more stable now, but we’re still looking into the cause for the degraded performance.

[DONE] Saga Maintenance Stop 23–24 June

[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the dusage command output to only show one set of quotas (pool 1).

[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.

[2021-06-23 12:00] The maintenance has now started.

[UPDATE: The correct dates are June 23–24, not July]

There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.

During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.

All compute nodes and login nodes will be shut down during this time, and no jobs will be running during this period. Submitted jobs estimated to run into the downtime reservation will be held in queue.
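As a rough illustration of which jobs are affected, here is a minimal sketch in Python. It assumes a simplified model in which a job is held if its earliest possible start time plus its requested time limit extends past the start of the downtime. The real decision is made by Slurm, and the function name `runs_into_downtime` is hypothetical.

```python
from datetime import datetime, timedelta

# Start of the maintenance stop as announced: June 23 at 12:00 (2021).
DOWNTIME_START = datetime(2021, 6, 23, 12, 0)


def runs_into_downtime(earliest_start, time_limit, downtime_start=DOWNTIME_START):
    """Return True if a job starting at earliest_start could still be running at downtime_start."""
    return earliest_start + time_limit > downtime_start


# Hypothetical example: a job that could start on June 22 at 18:00.
start = datetime(2021, 6, 22, 18, 0)
print(runs_into_downtime(start, timedelta(hours=12)))  # would end at 06:00 on June 23
print(runs_into_downtime(start, timedelta(hours=24)))  # would end at 18:00 on June 23
```

Under this model, requesting a shorter time limit is the only way for a job to fit in before the stop.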

Slow file system on Saga

We’re experiencing a very slow file system on Saga at the moment and are working on identifying the cause.

Update 13:09: The file system is much more responsive now, but we’re still seeing that logins are hanging for ~30 seconds before getting access to the file system. This is being investigated further.

Updates will be provided once we have more information.

Sorry about the inconvenience.

New compute nodes on Saga

Today, Saga has been extended with 120 new compute nodes, increasing the total number of CPUs on the cluster from 9824 to 16064.

The new nodes have been added to the normal partition. They are identical to the old compute nodes in the partition, except that they have 52 CPU cores instead of 40.
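The figures above are consistent with each other, as this short Python check shows. The variable names are illustrative only.

```python
# Figures taken from the announcement.
old_total_cpus = 9824
new_nodes = 120
cores_per_new_node = 52

added_cpus = new_nodes * cores_per_new_node
new_total_cpus = old_total_cpus + added_cpus

print(added_cpus)      # 6240
print(new_total_cpus)  # 16064
```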

We hope this extension will reduce the wait time for normal jobs on Saga.