Maintenance Stops on Saga, Fram and Betzy

[Update, 2022-04-30 11:10] The Fram and Saga maintenance is now over, and jobs are running again.

[Update, 2022-04-29 08:00] The Fram and Saga maintenance has now started.

[Update, 2022-04-28 12:56] The Betzy maintenance is now over, and jobs are starting again.

[Update, 2022-04-28 08:00] The Betzy maintenance has now started.

There will unfortunately be maintenance stops on all NRIS clusters next week, for an important security update. The stops are scheduled as follows:

  • Betzy: Thursday, April 28 at 08:00
  • Fram and Saga: Friday, April 29 at 08:00

We expect the stops will last a couple of hours. We have set up maintenance reservations on all nodes on the clusters, so jobs that would have run into the reservation will be left pending in the job queue until after the maintenance stop.
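
As a rough illustration of how the reservation affects scheduling: a job can only start before the stop if its requested time limit fits in the time remaining before the reservation begins. The sketch below computes that remaining time in minutes. It assumes GNU date (for the -d option); the function name minutes_until is hypothetical, and the timestamps are simply the Betzy start time from this announcement and an example reference time.

```shell
# Hypothetical helper: whole minutes from a reference time until a maintenance stop.
# Assumes GNU date, which accepts free-form timestamps via -d.
#   $1: start of the maintenance reservation, e.g. '2022-04-28 08:00'
#   $2: optional reference time (defaults to the current time)
minutes_until() (
    start=$(date -d "$1" +%s)
    now=$(date -d "${2:-now}" +%s)
    echo $(( (start - now) / 60 ))
)

# Example: the evening before the Betzy stop
minutes_until '2022-04-28 08:00' '2022-04-27 20:00'   # prints 720
```

A job submitted at that point with a time limit under 12 hours (for example --time=11:30:00) could still start before the stop, resources permitting. Jobs asking for more time would wait in the queue until the maintenance is over.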

We are sorry for the inconvenience this creates. We had hoped to be able to apply the security update with jobs running, but that turned out not to be possible.

[Resolved] NIRD mount unavailable on Saga and Betzy

We have identified that the NIRD mount is unavailable on Saga and Betzy. We are working on finding the cause and putting a fix in place.

[Update, 2022-03-28 13:20] The mounts should be back now. The problem was caused by Friday’s maintenance on network gear …

We hope this has not caused too much frustration, and we wish everyone a very nice day!

NRIS HPC staff

Downtime on Saga and Betzy, Thursday February 3.

There will be a short maintenance stop of Saga and Betzy on Thursday, February 3 at 15:00 CET, due to work on the cooling system in the data hall. The downtime is planned to last for three hours.

During the downtime, no jobs will run, but the login nodes and the /cluster file system will be up. Jobs that cannot finish before 15:00 on February 3 will be left pending in the queue until after the stop.

--gpus-per-task not working correctly on Saga

We have recently discovered that using ‘--gpus-per-task’ on Saga leads to wrong accounting within the Slurm system. This has two effects. First, the job will not be scheduled as quickly as it should, because Slurm thinks the job requires more resources than it asks for. Second, more project hours will be deducted for the job than should be.

This is a bug in the Slurm batch system which we are trying to fix as quickly as possible.

For now, we recommend that all GPU users revert to ‘--gpus’ or ‘--gpus-per-node’, which we have confirmed behave as they should.
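
To check whether an existing job script is affected, something like the following sketch could be used. The function name find_gpus_per_task is hypothetical. It simply searches the script text, so it also reports commented-out lines and srun command lines that mention the option.

```shell
# Hypothetical helper: print the lines (with line numbers) of a job script
# that mention --gpus-per-task. Prints nothing if the option is not used.
#   $1: path to the job script
find_gpus_per_task() {
    # -n prefixes each match with its line number;
    # -e stops grep from reading the pattern itself as an option.
    grep -n -e '--gpus-per-task' "$1"
}
```

For example, find_gpus_per_task myjob.sh (where myjob.sh is a placeholder name) would list each offending line. A directive such as #SBATCH --gpus-per-task=1 can then be changed to use --gpus or --gpus-per-node. Note that these options count GPUs per job and per node rather than per task, so the number may need adjusting for jobs with more than one task.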

[FINISHED] Saga downtime 17th November 12:00 - 19th November 15:00

[Update: 2021-11-19 09:50] The maintenance work is now done, and Saga is back in full production and running jobs as normal.

[Update, 2021-11-18 12:40] The login nodes are ready for users, who can now access and work with their data. The compute nodes are still under maintenance, so running jobs is not yet possible.

[Update, 2021-11-17 12:00] The maintenance has now started.

We will carry out a firmware update and maintenance on all of Saga next week, starting on Wednesday 17th at 12:00.

The downtime will last until 15:00 on Friday 19th, but we will bring back access to the login nodes and file system as soon as the upgrade is done on vital parts of the system. Compute nodes will be brought back sequentially as they are updated.