--gpus-per-task not working correctly on Saga

We have recently discovered that using ‘--gpus-per-task’ on Saga leads to incorrect accounting within the Slurm system. This has two effects. First, the job will not be scheduled as quickly as it should, because Slurm thinks the job requires more resources than it asks for. Second, the job will be charged more project hours than it should.

This is a bug in the Slurm batch system which we are trying to fix as quickly as possible.

For now, we recommend that all GPU users revert to ‘--gpus’ or ‘--gpus-per-node’, which we have verified behave as they should.
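To see whether an existing job script is affected, a check along the following lines may help. This is only a minimal sketch: the helper name `find_affected_lines` and the example job script are hypothetical, and the check only looks at `#SBATCH` directives inside the script, not at options passed on the `sbatch` or `srun` command line.

```python
import re

# Matches the affected option in an #SBATCH directive, e.g.
# "#SBATCH --gpus-per-task=1" or "#SBATCH --gpus-per-task 1".
# It does not match --gpus or --gpus-per-node.
AFFECTED = re.compile(r"^\s*#SBATCH\b.*--gpus-per-task\b")


def find_affected_lines(script_text):
    """Return 1-based line numbers of #SBATCH lines that use --gpus-per-task."""
    return [
        number
        for number, line in enumerate(script_text.splitlines(), start=1)
        if AFFECTED.search(line)
    ]


# Hypothetical job script, used only to illustrate the check.
example_script = """#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1
srun ./my_program
"""

print(find_affected_lines(example_script))
```

Any line reported by the check is a candidate for switching to ‘--gpus’ or ‘--gpus-per-node’. The right replacement value depends on how many tasks and nodes the job uses.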

Downtime Betzy, 6th December 08:00 until 9th December 08:00

There will be a scheduled downtime for Betzy lasting three days starting on Monday 6th December at 08:00. Downtime will last until Thursday 9th, 08:00.

During the downtime we will conduct:

  • Full upgrade of the Lustre filesystem (both servers and clients)
  • Full upgrade of the InfiniBand firmware
  • Full upgrade of the Mellanox InfiniBand drivers
  • Minor updates to other parts of the system (Slurm, configs, etc.)

Please be aware that this also affects the storage services recently moved from NIRD to Betzy.

We apologize for the inconvenience.

[FINISHED] Saga downtime 17th November 12:00 - 19th November 15:00

[Update: 2021-11-19 09:50] The maintenance work is now done, and Saga is back in full production and running jobs as normal.

[Update, 2021-11-18 12:40] Login nodes are ready for users, who can now access and work with their data. Compute nodes are still under maintenance, so running jobs is not yet possible.

[Update, 2021-11-17 12:00]: The maintenance has now started.

We will conduct firmware updates and maintenance on all of Saga next week, starting on Wednesday 17th at 12:00.

Downtime will last until 15:00 on Friday 19th, but we will bring back access to the login nodes and file system as soon as the upgrade is done on vital parts of the system. Compute nodes will be brought back sequentially as they are updated.

Sigma2 router upgrade in Tromsø

Uninett will conduct a firmware upgrade of one of the routers in Tromsø this Friday, 5th November, between 11:00 and 15:00. This will not affect internal networks on Fram, NIRD or NIRD Toolkit, or any production on the systems, but external network connections may briefly disconnect or stall.

If the upgrade is successful, the other router will be upgraded next week.

[Updated] Batch system issue on Betzy

There is currently an issue on Betzy with the batch system which results in jobs not completing and new jobs not being started.

We are currently investigating the issue and will update once we know what caused it and how it can be resolved.

[Update 14:22]: Job submission is working again. The users who experienced this were unfortunately victims of a batch system restart which happened at the same time as their jobs were submitted.

[Resolved] UiB MATLAB License server is down

Update 2021-05-10: The UiB MATLAB license server is now up and running again.

Dear users,
We have a problem with the UiB MATLAB license server: it is unstable and crashes from time to time. Users running MATLAB on any of the clusters may have problems contacting the UiB MATLAB license server.

We are working on this issue and will keep you updated.

We apologise for any inconvenience caused.

Best Regards