Update 10:49: Issue is now resolved and NIRD Toolkit packages are again available for deployment.
Dear NIRD Toolkit Users,
Due to ongoing configuration changes, the NIRD Toolkit packages are currently unavailable. Services that are already running are not affected.
We are working on completing a repository migration.
Apologies for the inconvenience. We will keep you updated.
[UPDATE, 2021-12-15 15:00] Betzy is back in production again.
[UPDATE, 2021-12-15 09:00] The downtime has now started.
There will be a short downtime for Betzy next Wednesday 15th from 09:00 until 15:00 to fix remaining hardware issues.
We regret to inform you that the downtime for Betzy has been extended because a failed hardware component was not delivered in time. We hope to have the component delivered on Thursday and to have the system online again in the afternoon of Thursday 9th.
The downtime has started and will continue until the evening of Wednesday 8th December or until the upgrades are done.
There will be a scheduled downtime for Betzy lasting three days starting on Monday 6th December at 08:00. Downtime will last until Thursday 9th, 20:00.
During the downtime we will conduct:
- Full upgrade of the Lustre filesystem (both servers and clients)
- Full upgrade of the InfiniBand firmware
- Full upgrade of the Mellanox InfiniBand drivers
- Minor updates to other parts of the system (Slurm, configs, etc.)
Please be aware that this also affects the storage services recently moved from NIRD to Betzy.
We apologize for the inconvenience.
Update 08.12.2021 18:00: The Betzy downtime is over, and the system is open for users. All planned updates have been performed.
[Update, 2021-11-24 14:15] Now the NIRD mounts are working again.
[Update, 2021-11-24 13:30] We are back in production and jobs are running as normal again. The NIRD mounts are missing on two of the login nodes, but we are working on fixing that.
[Update, 2021-11-24 12:00] The maintenance has started now.
There will be a short one-hour downtime for Fram on 24th November, starting at 12:00.
During the downtime we will update the firmware on the InfiniBand interconnect switches.
[Update: 2021-11-19 09:50] The maintenance work is now done, and Saga is back in full production and running jobs as normal.
[Update, 2021-11-18 12:40] The login nodes are ready for users, who can now access their data and work with it. The compute nodes are still under maintenance, so running jobs is not yet possible.
[Update, 2021-11-17 12:00]: The maintenance has now started.
We will conduct firmware updates and maintenance on all of Saga next week, starting on Wednesday 17th at 12:00.
The downtime will last until 15:00 on Friday 19th, but we will bring back access to the login nodes and file system as soon as the upgrade is done on vital parts of the system. Compute nodes will be brought back sequentially as they are updated.
[UPDATE, 2021-06-08 08:00] Betzy is now up and in production again.
[UPDATE] Unfortunately, the downtime is taking longer than anticipated, and will not be finished tonight. We plan on getting Betzy up again at around 08:00 tomorrow morning.
Campusservice at NTNU will conduct maintenance on the high-voltage circuits for non-redundant power on 7th of June 2021, between 15:00 and 20:00. All compute nodes and login nodes will be shut down during this time, and no jobs will run during this period. Submitted jobs that are estimated to run into the downtime reservation will be held in the queue.
Update: The file system servers have now been fixed, and we are back online again. Thank you for your patience.
We have an ongoing performance issue with the Fram filesystem. We need to shut down the file servers to get this fixed, and therefore need three hours of downtime:
Fram will be unavailable on Wednesday 20th January between 12:00 and 15:00.
Dear Fram users,
We have a problem with the Fram compute nodes: about 870 nodes are down for an unknown reason. We are working on the issue and will keep you updated.
Update 2020-12-22, 20:05: Most of the compute nodes have now been brought back online. There are still a few nodes that need more checking before being made available for jobs.
Update 2020-12-22, 18:04: The cooling system has been stable for the last hour after making some adjustments together with the vendor. We are slowly bringing up the nodes.
Update 2020-12-22, 16:01: In order to keep the cooling as stable as possible, we have decided to take down all high memory nodes. This way we can keep some of the normal compute nodes up for the time being. We are also working together with the vendor to make adjustments on the cooling system to ensure continued stability.
We are very sorry about the inconvenience.
Update 2020-12-22, 13:41: We have identified the cause to be the cooling system and are working on mitigating the issues. Most of the compute nodes must remain down while doing so, unfortunately.
Update 2020-12-24 10:30: The compute nodes shut down again due to electrical problems in the machine room. According to the machine room service department, the problem has been resolved, and we are working on bringing all nodes back up.
Update 2020-12-24 12:10: Most of the compute nodes on Fram are back online.