[UPDATE, 2021-06-08 08:00] Betzy is now up and in production again.
[UPDATE] Unfortunately, the downtime is taking longer than anticipated, and will not be finished tonight. We plan on getting Betzy up again at around 08:00 tomorrow morning.
Campusservice at NTNU will conduct maintenance on the high-voltage circuits for non-redundant power on 7th June 2021, between 15:00 and 20:00. All compute nodes and login nodes will be shut down during this period, and no jobs will run. Submitted jobs that are estimated to run into the downtime reservation will be held in the queue.
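To illustrate what "estimated to run into the downtime reservation" means, here is a minimal sketch. It is not the scheduler's actual logic: the function name `runs_into_downtime` is hypothetical, and it assumes the estimate is simply the job's start time plus its requested wall time compared against the announced window.

```python
from datetime import datetime, timedelta

# Maintenance window taken from the announcement above.
DOWNTIME_START = datetime(2021, 6, 7, 15, 0)
DOWNTIME_END = datetime(2021, 6, 7, 20, 0)


def runs_into_downtime(start: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at `start` with the requested
    `walltime` would still be running when the downtime begins.

    Hypothetical helper for illustration only.
    """
    if start >= DOWNTIME_END:
        # The job starts after the maintenance window has ended.
        return False
    # The job overlaps the window if it ends after the window starts.
    return start + walltime > DOWNTIME_START


if __name__ == "__main__":
    noon = datetime(2021, 6, 7, 12, 0)
    print(runs_into_downtime(noon, timedelta(hours=2)))  # ends 14:00
    print(runs_into_downtime(noon, timedelta(hours=4)))  # ends 16:00
```

Under this assumption, requesting a shorter wall time is what lets a job start before the downtime instead of waiting in the queue.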
Today, Saga has been extended with 120 new compute nodes, increasing the total number of CPUs on the cluster from 9824 to 16064.
The new nodes have been added to the normal partition. They are identical to the old compute nodes in the partition, except that they have 52 CPU cores instead of 40.
We hope this extension will reduce the wait time for normal jobs on Saga.
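The CPU totals above can be checked with a few lines of arithmetic. The variable names are only for illustration; the numbers are those given in the announcement.

```python
# Figures from the announcement above.
new_nodes = 120
cores_per_new_node = 52
cpus_before = 9824

added = new_nodes * cores_per_new_node  # CPUs contributed by the new nodes
cpus_after = cpus_before + added        # new cluster-wide total

print(added, cpus_after)  # prints "6240 16064"
```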
We have identified a hardware issue with login-2.betzy.sigma2.com.
This means there is reduced capacity on the Betzy login nodes until further notice.
This will probably not be fixed until after Easter.
Dear Fram users,
We have a problem with the Fram compute nodes: about 870 nodes are down for an unknown reason. We are working on the issue and will keep you updated.
Update 2020-12-22, 20:05: Most of the compute nodes have now been brought back online. There are still a few nodes that need more checking before being made available for jobs.
Update 2020-12-22, 18:04: The cooling system has been stable for the last hour after making some adjustments together with the vendor. We are slowly bringing up the nodes.
Update 2020-12-22, 16:01: In order to keep the cooling as stable as possible, we have decided to take down all high memory nodes. This way we can keep some of the normal compute nodes up for the time being. We are also working together with the vendor to make adjustments on the cooling system to ensure continued stability.
We are very sorry about the inconvenience.
Update 2020-12-22, 13:41: We have identified the cause to be the cooling system and are working on mitigating the issues. Most of the compute nodes must remain down while doing so, unfortunately.
Update 2020-12-24 10:30: The compute nodes shut down again due to electrical problems in the machine room. According to the machine room service department, the problem has been resolved, and we are working on bringing all nodes back up.
Update 2020-12-24 12:10: Most of the compute nodes on Fram are back online.
As previously announced, Saga will be down in the coming week, from 7th December 08:00 until 11th December 16:00.
The downtime is allocated for expanding the storage. When we come back, we will have approximately 4 petabytes in addition to the existing 1 petabyte.
Update: Saga is back online and running jobs again. The new storage is not online yet, but all the hardware has been mounted.
As of today, Wednesday 4th November at 08:00, Fram is down for maintenance. We will do the same exercise as on NIRD-TOS, namely replace all internal cables on the storage system.
17:20 NIRD-TOS and services are now up.
Dear NIRD and NIRD Toolkit users: NIRD-TOS is currently down and will remain unavailable until Wednesday 12:00. We are replacing all cables during the next couple of days.
Note that NIRD-home is also not available during that time.
All remote mounts on Fram, Saga and Betzy using NIRD-TOS will be unavailable until the downtime is over.
We will have downtime the following week to try again to replace all internal cables in NIRD-TOS and Fram storage systems.
NIRD-TOS (including the Toolkit) will be down from Monday 2nd November 08:00 until Wednesday 4th November 12:00.
Fram will be down from Wednesday 4th November 08:00 until Friday 6th November 12:00.
There is still a chance that the downtime will not happen, but proper notification will be given in the opslog. Unfortunately the current situation with Covid-19 makes it difficult to make detailed plans.
We apologize for any inconvenience.
The downtime for NIRD-TOS from 26th October until 29th October is cancelled, and the downtime for Fram from 28th October until 29th October is cancelled.
New dates for the downtime will be announced on Monday 26th or Tuesday 27th.
During the downtime we will replace all internal cables between disk controllers and disk enclosures. The firmware upgrade two weeks ago helped a lot, but we are still seeing communication errors, so the decision is to remove all cables and replace them.
- 08.10.2020: After extensive testing, the vendor found that stability issues are unfortunately still present. The problem has been escalated and is under investigation. We will get back to you with more information as soon as we get an update from the vendor.
- 30.09.2020: The vendor will carry out firmware updates on Betzy today. As a consequence, we need to stop running jobs and run tests to make sure the system is stable.
Access to the machine will be reopened as soon as we are ready with the tests. Please follow the progress here, on OpsLog.
- 25.09.2020: We are temporarily reopening the access over the weekend in order to allow further testing on the machine.
Further work is expected to be done by the vendor sometime next week. As a consequence, jobs will be terminated again and access closed while maintenance is ongoing.
Dear Betzy pilots,
We are pleased to announce that despite logistical challenges caused by Covid-19, most of the outstanding issues have been sorted out. This unusual situation required a more dynamic approach from everyone involved, and put pressure on communication due to uncertainties and rapidly changing circumstances. Because of this, setting and advertising a production date proved to be difficult.
We can now start aiming for setting Betzy into production in the beginning of October. Before we can conclude, and proceed with the preparations, we need to re-run several comprehensive tests.
Therefore, we will have to stop all jobs and close access to Betzy starting from tomorrow, 17 September 2020, at 10 AM. Access to Betzy will be re-established as soon as all the tests are completed. Please be prepared for more extensive maintenance this time, which might require up to two and a half weeks.
The file system on Betzy is not going to be reformatted. That is, your data will not be removed intentionally. However, we cannot guarantee data integrity until backups are taken and the machine is placed into production. Therefore, we strongly advise you to take a backup of your important data.
Apologies for the short notice and the inconvenience this is causing you.
Lorand Szentannai, on behalf of the preparations team