We need to schedule a downtime on relatively short notice to upgrade the firmware on both the Fram and NIRD (including NIRD Toolkit) storage systems.
This is a critical and mandatory update which will improve the stability, performance and reliability of our systems.
The downtime is expected to last no more than a working day.
Fram jobs that cannot finish by the 12th of February will remain queued and will not start until the maintenance is finished.
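The rule above amounts to a simple time comparison: a job may start only if its requested run time ends before the maintenance begins. The following is a minimal sketch of that check, not the scheduler's actual logic. The function name, the year and the 08:00 start hour are assumptions made for illustration; the announcement gives only the date.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now, walltime, maintenance_start):
    """Return True if a job started at `now` with the requested
    wall-time limit would end before the maintenance begins."""
    return now + walltime <= maintenance_start


# Hypothetical values: the year and the start hour are assumptions.
maintenance_start = datetime(2019, 2, 12, 8, 0)
now = datetime(2019, 2, 10, 12, 0)

short_job = can_start_before_maintenance(now, timedelta(hours=24), maintenance_start)
long_job = can_start_before_maintenance(now, timedelta(days=3), maintenance_start)

print(short_job)  # True: the job would end on 11 February at 12:00
print(long_job)   # False: the job would still be running on 12 February
```

In practice this means that requesting a shorter wall-time limit may allow a job to start before the downtime instead of waiting in the queue.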
Thank you for your understanding!
Update 25.01.2019 11:20: The original recovery process was terminated because of a limit on filesystem space. To make the project data accessible, we have put the following steps in place:
- Project users can access the NS2345K project snapshot in read-only mode to reach the data they need. Snapshot location:
- Project users will also get access to a space created for project NS2345K on the Fram side, namely /cluster/NS2345K. This is a temporary space where project users can work and store new data. It is backed up nightly. This space is also accessible from NIRD (at the moment only from login4.nird.sigma2.no) at:
- In the meantime, we are working on recovering the NS2345K project space.
We will keep you updated.
Update 15:25: The recovery process is still scanning through the missing inodes from the snapshots and dispatching the operations to the relevant nodes. Once this is done, files will start to be recovered.
Update 11:50: We have locked the project to avoid interference with the restoration process.
We have lost some of the data in project NS2345K. We are in the process of recovering the lost data, and the affected files will hopefully be restored gradually.
- 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on a new firmware for our systems.
- 2019-01-07 11:30: The disk system on Fram will hopefully be back to normal soon, but further disks in the NIRD filesystem failed during the weekend and need to be replaced.
- 2019-01-04 15:29: The RAID sets are now rebuilding and we expect them to be finished within 24 hours.
We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to the disk losses, performance is also slightly degraded at this point.
To mitigate these issues we will have to reseat IO modules on the controllers, and this might cause IO to hang. We will keep you updated.
The login nodes are behind a load balancer which resets all connections every 4 hours. We are working on changes that will make connections more stable over time.
If you have problems logging in to NIRD, remove the login.nird entry from your authorized_keys file.
Update 16-11-18 9:00: NIRD and the Service Platform are up again.
Due to disk failures on NIRD, we have to shut down NIRD and the Service Platform immediately to avoid losing user data.
Sorry for the inconvenience.
- 2018-11-23 09:45:
- We plan to re-open access to Fram later today. We will keep you updated.
- We are currently running benchmarks and tests on the upgraded system.
- The cooling units are now being stress-tested to pinpoint any outstanding issues.
- OpenMPI has now been upgraded.
- 2018-11-22 22:48:
- The Lustre servers on Fram have been upgraded, and several tests and benchmarks were run to fine-tune parameters.
- The first step of the CDU maintenance has now been carried out.
- 2018-11-22 11:33:
- NIRD Service Platform is up again.
- Access to NIRD has been reopened. Please note that we now have four login nodes and that the SSH fingerprint has changed.
- 2018-11-21 18:41:
- The necessary hardware replacement for NIRD was carried out, and all firmware upgrades have been completed on the storage systems at both the Tromsø and Trondheim sites.
- The NIRD file systems have now been restarted, and we plan to reopen access tomorrow before noon.
- The firmware on the Fram storage system has now been updated.
- Several other updates, including the OpenFabrics stack and Lustre, were done in parallel.
- 2018-11-21 08:00: Maintenance has started.
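The 2018-11-22 11:33 update above notes that the SSH fingerprint changed. SSH clients typically refuse to connect while a stale host key is still stored in `~/.ssh/known_hosts`; the standard way to remove it is `ssh-keygen -R <hostname>`. The sketch below only illustrates what that clean-up does to the file contents. It assumes un-hashed entries, the hostname `login.nird.sigma2.no` is an assumption, and the key strings are placeholders.

```python
def drop_host_entries(known_hosts_text, host):
    """Return known_hosts content without the entries that name `host`.

    Assumes un-hashed entries, where the host field is a
    comma-separated list of names.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#"):
            # Skip an optional marker such as @cert-authority or @revoked.
            if fields[0].startswith("@") and len(fields) > 1:
                host_field = fields[1]
            else:
                host_field = fields[0]
            if host in host_field.split(","):
                continue  # stale entry for this host: drop it
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Placeholder entries; the hostnames and keys are illustrative only.
sample = (
    "login.nird.sigma2.no ssh-ed25519 AAAA...old\n"
    "other.example.org ssh-ed25519 AAAA...other\n"
)
cleaned = drop_host_entries(sample, "login.nird.sigma2.no")
print(cleaned)
```

After the stale entry is gone, the next connection prompts for the new fingerprint, which should be verified before it is accepted.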
Dear Fram, NIRD and Service Platform User,
On the 21st and 22nd of November we will have a scheduled maintenance on Fram, NIRD and Service Platform.
This will be a comprehensive maintenance on the national HPC and research data infrastructure, carried out on multiple levels and sites. Due to its complexity and the amount of work involved, some parts of the infrastructure might require the downtime to be extended into the 23rd of November as well.
The work will include, but will not be limited to:
– firmware upgrades on disks, enclosures, chassis, etc.
– operating system upgrades
– queue system upgrade
– file system upgrades
– kernel upgrades
– upgrade of OpenMPI
– upgrades on the OpenFabrics stack
– maintenance on the cooling system units
Our aim is to enhance the stability and security of the infrastructure, eliminate bugs and enhance performance, while having the shortest downtime possible.
We understand that system unavailability has a big impact on your daily work, and we will therefore try to bring our systems back into operation as soon as possible.
Thank you for your consideration!
2018-11-12 11:55: The login node and services are back in production.
2018-11-12 10:20: The disk pool RAID sets had been rebuilt by Saturday, but a set of drives failed once again. A new rebuild was under way, and today we had to reset the IO card and power-cycle the storage. At this point everything on the storage side is up and functional, and the file system is up. We are currently switching geo-replication back and expect to reopen access around 12:00 PM today. We will keep you posted.
2018-11-09 13:59: The firmware has now been applied without any problems. However, we still need to wait for a rebuild to finish. The rebuild is estimated to take another 12 hours. We will open the system for regular use as soon as we can.
2018-11-09 12:45: Most of the rebuilds are complete and we are currently patching the firmware on the disk enclosures. If all goes well, we expect to have NIRD up and functional during the day today. We will keep you updated.
2018-11-08 13:27: The firmware update is running. We have to wait for the rebuild of the broken drives before we can upgrade the enclosures and finish the emergency maintenance. We don’t expect the rebuild to be finished before tomorrow (Friday, November 9th). Hence the system as a whole will not be available before tomorrow.
We are very sorry for any inconvenience this may cause.
- 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
- 2018-11-07 14:53: User home directories were migrated over to /cluster/home and Fram is starting up again. We will soon re-open access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.
Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data. This means stopping all jobs and user processes, and logging users out of the systems.
We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today. (Note that the NIRD project areas will _not_ be available until NIRD is up again.)
We will update this post with more information when we know more.
The NIRD Service Platform will undergo maintenance on Wednesday 17 October between 9 am and 5 pm.
Short downtimes of the services running on the platform can be expected during that day.
Sorry for the inconvenience.
We have put in place a second NIRD login node.
This node is accessible at
Report problems to firstname.lastname@example.org.