Missing mounts for Project NS2345K on NIRD
The NFS server exporting /tos-project1/NS2345K/FRAM to NIRD/FRAM crashed yesterday around 16:00. We have recovered the NFS server and /tos-project1/NS2345K/FRAM is re-exported again. We are currently working on mounting the filesystem in the login containers and are meanwhile investigating the cause. We are sorry for the inconvenience caused.
- 2019-04-03 13:25: NIRD and the service platform are back into production.
- 2019-04-03 10:59: Maintenance work has finished. We are now bringing the filesystems and services back up.
- 2019-04-03 08:22: Disk expansion and rebalancing are finished. HW checks are currently ongoing and should finish in a couple of hours. Will keep you posted.
- 2019-04-02 09:55: NIRD filesystems are unmounted from Fram, and replicated data is available read-only through login-trd.nird.sigma2.no
- 2019-04-02 08:06: Maintenance work has started.
Dear NIRD User,
NIRD and the Service Platform will be under maintenance to expand the disk capacity in Tromsø.
The operations for storage expansion and disk pool rebalancing will start on the 2nd of April at 8:00 am CET and will last at most two days. During the maintenance, the services running on the NIRD Service Platform and on the NIRD Toolkit will not be available.
During the downtime we plan to make project data mirrored to Trondheim available in read-only mode through a specially built login node. This solution will be tested with real load for the first time during this downtime, so we might encounter some technical difficulties.
To access the remote, mirrored data, please log in to login-trd.nird.sigma2.no.
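As a sketch, and assuming your regular NIRD username (the username below is a placeholder), reaching the mirrored data is a standard SSH session against the login node named above:

```shell
# Log in to the temporary read-only login node for mirrored data.
# Replace "myuser" with your own NIRD username.
ssh myuser@login-trd.nird.sigma2.no
```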
We apologise for the inconvenience.
Servers running the Service Platform and the NIRD login nodes have some issues with the remote filesystems.
The problem is already identified and being taken care of, but you might experience short hiccups until the problem is fixed on all the affected nodes.
You will have to re-login to NIRD login nodes.
We expect to be ready within one hour at most.
Dear NIRD and NIRD Toolkit User,
After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still rebuilding and should be finished in a few hours. Until then, you might encounter a slight performance loss.
We will bring the Service Platform back up during the day today.
Thank you for your understanding and patience!
- 2019-02-26 08:57: Fram parts have arrived and were installed yesterday. The vendor will start rebuilding the disk pools on Fram today.
- 2019-02-22 09:10: Access to NIRD is reopened. We will bring the Service Platform back up during the day today. Some disk pools are still rebuilding and should be finished in a few hours, so until then you might encounter a slight performance loss.
- 2019-02-21 15:40: Most of the disk rebuilds on NIRD are finished. The NIRD file system is started, and we will proceed with opening access to NIRD as soon as the remaining network issues are sorted out on the backbone.
Due to logistics issues, the Fram parts are held back at customs. The vendor has sent a second batch of parts through another logistics company, and the ETA is Monday morning.
We will post a message when access to NIRD is re-opened.
- 2019-02-19 14:10: Disk rebuilds on NIRD have reached 40%. Current ETA for NIRD is Thursday morning.
We are still waiting for the Fram parts to arrive to Tromsø.
- 2019-02-18 16:36: Some disk pools must be rebuilt once again for NIRD, thus delaying the opening once more. We are terribly sorry for this. Will continue updating the log as soon as new information is available.
- 2019-02-18 10:10:
NIRD storage is stabilized now and the vendor will do a new attempt of taking the system back online during today. At this stage it is still uncertain when Fram can be put back into production.
- 2019-02-15 11:18: We are still experiencing problems with the storage system on Fram. Disks began to mass-fail once again after the system had seemed stable during the night. We are depending on the vendor to resolve these issues and are working closely with them.
Based on the new instability, we cannot give an estimate for when the system will be ready for general use again. This is an unfortunate situation; we understand the impact on you, and we are thus trying all possible solutions to keep your data safe and bring the system back up as soon as possible. The OpsLog will be updated with new information when the status of the situation changes.
- 2019-02-14 13:17: Due to missing parts, and the size of the storage, disk recovery is progressing slowly, at approximately 50% reduced performance. Current ETAs are:
- Fram: 15.02.2019
- NIRD: 19.02.2019
- Service Platform: 19.02.2019
- 2019-02-13 19:07: Communication with the missing storage enclosures was re-established, and the disk pools are rebuilding at this time. Unfortunately we cannot reopen the machines until the disk pools are stabilized. We will have a new round of checks and risk analysis tomorrow morning. Will keep you updated here.
- 2019-02-13 11:33: Some of the parts have arrived at the datacenter, and we are working with the vendor on replacing hardware and patching the firmware on Fram. More details to follow as we know more.
- 2019-02-12 15:38: The NIRD Tromsø and Fram storage systems each have one failed disk enclosure. We are waiting for replacement parts to arrive. After the replacement we will have to rebuild the disk pools before re-opening the machines for production. The current estimate is tomorrow evening. Will keep you updated.
- 2019-02-12 12:36: Firmware upgrade on NIRD is finished. We are proceeding to start back NIRD services. Will keep you posted.
- 2019-02-12 08:17: Maintenance has started.
- 2019-02-11 13:20: Due to the disk problems accelerating during the weekend, we have now changed the maintenance stop reservation so that no new jobs will start until the maintenance is done. Already running jobs will not be affected. This has been done to reduce the risk of data loss.
We need to have a scheduled downtime on a relatively short notice in order to upgrade the firmware on both Fram and NIRD (including NIRD Toolkit) storages.
This is a critical and mandatory update which will increase stability, performance and reliability of our systems.
The downtime is expected to last no more than a working day.
Fram jobs which cannot finish by the 12th of February are held in the queue and will not start until the maintenance is finished.
Thank you for your understanding!
Update 25.01.2019 11:20: The original recovery process was terminated due to a limit on filesystem space. To make the project data accessible, we have taken the following steps:
- Project users are able to get access to the NS2345K project snapshot in read-only mode so that they can get to the necessary data. Snapshot location:
- The project users will also get access to a space created for project NS2345K on the Fram side, namely /cluster/NS2345K. This is a temporary space where project users can work and store new data. It has a nightly backup and is also accessible from NIRD (at the moment only from login4.nird.sigma2.no) at:
- In the meantime, we are working on recovering NS2345K project space.
We will keep you updated.
Update 15:25: The recovery process is still scanning through the missing inodes from the snapshots and dispatching the operations to the related nodes. When this is done, files will start to be recovered.
Update 11:50: We have locked the project to avoid interference with the restoration process.
We have lost some of the data in project NS2345K. We are in the process of recovering the lost data, affected files will hopefully be gradually recovered.
- 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on a new firmware for our systems.
- 2019-01-07 11:30: The disk system on Fram is hopefully back to normal soon, but further disks in the NIRD filesystem failed during the weekend, and need to be replaced.
- 2019-01-04 15:29: The RAID sets are now rebuilding, and we expect them to finish within 24 hours.
We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to the disk losses, performance is also slightly degraded at this point.
To mitigate these issues we will have to reseat the I/O modules on the controllers, which might cause I/O hangs. Will keep you updated.
The login nodes are behind a load balancer which resets all connections every 4 hours. We are working on implementing changes which will make connections more stable over time.
If you have problems logging in to NIRD, remove the login.nird entry from your known_hosts file.
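One quick way to drop a stale host-key entry is `ssh-keygen -R`; a minimal sketch, assuming the default known_hosts location and the login hostname mentioned in the announcements above:

```shell
# Ensure the default known_hosts file exists, then remove any cached
# host key for the NIRD login address (adjust the path if yours differs).
mkdir -p ~/.ssh && touch ~/.ssh/known_hosts
ssh-keygen -R login.nird.sigma2.no -f ~/.ssh/known_hosts
```

On the next login, SSH will prompt you to accept the new host key.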
Update 16-11-18 09:00: NIRD and the Service Platform are up again.
Due to disk failures on NIRD, we have to shut down NIRD and the Service Platform immediately to avoid losing user data.
Sorry for the inconvenience.
- 2018-11-23 09:45:
- We plan to re-open access to Fram during today. Will keep you updated.
- We are currently running benchmarks and tests on the upgraded system.
- Cooling units are stress-tested now to pinpoint any outstanding issues.
- OpenMPI is upgraded now.
- 2018-11-22 22:48:
- The Lustre servers have been upgraded on Fram, and several tests and benchmarks were run to fine-tune parameters.
- First step of the CDU maintenance is carried out now.
- 2018-11-22 11:33:
- NIRD Service Platform is up again.
- Access to NIRD is reopened. Please note that we now have four login nodes and that the SSH fingerprints have changed.
- 2018-11-21 18:41:
- The needed hardware replacement for NIRD was carried out, and all firmware upgrades are finalized on the storage systems at both the Tromsø and Trondheim sites.
- The NIRD file systems are now started again, and we plan to reopen access tomorrow before noon.
- Firmware is updated now on the Fram storage system.
- Several other updates, including the OpenFabrics stack and Lustre, were done in parallel.
- 2018-11-21 08:00: Maintenance has started.
Dear Fram, NIRD and Service Platform User,
On the 21st and 22nd of November we will have a scheduled maintenance on Fram, NIRD and Service Platform.
This will be a comprehensive maintenance on the national HPC and research data infrastructure, ongoing on multiple levels and sites. Due to its complexity and the amount of work involved, some parts of the infrastructure might require the downtime to be extended into the 23rd of November as well.
The work will include, but will not be limited to:
– firmware upgrades on disks, enclosures, chassis, etc.
– operative system upgrades
– queue system upgrade
– file system upgrades
– kernel upgrades
– upgrade of OpenMPI
– upgrades on the OpenFabrics stack
– maintenance on the cooling system units
Our aim is to enhance the stability and security of the infrastructure, eliminate bugs and enhance performance, while having the shortest downtime possible.
We understand that system unavailability has a big impact on your daily work, and as such we try to bring our systems back into service as soon as possible.
Thank you for your consideration!