Missing mounts for Project NS2345K on NIRD

The NFS server that exports /tos-project1/NS2345K/FRAM to NIRD/FRAM crashed yesterday around 16:00. We have recovered the NFS server and /tos-project1/NS2345K/FRAM is exported again. We are currently working on mounting the filesystem in the login containers while investigating the cause. We apologise for the inconvenience.

Scheduled downtime – NIRD storage expansion – 2nd of April

Update:

  • 2019-04-03 13:25: NIRD and the Service Platform are back in production.
  • 2019-04-03 10:59: Maintenance work has finished. We are now restarting the filesystems and services.
  • 2019-04-03 08:22: Disk expansion and rebalancing are finished. HW checks are currently ongoing and should finish in a couple of hours. We will keep you posted.
  • 2019-04-02 09:55: NIRD filesystems are unmounted from Fram and replicated data is available read-only through login-trd.nird.sigma2.no.
  • 2019-04-02 08:06: Maintenance work has started.

Dear NIRD User,

NIRD and the Service Platform will be under maintenance to expand the disk capacity in Tromsø.

The storage expansion and disk pool rebalancing will start on the 2nd of April at 8:00 am CET and will last at most 2 days. During the maintenance, the services running on the NIRD Service Platform and on the NIRD Toolkit will not be available.

During the downtime we plan to make project data mirrored to Trondheim available in read-only mode through a specially built login node. This is the first time the solution will be tested under real load, so we might encounter some technical difficulties.
To access the remote, mirrored data, please log in to login-trd.nird.sigma2.no.

We apologise for the inconvenience.
Metacenter Operations

MDS crash 13.03.2019

The main MDS server for the Lustre filesystem crashed between 14:00 and 14:30. The secondary MDS server took over and restored the filesystem around 14:40. Some of the jobs running on Fram might be affected. We are investigating the root cause of the incident.

Short outage on Service Platform

The servers running the Service Platform and the NIRD login nodes have some issues with the remote filesystems.
The problem has been identified and is being fixed, but you might experience short hiccups until the fix is applied on all the affected nodes.
You will have to log in again to the NIRD login nodes.

We expect to be ready within one hour at most.

NIRD available again

Dear NIRD and NIRD Toolkit User,

After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still being rebuilt, which should finish in a few hours. Until then, you might encounter a slight performance loss.

We will bring the Service Platform back up during the day today.

Thank you for your understanding and patience!
Metacenter Operations

DATA LOSS in NS2345K

Update 25.01.2019 11:20: The original recovery process was terminated due to a limit on filesystem space. To make project data accessible, we have put the following steps and procedures in place:

  1. Project users can access the NS2345K project snapshot in read-only mode to retrieve the data they need. Snapshot location:
    /projects/NS2345K/.snapshots/Tuesday-15-Jan-2019
  2. The project users will also get access to a space created for project NS2345K on the Fram side, namely /cluster/NS2345K. This is a temporary space where project users can work and store new data. This space is backed up nightly. It is also accessible from NIRD (at the moment only from login4.nird.sigma2.no) at:
    /projects/NS2345K/FRAM
  3. In the meantime, we are working on recovering NS2345K project space.
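
As an illustration of steps 1 and 2, a minimal sketch of copying one directory from the read-only snapshot into the temporary space might look like the following. The function name restore_from_snapshot and its subdirectory argument are hypothetical; the two default paths are the ones listed above, and the SNAPSHOT and WORKSPACE variables are assumptions introduced only so the paths can be overridden.

```shell
#!/bin/sh
# Hypothetical helper, not an official tool: copy one subdirectory out of
# the read-only snapshot into the writable temporary project space.
# Defaults are the paths from the announcement; both can be overridden.
SNAPSHOT="${SNAPSHOT:-/projects/NS2345K/.snapshots/Tuesday-15-Jan-2019}"
WORKSPACE="${WORKSPACE:-/projects/NS2345K/FRAM}"

restore_from_snapshot() {
    subdir="$1"
    if [ -z "$subdir" ]; then
        echo "usage: restore_from_snapshot <subdir>" >&2
        return 2
    fi
    if [ ! -d "$SNAPSHOT/$subdir" ]; then
        echo "not found in snapshot: $SNAPSHOT/$subdir" >&2
        return 1
    fi
    mkdir -p "$WORKSPACE/$subdir" || return 1
    # -a keeps permissions and timestamps where possible; the trailing
    # "/." copies the directory contents, including hidden files.
    cp -a "$SNAPSHOT/$subdir/." "$WORKSPACE/$subdir/"
}
```

After sourcing the file, a call such as `restore_from_snapshot my_experiment` (where my_experiment is a placeholder directory name) would copy that directory into the temporary space. The snapshot itself is never modified, since it is only read from.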

We will keep you updated.

Update 15:25: The recovery process is still scanning the snapshots for the missing inodes and dispatching the operations to the relevant nodes. Once this is done, files will start to be recovered.

Update 11:50: We have locked the project to avoid interference with the restoration process.

We have lost some of the data in project NS2345K. We are in the process of recovering the lost data, and the affected files will hopefully be recovered gradually.

HW problems on Fram and NIRD storages

Updates:

  • 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on a new firmware for our systems.
  • 2019-01-07 11:30: The disk system on Fram should be back to normal soon, but further disks in the NIRD filesystem failed during the weekend and need to be replaced.
  • 2019-01-04 15:29: Raidsets are now rebuilding and we expect them to be finished within 24 hours.

We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to disk losses, performance is also slightly degraded at this point.

To mitigate these issues we will have to reseat I/O modules on the controllers, which might cause I/O hangs. We will keep you updated.

HW failures on Fram storage

Update 2018-12-21: The hardware has been replaced and the /cluster file system should be 100% functional again.

Some hardware on the Fram storage is failing and needs urgent replacement. To do this, we have to fail over the disks served by two Lustre servers to other Lustre server nodes.
Some slowdown and short I/O hangs might be encountered on the /cluster file system during the maintenance.

We apologise for the inconvenience.