DATA LOSS in NS2345K

Update 25.01.2019 11:20: The original recovery process was terminated due to a limit on filesystem space. To make project data accessible, we have taken the following steps:

  1. Project users can access the NS2345K project snapshot in read-only mode to retrieve the necessary data. Snapshot location:
    /projects/NS2345K/.snapshots/Tuesday-15-Jan-2019
  2. The project users will also get access to a space created for project NS2345K on the Fram side, namely /cluster/NS2345K. This is a temporary space where project users can work and store new data (see the copy sketch after this list); it has a nightly backup. It is also accessible from NIRD (at the moment only from login4.nird.sigma2.no) at:
    /projects/NS2345K/FRAM
  3. In the meantime, we are working on recovering the NS2345K project space.
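
As a minimal illustration of steps 1 and 2, the Python sketch below copies one subdirectory from the read-only snapshot into the temporary space on Fram; the subdirectory name "my_dataset" is hypothetical and should be replaced with the data you actually need.

    #!/usr/bin/env python3
    """Sketch: copy a directory out of the read-only NS2345K snapshot into the
    temporary work space. Not an official recovery tool."""
    import shutil
    from pathlib import Path

    SNAPSHOT = Path("/projects/NS2345K/.snapshots/Tuesday-15-Jan-2019")
    WORKSPACE = Path("/cluster/NS2345K")  # temporary space with nightly backup

    def restore(subdir: str) -> None:
        src = SNAPSHOT / subdir
        dst = WORKSPACE / subdir
        # copytree preserves the directory layout; dirs_exist_ok needs Python >= 3.8
        shutil.copytree(src, dst, dirs_exist_ok=True)
        print(f"copied {src} -> {dst}")

    if __name__ == "__main__":
        restore("my_dataset")  # hypothetical subdirectory name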

We will keep you updated.

Update 15:25: The recovery process is still scanning through the missing inodes from the snapshots and dispatching the operations to the related nodes. When this is done, files will start to be recovered.

Update 11:50: We have locked the project to avoid interference with the restoration process.

We have lost some of the data in project NS2345K. We are in the process of recovering the lost data; affected files will hopefully be restored gradually.

HW problems on Fram and NIRD storage

Updates:

  • 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on a new firmware for our systems.
  • 2019-01-07 11:30: The disk system on Fram will hopefully be back to normal soon, but further disks in the NIRD filesystem failed over the weekend and need to be replaced.
  • 2019-01-04 15:29: RAID sets are now rebuilding, and we expect them to finish within 24 hours.

We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to the disk losses, performance is also slightly degraded at this point.

To mitigate these issues we will have to reseat the I/O modules on the controllers, which might cause I/O hangs. We will keep you updated.

HW failures on Fram storage

Update 2018-12-21: The HW has been replaced and the /cluster file system should be 100% functional again.

Some hardware on the Fram storage is failing and needs urgent replacement. For this we have to fail over the disks served by two Lustre servers to other Lustre server nodes.
Some slowdown and short I/O hangs might be encountered on the /cluster file system during the maintenance.

We apologize for the inconvenience.

[RESOLVED] Short file system hang on /nird/home on Fram

Update: There was a second hang at around 09:00. The reason for the hangs has been found and fixed.

There was a short hang on the /nird/home ($HOME) file system on Fram from 08:00 to 08:55 today. The file system is back to normal now. We are investigating the reason for it.

Jobs running in $SCRATCH, $USERWORK or in a project directory have most likely not been affected, but it is probably a good idea to check the status of your jobs.
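
If you want a quick check, a small sketch along the following lines lists your jobs and their states; it assumes Slurm's squeue command is available on the Fram login nodes.

    #!/usr/bin/env python3
    """Sketch: list the current user's Slurm jobs and their states."""
    import getpass
    import subprocess

    def list_my_jobs() -> str:
        user = getpass.getuser()
        # %i = job id, %T = job state, %j = job name (standard squeue format codes)
        result = subprocess.run(
            ["squeue", "-u", user, "-o", "%i %T %j"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(list_my_jobs())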

Shared filesystem on Fram down

Update 2018-06-27 10:57: The /cluster file system is up again on Fram.

The shared filesystem on Fram (/cluster) is currently down. We are investigating it, and are trying to get it up again as soon as possible. We will update here when we know more.


/cluster file system hanging

Some of the Lustre object storage servers crashed during the night, making parts of the /cluster file system inaccessible. We are working on the problem and will keep you updated.

Metacenter Operations

Problems with $HOME on Fram

Some of the compute nodes, as well as the Fram login nodes, lost the connection to the NFS-mounted $HOME.
The login nodes were rebooted to clean up hanging processes and blocking I/O.

We are investigating this issue and working on a solution.

Thank you for your understanding!
Metacenter Operations

$HOME file system availability issues on Fram – FIXED

We are experiencing availability issues with the $HOME file system on Fram. The problem is currently under investigation and we are actively working on solving it.
Update 09:30:
The problem is fixed now.
One of the file servers exporting $HOME went down, and the failover did not work as intended.

Thank you for your understanding!
Metacenter Operations

Issues on /cluster file system

We have identified a bug on the /cluster file system which can lead to random job crashes.

The bug is triggered by a combination of the Lustre file system and Fortran code compiled with Intel MPI.

A bug report has now been filed with the storage vendor.

We will keep you updated!

Update 06-04-2018: We have found and fixed a problem on the file servers, and with the tests we ran we can no longer reproduce the problem.

Thank you for your understanding!
Metacenter Operations