Scheduled downtime on the 12th of February

Update:

  • 2019-02-15 11:18: We are still experiencing problems with the storage system on Fram. Disks began to mass-fail once again after the system had seemed stable during the night. We depend on the vendor to resolve these issues and are working closely with them.
    Because of this new instability we cannot give an estimate for when the system will be ready for general use again. This is an unfortunate situation and we understand the impact on you; we are trying all possible solutions to keep your data safe and bring the system back up as soon as possible.

    The OpsLog will be updated with new information when the status of the situation changes.

  • 2019-02-14 13:17: Due to missing parts and the size of the storage, disk recovery is progressing slowly, at approximately 50% reduced performance. Current ETAs are:
    • Fram: 15.02.2019
    • NIRD: 19.02.2019
    • Service Platform: 19.02.2019
  • 2019-02-13 19:07: Communication with the missing storage enclosures was re-established and the disk pools are rebuilding at this time. Unfortunately we cannot reopen the machines until the disk pools have stabilized. We will have a new round of checks and risk analysis tomorrow morning and will keep you updated here.
  • 2019-02-13 11:33: Some of the parts have arrived at the datacenter and we are working with the vendor on replacing them and patching the firmware on Fram. More details will follow as we know more.
  • 2019-02-12 15:38: The NIRD Tromsø and Fram storage systems each have one failed disk enclosure. We are waiting for replacement parts to arrive. After the replacement we will have to rebuild the disk pools before re-opening the machines for production. The current estimate is tomorrow evening. We will keep you updated.
  • 2019-02-12 12:36: The firmware upgrade on NIRD is finished. We are proceeding to restart the NIRD services. We will keep you posted.
  • 2019-02-12 08:17: Maintenance has started.
  • 2019-02-11 13:20: Because the disk problems accelerated during the weekend, we have changed the maintenance-stop reservation so that no new jobs will start until the maintenance is done. Already running jobs will not be affected. This has been done to reduce the risk of data loss.

We need to have a scheduled downtime on relatively short notice in order to upgrade the firmware on both the Fram and NIRD (including NIRD Toolkit) storage systems.
This is a critical and mandatory update which will increase the stability, performance and reliability of our systems.

The downtime is expected to last no more than a working day.

Fram jobs which cannot finish by the 12th of February are held in the queue and will not start until the maintenance is finished.
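
If your job's requested wall time extends past the start of the maintenance, one option (provided the job can actually finish in less time) is to reduce its time limit so the scheduler can fit it in before the window. A minimal sketch using standard Slurm commands; the job ID and new limit below are placeholders:

    # Show your pending jobs and the scheduler's estimated start times
    squeue -u $USER --start
    # Reduce the wall-time limit of a pending job (placeholder job ID and limit);
    # users can normally lower, but not raise, their own job's time limit
    scontrol update JobId=123456 TimeLimit=12:00:00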

Thank you for your understanding!
Metacenter Operations

Stallo shutdown on short notice

2019-02-02 14:00

Electrical work in the building housing Stallo revealed a defective residual-current device (RCD). The device has to be replaced; an order for a new one has been placed, and once it arrives the power in the building will have to be cut in order to replace it.

While we (the HPC staff) are waiting for more information, it has been decided that jobs that cannot finish by February 8th 08:00 will not start.

Unstable NIRD connection

We are experiencing an unstable network connection between NIRD and Fram. On the Fram login nodes, some of the project areas might be missing.

Please check both login nodes (login-1-1, login-1-2) for your project area. Currently all project areas are mounted on login-1-2.
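
As a quick check, you can list your project area on both login nodes, for example (the path below is a placeholder; use your own project directory):

    # Replace the path with your own project area
    ssh login-1-1 'ls /path/to/your/project/area'
    ssh login-1-2 'ls /path/to/your/project/area'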

We are working on this issue and will keep you updated.

We apologize for any inconvenience this may have caused you.

Fram queue system adjustment

As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.

This has enabled us to replace the limit on how many jobs per user the system will try to backfill with a more flexible setting.

Jobs are started and backfilled in priority order, and the priority of a job increases the longer it has been pending. With the new setting, only a fixed number of each user’s jobs within a project will increase in priority. The number is currently 10, but will probably be adjusted over time. This setting makes it easier for all users and projects to get jobs started when one user has submitted a very large number of jobs over a short time, while at the same time not preventing jobs from starting when there are free resources.

Note that the setting is per user and per project, so if more than one user submits jobs in the same project, each of them will get 10 jobs with increasing priority; similarly, if one user submits jobs to several projects, the user will get 10 jobs with increasing priority in each project.
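
To see how this affects your own jobs, you can inspect the priority of your pending jobs with standard Slurm commands (exact output formats may vary between versions):

    # Show how the priority of each of your pending jobs is composed (age, fair-share, etc.)
    sprio -u $USER
    # List your pending jobs with their current priority and pending reason
    squeue -u $USER -t PENDING -O jobid,name,priority,reason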

The setting is documented here.

DATA LOSS in NS2345K

Update 25.01.2019 11:20: The original recovery process has been terminated due to a limit on filesystem space. To make the project data accessible, we have taken the following steps:

  1. Project users are able to get access to the NS2345K project snapshot in read-only mode so that they can get to the necessary data. Snapshot location:
    /projects/NS2345K/.snapshots/Tuesday-15-Jan-2019
  2. The project users will also get access to a space created for project NS2345K on the Fram side, namely /cluster/NS2345K. This is a temporary space where project users can work and store new data (see the example command after this list). The space has a nightly backup and is also accessible from NIRD (at the moment only from login4.nird.sigma2.no) at:
    /projects/NS2345K/FRAM
  3. In the meantime, we are working on recovering the NS2345K project space.
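
As an illustration of steps 1 and 2, needed data can be copied from the read-only snapshot into the temporary space, for example from login4.nird.sigma2.no (the directory name below the snapshot root is a placeholder):

    # Copy a directory from the read-only snapshot into the temporary Fram-backed space
    # ("my_dataset" is a placeholder; adjust to your own data)
    rsync -av /projects/NS2345K/.snapshots/Tuesday-15-Jan-2019/my_dataset/ \
          /projects/NS2345K/FRAM/my_dataset/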

We will keep you updated.

Update 15:25: The recovery process is still scanning through the missing inodes from the snapshots and dispatching the operations to the related nodes. When this is done, files will start to be recovered.

Update 11:50: We have locked the project to avoid interference with the restoration process.

We have lost some of the data in project NS2345K. We are in the process of recovering the lost data; affected files will hopefully be recovered gradually.