NIRD and Service Platform downtime on Thursday, 22nd of August

Update:

  • 2019-08-26 12:45: NIRD project areas are mounted on Fram login nodes.
  • 2019-08-25 14:15: The Service Platform is up now. You can log in to NIRD and access your files.
    NIRD project areas will be reconnected to Fram tomorrow.
  • 2019-08-23 18:42: The vendor started a forced health check on the system, which is taking more time than expected. We will re-open access to NIRD and the Service Platform as soon as the checks and rebuilds are finished.
  • 2019-08-23 08:05: The storage vendor has finished the hardware replacements and the installation of new firmware on the storage system.
    We are currently monitoring the storage system together with the vendor.

Dear NIRD and Service Platform users,

We have a planned downtime on the 22nd of August to replace some defective hardware. Systems will be taken offline starting from 08:00.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to finish in one and a half days.

NIRD projects will still be accessible during the maintenance from login-trd.nird.sigma2.no, but in read-only mode.

We will keep you updated here.

Sorry for the short notice.

Metacenter Operations

Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes have crashed, which means some jobs may have been killed.

Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state, while 197 nodes are in down state. We will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.

NIRD available again

Dear NIRD and NIRD Toolkit users,

After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still being rebuilt; the rebuilds should finish in a few hours. Until then, you might encounter a slight performance loss.

We will proceed with bringing up the Service Platform during the day today.

Thank you for your understanding and patience!
Metacenter Operations

Vilje is online

The InfiniBand error was due to a controller module with a bad connection. This has been corrected.

The queueing system is back online. In addition, 19 more nodes have been recovered.

Three jobs were lost. We apologize for the inconvenience.