Planned maintenance on Fram on 16.10.2019

Update:

  • 2019-10-18 14:36 We have finished the reinstallation, configuration checks, QA and tests. Access to the machine has been reopened and queued jobs are running again.
  • 2019-10-18 06:12 Reinstallation of compute nodes is much slower than anticipated, so re-opening of the machine is delayed. We are doing our best to finish the maintenance as soon as possible. In parallel we are conducting tests and benchmarks.
    We will keep you updated.
  • 2019-10-17 08:25 File system servers and infrastructure switches were patched yesterday.
    We are proceeding now with the upgrade of the service and the login nodes.
  • 2019-10-16 08:07 Maintenance has started.

Dear Fram User,

We will have a planned two-day downtime starting at 08:00 AM on the 16th of October for maintenance on the storage and the file system.

During this time we will, together with the vendor, upgrade the storage firmware, upgrade the software on the /cluster file system servers and upgrade the operating system on Fram.

This upgrade is necessary to fix the frequent issues with the metadata servers and to enhance the stability and security of the system.

Fram jobs that cannot finish by the 16th of October will remain queued and will not start until the maintenance is finished.

Thank you for your consideration!

Metacenter Operations

NIRD and service platform downtime on Thursday 22nd of August.

Update:

  • 2019-08-26 12:45: NIRD project areas are mounted on Fram login nodes.
  • 2019-08-25 14:15: Service Platform is up now; you can now log in to NIRD and access your files.
    NIRD project areas will be reconnected to Fram tomorrow.
  • 2019-08-23 18:42: The vendor started a forced health check on the system, which is taking more time than expected. We will re-open access to NIRD and Service Platform as soon as checks and rebuilds are finished.
  • 2019-08-23 08:05: The storage vendor has finished the hardware replacements and installation of new firmware on the storage system.
    We are currently monitoring the storage system together with the vendor.

Dear NIRD and Service Platform users,

We have a planned downtime on the 22nd of August, to replace some defective hardware. Systems will be taken offline starting from 08:00AM.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to take one and a half days.

NIRD projects will still be accessible during the maintenance from login-trd.nird.sigma2.no, but in read-only mode.

We will keep you updated here.

Sorry for the short notice.

Metacenter Operations

Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.

Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state while 197 nodes are in down state. We will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.