NIRD and Service Platform downtime on Thursday, 22nd of August

Update:

  • 2019-08-26 12:45: NIRD project areas are mounted on Fram login nodes.
  • 2019-08-25 14:15: The Service Platform is up again; you can now log in to NIRD and access your files.
    NIRD project areas will be reconnected to Fram tomorrow.
  • 2019-08-23 18:42: The vendor started a forced health check on the system, which is taking more time than expected. We will re-open access to NIRD and the Service Platform as soon as the checks and rebuilds are finished.
  • 2019-08-23 08:05: The storage vendor has finished the hardware replacements and the installation of new firmware on the storage system.
    We are currently monitoring the storage system together with the vendor.

Dear NIRD and Service Platform users,

We have a planned downtime on the 22nd of August to replace some defective hardware. Systems will be taken offline starting from 08:00.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to finish in one and a half days.

NIRD projects will still be accessible during the maintenance from login-trd.nird.sigma2.no, but in read-only mode.

We will keep you updated here.

Sorry for the short notice.

Metacenter Operations

Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes have crashed, which means some jobs may have been killed.

Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in the maintenance state, while 197 nodes are in the down state. We will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.

NIRD and Service Platform downtime – 26.06.2019

  • 2019-07-02 08:00: All Service Platform services have resumed. Some services may not be working properly and may need to be restarted after the maintenance. If you experience any problems with your service, please do not hesitate to contact us as soon as possible.
  • 2019-06-27 19:58: NIRD filesystems are mounted on Fram again.
  • 2019-06-27 19:46: NIRD login nodes are running again; you may log in and access your files stored on NIRD.
    Remaining Service Platform services will be started tomorrow morning.
  • 2019-06-27 09:54: We are now restarting and testing the file system.
  • 2019-06-26 22:08: All hardware replacements are done, and the storage system is being monitored for any signs of instability. The filesystem restart is planned for tomorrow morning. We will keep you updated.
  • 2019-06-26 14:05: The vendor is meticulously checking each NIRD storage component and has decided to replace the main controller chassis.
    In the meantime, we are applying firmware updates on the Service Platform to improve stability and security.
  • 2019-06-26 08:15: Maintenance has started.

Dear NIRD and Service Platform User,

We have a planned downtime on the 26th of June, Wednesday next week, to replace some defective hardware. Systems will be taken offline starting from 08:00.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to finish in one day.
We will keep you updated here.

Metacenter Operations

Fram development queue

Dear Fram User,

As of today, we have adjusted the queue system policies to facilitate code development and testing on Fram while limiting possible misuse of the devel queue.

The devel queue now allows:

  • max 4 node jobs
  • max 30 minutes wall time
  • max 1 job per user

We have additionally introduced a short queue with the following settings:

  • max 10 node jobs
  • max 120 minutes wall time
  • max 2 jobs per user
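
To check in advance whether a planned test job fits one of these queues, the limits above can be written down as a small table and compared against the request. The following is only a minimal sketch, not part of the queue system: the names QueueLimits, check_request and smallest_fitting_queue are hypothetical, "max N node jobs" is read here as "at most N nodes per job", and the per-user limit is assumed to count the user's jobs already in that queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    """Limits for one queue (hypothetical helper type, for illustration only)."""
    max_nodes: int
    max_wall_minutes: int
    max_jobs_per_user: int


# Values copied from the two lists in this announcement.
LIMITS = {
    "devel": QueueLimits(max_nodes=4, max_wall_minutes=30, max_jobs_per_user=1),
    "short": QueueLimits(max_nodes=10, max_wall_minutes=120, max_jobs_per_user=2),
}


def check_request(queue, nodes, wall_minutes, jobs_already_queued=0):
    """Return a list of limit violations; an empty list means the request fits."""
    limits = LIMITS[queue]
    problems = []
    if nodes > limits.max_nodes:
        problems.append(f"{nodes} nodes exceeds max {limits.max_nodes}")
    if wall_minutes > limits.max_wall_minutes:
        problems.append(
            f"{wall_minutes} min wall time exceeds max {limits.max_wall_minutes}"
        )
    # Assumption: one more job is refused once the per-user limit is reached.
    if jobs_already_queued >= limits.max_jobs_per_user:
        problems.append(
            f"already {jobs_already_queued} job(s), max {limits.max_jobs_per_user}"
        )
    return problems


def smallest_fitting_queue(nodes, wall_minutes):
    """Return the first queue (devel, then short) that fits, or None."""
    for name in ("devel", "short"):
        if not check_request(name, nodes, wall_minutes):
            return name
    return None


if __name__ == "__main__":
    print(check_request("devel", nodes=5, wall_minutes=45))
    print(smallest_fitting_queue(nodes=8, wall_minutes=60))
```

For example, a request for 8 nodes and 60 minutes does not fit devel but fits short, while a request for 16 nodes fits neither and would have to go to the normal queue.
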

We will continue to monitor and improve the queue system. Please stay tuned.
You may find more information here.

Metacenter Operations

Fram MDS patched

Dear Fram User,

This morning around 09:05, the Fram metadata server crashed once again, which likely affected running jobs.

A mitigating patch was delivered by the vendor yesterday, and we used this opportunity to apply it to our metadata servers.

We will keep the system closely monitored and cooperate with the vendor on further stabilizing the system.

Apologies for any inconvenience this may have caused!