Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room.
To reduce the load, we have had to reserve the entire cluster, which means no jobs will run.
We are sorry for the inconvenience and will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.

Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in the maintenance state, while 197 nodes are in the down state. We will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.

NIRD and Service Platform downtime – 26.06.2019

  • 2019-07-02 08:00: All Service Platform services have resumed. Some services may not be working properly and may need to be restarted after the maintenance. If you experience any problem with your service, please do not hesitate to contact us as soon as possible.
  • 2019-06-27 19:58: NIRD filesystems are mounted back to Fram.
  • 2019-06-27 19:46: NIRD login nodes have been started again. You may now log in and access your files stored on NIRD.
    Remaining Service Platform services will be started tomorrow morning.
  • 2019-06-27 09:54: We are now restarting and testing the file system.
  • 2019-06-26 22:08: All hardware replacements are now done, and the storage system is being monitored for any signs of instability. The file system restart is planned for tomorrow morning. We will keep you updated.
  • 2019-06-26 14:05: The vendor is meticulously checking each NIRD storage component and has decided to replace the main controller chassis.
    In the meantime, we are applying firmware updates on the Service Platform to improve stability and security.
  • 2019-06-26 08:15: Maintenance has started.

Dear NIRD and Service Platform User,

We have a planned downtime on Wednesday next week, the 26th of June, to replace some defective hardware. Systems will be taken offline starting at 08:00.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to finish in one day.
We will keep you updated here.

Metacenter Operations

NIRD available again

Dear NIRD and NIRD Toolkit User,

After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD has been reopened.
The vendor has replaced the failing hardware and we are back online. Some disk pools are still being rebuilt, which should finish in a few hours. Until then, you might encounter a slight performance loss.

We will proceed with bringing up the Service Platform during the day today.

Thank you for your understanding and patience!
Metacenter Operations

550 nodes down on Fram

550 nodes went down at 00:00 on Monday morning. We are investigating the issue and will bring the nodes back online as soon as possible.