Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.

Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state, while 197 nodes are in down state. We will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55: Fram is expected to be back to full capacity on Wednesday, 2019.07.10.

NIRD available again

Dear NIRD and NIRD Toolkit User,

After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still being rebuilt, which should finish in a few hours. Until then, you might encounter a slight performance loss.

We will proceed with bringing up the Service Platform during the day.

Thank you for your understanding and patience!
Metacenter Operations

Vilje is online

The InfiniBand error was caused by a controller module with a bad connection. This has been corrected.

The queueing system is back online. In addition, 19 more nodes have been recovered.

Three jobs were lost. We apologize for the inconvenience.

Vilje is back online

Vilje is online.

The outage was caused by a loss of InfiniBand connectivity due to the loss of two InfiniBand switches.

36 nodes will remain out of production.

There may still be DNS issues affecting connectivity from inside the cluster to the outside (e.g. licence server lookups). Please report any issues to: support@metacenter.no

 

Emergency Stop of Fram, NIRD and the Service Platform

Update:

  • 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
  • 2018-11-07 14:53: User home directories have been migrated to /cluster/home and Fram is starting up again. We will soon reopen access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.

 

Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data.  This means stopping all jobs and user processes, and logging users out of the systems.

We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today.  (Note that the NIRD project areas will _not_ be available until NIRD is up again.)

We will update this post with more information when we know more.

Cooling issues in Fram server room

Update:

  • 11:15  The cooling distribution units are functional again and the compute nodes have been started once more.
  • 10:12  The cooling units failed once again and the compute nodes were automatically switched off. We are looking into the problem.
  • 09:22  Cooling is functional again and the Fram compute nodes are being started. The machine should be fully operational shortly.

——————————————————————————————-

We had trouble with one of the cooling units in the Fram server room today around 06:30.
Safety mechanisms switched off most of the Fram compute nodes.

Thank you for your understanding!
Metacenter Operations