Cooling issues in Fram server room

Update:

  • 11:15  The cooling distribution units are functional again and the compute nodes have been started again.
  • 10:12  The cooling units failed once again and the compute nodes were automatically switched off. We are looking into the problem.
  • 09:22  Cooling is functional again and the Fram compute nodes are being started. The machine should be fully operational shortly.

——————————————————————————————-

We had trouble with one of the cooling units in the Fram server room today at around 06:30.
Safety mechanisms switched off most of the Fram compute nodes.

Thank you for your understanding!
Metacenter Operations

Fram: scheduled downtime on the 28th of August

UPDATE

2018-08-28 18:02 Fram is up and jobs are running again.

We will have a one-day scheduled downtime on Fram on the 28th of August, starting at 08:00.

Jobs that cannot finish before the maintenance window will be left pending in the queue with the Reason “ReqNodeNotAvail” and will be started when the maintenance is over.
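To see which of your own jobs are being held back this way, something like the following may help. It is only a sketch: it assumes the queue is managed by Slurm and that the reason column begins with “ReqNodeNotAvail”; the helper name filter_maintenance_held is made up for this example.

```shell
# Hypothetical helper: reads "JOBID REASON" lines on stdin and prints the
# job IDs whose pending reason starts with ReqNodeNotAvail.
filter_maintenance_held() {
    awk '$2 ~ /^ReqNodeNotAvail/ { print $1 }'
}
```

It could be fed from the queue listing with: squeue -h -u "$USER" -t PENDING -o "%i %r" | filter_maintenance_held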

We will keep you updated via OpsLog/Twitter.

Thank you for your consideration!
Metacenter Operations

Compute nodes down – Fixed

Update 2018-08-02 09:35 Most of the compute nodes are up and we are working to fix the remaining few. Jobs are running again.

Compute nodes went down due to a power spike on the 1st of August at around 19:00. We are restarting the system and will update this post as soon as it is functional again.

Issues with job completion – FIXED

Update 2018-07-26 14:39 The issue with the Fram file system is now fixed and jobs should run as normal.

We are experiencing some problems at the moment, most likely caused by a file system issue. We are doing our best to bring the services back to normal; however, as most of the experts are on holiday, this may take longer than usual. Please check back here for updates.

Shared filesystem on Fram down

Update 2018-06-27 10:57 The /cluster file system is up again on Fram.

The shared filesystem on Fram (/cluster) is currently down. We are investigating it, and are trying to get it up again as soon as possible. We will update here when we know more.

 

550 nodes down on Fram

550 nodes went down at 00:00 on Monday morning. We are investigating the issue and will bring the nodes back online as soon as possible.

/cluster file system hanging

Some of the Lustre object storage servers crashed during the night, making parts of the /cluster file system inaccessible. We are working on the problem and will keep you updated.

Metacenter Operations