Issues with job completion – FIXED

Update 2018-07-26 14:39: The issue with the Fram file system is now fixed and jobs should run as normal.

We are currently experiencing problems, most likely caused by a file system issue. We are doing our best to bring the services back to normal; however, as most of the experts are on holiday, this may take longer than usual. Please check back here for updates.

Shared filesystem on Fram down

Update 2018-06-27 10:57: The /cluster file system is up again on Fram.

The shared filesystem on Fram (/cluster) is currently down. We are investigating and are trying to bring it back up as soon as possible. We will post an update here when we know more.

550 nodes down on Fram

550 nodes went down at 00:00 Monday morning. We are investigating the issue and will bring the nodes back online as soon as possible.

/cluster file system hanging

Some of the Lustre object storage servers crashed during the night, making parts of the /cluster file system inaccessible. We are working on the problem and will keep you updated.

Metacenter Operations

Cooling system issues – some jobs crashed

A sudden rise in pressure in the cooling system at around 13:00 took one of the cooling units down. Restarting it affected the other unit as well.
This triggered a safety stop for some of the compute nodes, causing some of the running jobs to crash prematurely.

Affected jobs have been re-queued.

We apologize for the inconvenience this has caused.
Metacenter Operations

Planned maintenance on 12th of June

Update

  • 2018-06-12 17:16 Access to NIRD is now reopened.
  • 2018-06-12 16:05 Services are started and back in production on the NIRD Service Platform.
  • 2018-06-12 15:55 The queue reservation is now removed and jobs are running on part of Fram. The rest of the nodes will be added back to the queue as soon as they are updated.
  • 2018-06-12 14:55 Access to Fram is reopened. The queue reservation is still in place.
  • 2018-06-12 08:34 Maintenance has started.

Dear Fram and NIRD user,

We will have a one-day planned maintenance on the 12th of June, starting at 08:30 AM.
Fram, NIRD and the Service Platform will be affected. One storage enclosure must be replaced, which requires downtime for the file systems served from NIRD.

There is a system reservation in place on Fram starting on 12.06.2018 08:45 AM. Jobs that cannot finish before the maintenance window will be left pending in the queue with the Reason “ReqNodeNotAvail” and will be started when the maintenance is over.
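
To estimate whether a job can still start before the maintenance, the rule above can be sketched as a simple time comparison. This is a simplified illustration, not the scheduler's actual logic: it assumes the scheduler compares the job's requested wall time limit (not its real runtime) with the start of the reservation, and it ignores node availability and queue priority. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Start of the system reservation, taken from the announcement above.
RESERVATION_START = datetime(2018, 6, 12, 8, 45)


def fits_before_reservation(now, time_limit, reservation_start=RESERVATION_START):
    """Return True if a job started at `now` would reach its requested
    time limit no later than the start of the reservation."""
    return now + time_limit <= reservation_start


def expected_state(now, time_limit):
    """Hypothetical helper: describe what a user would likely see in the queue.

    Assumption: a job that does not fit is held with the reason
    "ReqNodeNotAvail", as described in the announcement.
    """
    if fits_before_reservation(now, time_limit):
        return "eligible to start"
    return "pending (ReqNodeNotAvail)"


# Example: evening before the maintenance, 20:00.
evening = datetime(2018, 6, 11, 20, 0)
print(expected_state(evening, timedelta(hours=12)))  # would end at 08:00
print(expected_state(evening, timedelta(hours=13)))  # would end at 09:00
```

In practice this means that requesting a shorter wall time limit (for example with the `--time` option when submitting) may allow a job to start before the maintenance, provided the job really can finish within that limit.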

We will keep you updated via OpsLog/Twitter.

Thank you for your consideration!
Metacenter Operations

Temporary development needs

As many of you know, we have a special setup for development jobs, i.e., short jobs meant for quick development. However, we see that it is quite challenging to fulfill all development needs with one permanent setup. Hence, if you have a demonstrated development need of a temporary nature that does not fit in the devel QoS (https://documentation.sigma2.no/jobs/jobtypes.html), please contact us at support@metacenter.no and we will try to help you.

Longer queue times on Fram

We will have a four-hour maintenance on the cooling system for Fram on the 16th of May from 09:00. To limit cooling requirements, only half of the compute nodes will be operational during this time.
This might lead to longer queue times.

Thank you for your understanding!
Metacenter Operations