Cooling system issues – some jobs crashed

A sudden pressure spike in the cooling system around 13:00 took one of the cooling units down. Restarting it affected the other unit as well.
This triggered a safety stop on some of the compute nodes, causing some of the running jobs to crash prematurely.

Affected jobs have been re-queued.

We apologize for the inconvenience this has caused.
Metacenter Operations

Planned maintenance on 12th of June

Update

  • 2018-06-12 17:16 Access to NIRD is reopened now.
  • 2018-06-12 16:05 Services are started and back in production on the NIRD Service Platform.
  • 2018-06-12 15:55 Queue reservation is now removed and jobs are running on part of Fram. The rest of the nodes will be added back to the queue as soon as they are updated.
  • 2018-06-12 14:55 Access is re-opened to Fram. Queue reservation is still in place.
  • 2018-06-12 08:34 Maintenance has started.

Dear Fram and NIRD user,

We will have a one-day planned maintenance on the 12th of June, starting at 08:30 AM.
Fram, NIRD and the Service Platform will be affected. One storage enclosure must be replaced, which requires downtime for the file systems served from NIRD.

There is a system reservation in place on Fram starting on 12.06.2018 08:45 AM. Jobs that cannot finish before the maintenance window will be left pending in the queue with the Reason “ReqNodeNotAvail” and will be started when the maintenance is over.
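As a rough illustration of the rule above, the following Python sketch checks whether a job with a given wall-time limit would end before the reservation begins. The function name and the sample start time are hypothetical, chosen only for illustration; the scheduler's own decision is authoritative and may take other factors into account.

```python
from datetime import datetime, timedelta

# Reservation start taken from the announcement above.
RESERVATION_START = datetime(2018, 6, 12, 8, 45)


def fits_before_reservation(earliest_start, walltime,
                            reservation_start=RESERVATION_START):
    """Return True if a job starting at earliest_start with the given
    wall-time limit would end no later than the reservation start."""
    return earliest_start + walltime <= reservation_start


# Hypothetical example: a job that could start the evening before.
start = datetime(2018, 6, 11, 20, 0)
print(fits_before_reservation(start, timedelta(hours=12)))  # ends 08:00 on the 12th
print(fits_before_reservation(start, timedelta(hours=24)))  # ends 20:00 on the 12th
```

Under these assumptions the 12-hour job fits before the window and the 24-hour job does not, so requesting a shorter wall time may let a job run before the maintenance.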

We will keep you updated via OpsLog/Twitter.

Thank you for your consideration!
Metacenter Operations

Temporary development needs

As many of you know, we have a special setup for development jobs, i.e., short jobs meant for quick development. We now see that it is quite challenging to fulfill all development needs with one permanent setup. Hence, if you have a demonstrable, temporary development need that does not fit in the devel QoS (https://documentation.sigma2.no/jobs/jobtypes.html), please contact us at support@metacenter.no and we will try to help you.

Fram longer queue times

We will have a four-hour maintenance on the cooling system for Fram on the 16th of May from 09:00. To limit cooling requirements, only half of the compute nodes will be operational during this time.
This might lead to longer queue times.

Thank you for your understanding!
Metacenter Operations

 

Problems with $HOME on Fram

Some of the compute nodes, as well as the Fram login nodes, lost connection to the NFS-mounted $HOME.
The login nodes were rebooted to clean up hanging processes and blocked I/O.

We are investigating this issue and working on a solution.

Thank you for your understanding!
Metacenter Operations

EasyDMP is in production!

Dear NIRD Users,

It is with great excitement that we in UNINETT Sigma2 announce the launch of easyDMP, a new service that offers researchers with minimal experience in data management a simple way of creating a Data Management Plan (DMP). This is achieved by transforming any funding agency’s or institution’s data management guidelines and policies into a series of easy-to-answer questions, many of which come with a simple list of canned answers to pick from. The resulting plan can be used as a blueprint for researchers to put in place the necessary elements that ensure their data are adequately managed. The plan can be edited and shared, and also duplicated to serve as a starting point for other datasets.

EasyDMP is free of charge and available to any researcher in Norway and in Europe:

https://easydmp.sigma2.no/

EasyDMP has been developed and is operated by Sigma2 in collaboration with the EUDAT2020 project. EasyDMP presently implements the EU H2020 recommendations, but the service has been designed to easily integrate other schemas, for example institution-specific recommendations. Please do not hesitate to contact us if you want to integrate easyDMP with your own tailored DMP questionnaire scheme.

Improvements to the tool will be driven by your needs. Thanks to the continuous deployment method, the easyDMP service will be adding new functionalities continuously. We can already anticipate that the next release will have functionality that enables other services to make use of the plan output in compliance with the FAIR principles.

We are now working to establish an external reference group for the service that will include experts from user communities, librarians, curators and national service providers. We believe that the easyDMP service will benefit from a wide national pool of competence and stakeholders.

Please feel free to test it and start using it, and do not hesitate to give us feedback at support@easydmp.sigma2.no.

More info about easyDMP here:
https://www.sigma2.no/content/easydmp

Two-day downtime starting on 25th of April

Update:

  • 2018-04-30 14:46 File system issues are solved now on Fram and access is reopened. Jobs are temporarily on hold due to some troubles with the cooling system in the server room. As soon as that is sorted out, jobs will be permitted again.
  • 2018-04-30 10:15 We are still struggling with the /cluster file system. The problem has been escalated to the vendor. At the moment we do not have a time estimate for when Fram will be back online, but work is in progress to fix this as soon as possible, hopefully during the day.
  • 2018-04-27 18:44 Unfortunately there are still problems bringing up the Lustre file system on Fram. The issue is caused by an incompatibility affecting routing between IB networks/fabrics on the Lustre object storage servers. The vendor is now planning and working to carry out an emergency update on the system. We are sorry for the trouble.
  • 2018-04-27 16:49 Access to NIRD is reopened now.
  • 2018-04-26 22:50 We are having problems bringing up the Lustre file system on Fram. The issue has been reported to the vendor. Additionally, there are some minor issues which must be addressed on NIRD before opening it for production, but we expect to reopen access to both Fram and NIRD tomorrow.

 

Dear Fram and NIRD user,

A two-day downtime is scheduled for week 17. The scheduled maintenance will start on Wednesday, 25th of April, at 09:00 AM and will affect Fram, NIRD and the Service Platform.

During this time we will:
1. Extend NIRD storage space by ~1.1 PB.
– The new hardware will be coupled to NIRD and extra disks loaded into the system during these two days.
– Please note that the storage advertised above will not be available all at once. Storage space will be added gradually as the loaded disks are formatted and made available to the file system.
– One of our top priorities is to address the inode shortage on $HOME areas.
2. Address file system related bugs on NIRD by upgrading the relevant software and tuning some parameters on the servers.
3. Fix broken hardware on Fram.
4. Apply any outstanding patches to both Fram and NIRD.
5. Carry out maintenance work on the cooling system for Fram.

There is a job reservation in place on Fram starting at 08:45 AM on 25th of April. Jobs that cannot complete before that time will be left pending in the queue with the Reason “ReqNodeNotAvail” and an estimated start time of 2154. They will be started when the maintenance is over.
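To see which of your pending jobs are being held back by the reservation, you can filter the queue listing on the pending reason. The Python sketch below is only an illustration: it assumes the listing has been produced as one `jobid|reason` pair per line (for example with something like `squeue -h -t PENDING -o "%i|%r"`), and the helper name and the sample job IDs are made up.

```python
def held_by_maintenance(queue_text):
    """Return the job IDs whose pending reason starts with ReqNodeNotAvail.

    Expects one 'jobid|reason' pair per line; other lines are skipped.
    """
    held = []
    for line in queue_text.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue
        job_id, reason = line.split("|", 1)
        # The reason may carry extra detail after the keyword,
        # so match on the prefix only.
        if reason.strip().startswith("ReqNodeNotAvail"):
            held.append(job_id.strip())
    return held


# Made-up sample input, for illustration only.
sample = """\
1001|ReqNodeNotAvail
1002|Priority
1003|ReqNodeNotAvail
"""
print(held_by_maintenance(sample))
```

Jobs reported this way need no action; they should start on their own once the reservation is lifted.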

We will keep you updated via OpsLog/Twitter.

Thank you for your consideration!
Metacenter Operations

$HOME file system availability issues on Fram – FIXED

We are experiencing availability issues with the $HOME file system on Fram. The problem is currently under investigation and we are actively working on solving it.
Update 09:30:
Problem is fixed now.
One of the file servers exporting $HOME went down and the failover didn’t work as intended.

Thank you for your understanding!
Metacenter Operations