Vilje is online

The InfiniBand error was caused by a controller module with a bad connection. This has been corrected.

The queueing system is back online. In addition, 19 more nodes have been recovered.

Three jobs were lost. We apologize for the inconvenience.

 

 

Vilje InfiniBand problems

We are currently experiencing InfiniBand problems on Vilje. The queueing system is unavailable until further notice.

Some jobs may have been lost.

 

 

 

Fram up and running again.

2018-11-23 15:24: Fram is now available for use again. The maintenance was a success and we were able to complete everything on our list. As previously announced via email, NIRD and the NIRD Service Platform are already up and running.

Vilje is back online

Vilje is online.

The outage was caused by a loss of InfiniBand connectivity (two InfiniBand switches were lost).

36 nodes will remain out of production.

There may still be DNS issues affecting connectivity from inside the cluster to the outside (e.g. licence server lookups). Please report any issues to support@metacenter.no

 

Fram, NIRD and Service Platform scheduled downtime starting on 21st November 2018

UPDATE:

  • 2018-11-23 09:45:
    • We plan to reopen access to Fram later today. We will keep you updated.
    • We are currently running benchmarks and tests on the upgraded system.
    • The cooling units are now being stress-tested to pinpoint any outstanding issues.
    • OpenMPI has now been upgraded.
  • 2018-11-22 22:48:
    • The Lustre servers on Fram have been upgraded, and several tests and benchmarks were run to fine-tune parameters.
    • The first step of the CDU maintenance is being carried out now.
  • 2018-11-22 11:33:
    • NIRD Service Platform is up again.
    • Access to NIRD is reopened. Please note that we now have four login nodes and that the SSH fingerprint has changed.
  • 2018-11-21 18:41:
    • The necessary hardware replacement for NIRD has been carried out, and all firmware upgrades are finalized on the storage systems at both the Tromsø and Trondheim sites.
    • The NIRD file systems have been started again, and we plan to reopen access tomorrow before noon.
    • The firmware on the Fram storage system has been updated.
    • Several other updates, including the OpenFabrics stack and Lustre, were done in parallel.
  • 2018-11-21 08:00: Maintenance has started.

Dear Fram, NIRD and Service Platform User,

On the 21st and 22nd of November we will have a scheduled maintenance on Fram, NIRD and Service Platform.

This will be a comprehensive maintenance on the national HPC and research data infrastructure, ongoing on multiple levels and sites. Due to its complexity and the amount of work involved, some parts of the infrastructure might require the downtime to be extended to the 23rd of November.

The work will include, but will not be limited to:

– firmware upgrades on disks, enclosures, chassis, etc.

– operating system upgrades

– queue system upgrade

– file system upgrades

– kernel upgrades

– upgrade of OpenMPI

– upgrades on the OpenFabrics stack

– maintenance on the cooling system units

 

Our aim is to enhance the stability and security of the infrastructure, eliminate bugs and enhance performance, while having the shortest downtime possible.

We understand that system unavailability has a big impact on your daily work, and as such we will try to bring our systems back into operation as soon as possible.

 

Thank you for your consideration!

Metacenter Operations

NIRD and Service Platform downtime

UPDATE

2018-11-12 11:55: The login node and services are back in production.

2018-11-12 10:20: The disk pool RAID sets were rebuilt by Saturday, but a set of drives then failed again. A new rebuild was ongoing, and today we had to reset the I/O card and power-cycle the storage. At this point everything is up and functional on the storage side, and the file system is up. We are currently switching geo-replication back and expect to reopen access around 12:00 PM today. We will keep you posted.

2018-11-09 13:59: The firmware has now been applied without any problems. However, we still need to wait for a rebuild to finish; the estimated time remaining is 12 hours. We will open the system for regular use as soon as we can.

2018-11-09 12:45: Most of the rebuilds are complete and we are currently patching the firmware on the disk enclosures. If all goes well, we expect to have NIRD up and functional during the day today. We will keep you updated.

2018-11-08 13:27: The firmware update is running. We have to wait for the rebuild of the broken drives before we can upgrade the enclosures and finish the emergency maintenance. We don’t expect the rebuild to be finished before tomorrow (Friday, November 9th). Hence the system as a whole will not be available before tomorrow.

We are very sorry for any inconvenience this may cause.

Fram $HOME migrated to /cluster file system

Dear Fram User,

Some of you might have experienced sporadic I/O hangs on Fram recently.
In many cases the I/O hangs were caused by overloading the RPC queue on the NFS-mounted /nird/home file system. This had a negative performance impact on the compute nodes and in some cases led to job crashes.

Therefore we have decided to migrate all Fram users’ $HOME directories from /nird/home/$USER to /cluster/home/$USER, starting with the next scheduled maintenance. Preparations have been made and some accounts were already synchronized over during the past few weeks.

Because we suddenly lost a large number of disks on NIRD today, we have decided, in order to avoid data loss, to stop all user I/O on NIRD and migrate the remaining user accounts over to Fram.

Starting from today – 2018-11-07 – /nird/home is unmounted from Fram, but will still be available on NIRD. Until the next scheduled maintenance, we have created a symbolic link from /nird/home to /cluster/home so that any scripts referring to the old path can be adjusted.
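As a rough illustration of how scripts could be adjusted, the following Python sketch rewrites hard-coded /nird/home prefixes to /cluster/home in the text of a script. The function names and the matching rule are assumptions made for this example, not a tool provided by Metacenter. It is a purely textual rewrite, so please review the result before using it.

```python
import re

OLD_HOME = "/nird/home"
NEW_HOME = "/cluster/home"

# Match the old prefix only as a complete path prefix:
# - the lookbehind skips longer paths such as /data/nird/home
# - the lookahead skips names such as /nird/homework
_OLD_HOME_RE = re.compile(
    r"(?<![\w.-])" + re.escape(OLD_HOME) + r"(?=/|$|[\s\"':;)])"
)


def count_old_paths(text: str) -> int:
    """Return how many references to the old home prefix the text contains."""
    return len(_OLD_HOME_RE.findall(text))


def rewrite_home_paths(text: str) -> str:
    """Return the text with the old home prefix replaced by the new one."""
    return _OLD_HOME_RE.sub(NEW_HOME, text)
```

For example, rewrite_home_paths("cd /nird/home/$USER/run") returns "cd /cluster/home/$USER/run", while a path such as /nird/homework is left unchanged. Running count_old_paths on a script first shows whether it needs any changes at all.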

As soon as NIRD disk issues are remediated, nightly backups will be taken from Fram to /nird/home/$USER/backup/fram.

This step makes Fram less dependent on NIRD, so from this point on we will be able to schedule maintenance on NIRD without affecting running jobs.

Thank you for your understanding!
Metacenter Operations

Emergency Stop of Fram, NIRD and the Service Platform

Update:

  • 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
  • 2018-11-07 14:53: User home directories were migrated over to /cluster/home and Fram is starting up again. We will soon reopen access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.

 

Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data.  This means stopping all jobs and user processes, and logging users out of the systems.

We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today.  (Note that the NIRD project areas will _not_ be available until NIRD is up again.)

We will update this post with more information when we know more.