Stallo downtime January 10th 10:00 – January 11th 16:00

2019-11-01 14:37 Update: Stallo is back online and in production again!

Due to work on the electrical power infrastructure in the building housing Stallo, we need to power off the machine during the given period. All jobs whose walltime extends beyond the start of the power-off will be held pending in the queue until the system is up and running again.
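To estimate whether a job can still run before the power-off, compare its expected end time (start time plus requested walltime) with the start of the downtime. The snippet below is only a rough sketch of that arithmetic, not the queue system's actual logic: the function name `fits_before_downtime` is hypothetical, and the year 2019 is an assumption, since the announcement does not state one. The scheduler may also apply margins that are not modelled here.

```python
from datetime import datetime, timedelta

# Assumption: the announcement gives no year for the power-off; 2019 is used for illustration.
DOWNTIME_START = datetime(2019, 1, 10, 10, 0)


def fits_before_downtime(start_time: datetime, walltime: timedelta,
                         downtime_start: datetime = DOWNTIME_START) -> bool:
    """Return True if a job starting at start_time with the requested walltime
    would finish no later than the start of the power-off."""
    return start_time + walltime <= downtime_start


if __name__ == "__main__":
    start = datetime(2019, 1, 9, 10, 0)
    # A 24-hour job ends exactly when the power-off begins, so it still fits.
    print(fits_before_downtime(start, timedelta(hours=24)))
    # A 25-hour job would run into the downtime and would be held in the queue.
    print(fits_before_downtime(start, timedelta(hours=25)))
```

In practice this means that requesting a shorter walltime may allow a job to start before the downtime instead of waiting until the system is back.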

Fram, NIRD and Service Platform scheduled downtime starting on 21st November 2018

UPDATE:

  • 2018-11-23 09:45:
    • We plan to reopen access to Fram today. We will keep you updated.
    • We are currently running benchmarks and tests on the upgraded system.
    • The cooling units are now being stress-tested to pinpoint any outstanding issues.
    • OpenMPI has now been upgraded.
  • 2018-11-22 22:48:
    • The Lustre servers on Fram have been upgraded, and several tests and benchmarks were run to fine-tune parameters.
    • The first step of the CDU maintenance has now been carried out.
  • 2018-11-22 11:33:
    • NIRD Service Platform is up again.
    • Access to NIRD is reopened. Please note that we now have four login nodes and that the SSH fingerprint has changed.
  • 2018-11-21 18:41:
    • The required hardware replacement for NIRD was carried out, and all firmware upgrades are finalized on the storage systems at both the Tromsø and Trondheim sites.
    • The NIRD file systems have been restarted, and we plan to reopen access tomorrow before noon.
    • The firmware on the Fram storage system has now been updated.
    • Several other updates, including the OpenFabrics stack and Lustre, were done in parallel.
  • 2018-11-21 08:00: Maintenance has started.

Dear Fram, NIRD and Service Platform User,

On the 21st and 22nd of November we will have scheduled maintenance on Fram, NIRD and Service Platform.

This will be a comprehensive maintenance on the national HPC and research data infrastructure, ongoing on multiple levels and sites. Due to its complexity and the amount of work involved, some parts of the infrastructure might require the downtime to be extended to the 23rd of November as well.

The work will include, but will not be limited to:

– firmware upgrades on disks, enclosures, chassis, etc.

– operating system upgrades

– queue system upgrade

– file system upgrades

– kernel upgrades

– upgrade of OpenMPI

– upgrades on the OpenFabrics stack

– maintenance on the cooling system units

Our aim is to improve the stability and security of the infrastructure, eliminate bugs and enhance performance, while keeping the downtime as short as possible.

We understand that system unavailability has a big impact on your daily work, and we will therefore try to bring our systems back into operation as soon as possible.

Thank you for your consideration!

Metacenter Operations

NIRD and Service Platform downtime

UPDATE

2018-11-12 11:55: The login node and services are back in production.

2018-11-12 10:20: The disk pool RAID sets had been rebuilt by Saturday, but a set of drives then failed again. A new rebuild was ongoing, and today we had to reset the I/O card and power-cycle the storage. At this point everything on the storage side is up and functional, and the file system is up. We are currently switching geo-replication back and expect to reopen access around 12:00 PM today. We will keep you posted.

2018-11-09 13:59: The firmware has now been applied without any problems. However, we still need to wait for a rebuild to finish; the estimated time remaining is 12 hours. We will open the system for regular use as soon as we can.

2018-11-09 12:45: Most of the rebuilds are complete and we are currently patching the firmware on the disk enclosures. If all goes well, we expect to have NIRD up and functional during the day today. We will keep you updated.

2018-11-08 13:27: The firmware update is running. We have to wait for the rebuild of the broken drives before we can upgrade the enclosures and finish the emergency maintenance. We don’t expect the rebuild to be finished before tomorrow (Friday, November 9th). Hence the system as a whole will not be available before tomorrow.

We are very sorry for any inconvenience this may cause.