[Finished] NIRD Service Platform Maintenance, 22-23 September

Update 2021-09-23: The maintenance is now finished on both sites. Services should be back in production.

Dear users,

We will have scheduled maintenance on the NIRD Service Platform on 22 and 23 September in order to perform upgrades on the clusters.

In addition to project deployments running on the service platform, the following services are affected during the maintenance:

  • NIRD Toolkit
  • NIRD Archive
  • EasyDMP

The service platform consists of two sites, one in Tromsø and the other in Trondheim. This maintenance will be performed on one site at a time, planned as follows:

22 September: Tromsø
Services running on TOS-SP will be offline. NIRD will be accessible from login-trd.nird.sigma2.no.

23 September: Trondheim

Services running on TRD-SP will be offline. NIRD will be accessible from login-tos.nird.sigma2.no.

To check which site your project is running on, log in to the NIRD login nodes (ssh login.nird.sigma2.no) and run the following command:

readlink /projects/<project number>

Make sure to write the project number in all uppercase.
This will then output the full path to the volume, starting with either “trd” for Trondheim or “tos” for Tromsø.

Example:

[user@login0-nird-trd ~]$ readlink /projects/NS9999K
/tos-project3/NS9999K

The output indicates that this project has its primary site in Tromsø (tos-project).
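If you check several projects, the prefix test can be scripted. The following is only a minimal sketch: the helper name nird_site is hypothetical and not an official NIRD tool, and it assumes the volume path starts with "tos" or "trd" after an optional leading slash, as in the example above. It prints the site names used in this announcement (TOS-SP for Tromsø, TRD-SP for Trondheim).

```shell
# Hypothetical helper (not an official NIRD tool): map a volume path,
# as printed by `readlink /projects/<project number>`, to its site.
nird_site() {
    path="${1#/}"                  # drop one leading slash, if present
    case "$path" in
        tos*) echo "TOS-SP" ;;     # Tromsø
        trd*) echo "TRD-SP" ;;     # Trondheim
        *)    echo "unknown" ;;    # prefix not described in this notice
    esac
}

# Example using the path from the announcement above
nird_site "/tos-project3/NS9999K"   # prints TOS-SP
```

On a login node you could call it as: nird_site "$(readlink /projects/NS9999K)"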

If you have any questions, please do not hesitate to contact us.

Maintenance on NIRD, NIRD Toolkit and Fram, 20th April - 24th April

23rd April, 18:50: NIRD and the NIRD Toolkit services are now back in production.

24th April: Fram is back in production.

WARNING: MAINTENANCE IS CURRENTLY ONGOING!

Dear NIRD, NIRD Toolkit, and Fram User,

We will have a four-day scheduled maintenance on NIRD, NIRD Toolkit and Fram starting on the 20th of April, 09:00 AM.

Running HPC jobs and logging in to Saga is NOT affected.
NIRD connectivity and backup of files from Saga ARE affected.

During the maintenance we will:

  • carry out software and firmware updates on all systems

Files stored on NIRD will be unavailable during the maintenance, and so will the services. This will of course also affect the NIRD file systems available on Fram and Saga.

Login services to NIRD, NIRD Toolkit and Fram will be disabled during the maintenance.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

NIRD and NIRD Toolkit scheduled maintenance

Update:

  • 2020-01-23 17:30: Services are now progressively restarted.
  • 2020-01-22 21:49: We have detected file system level corruption, and to avoid data corruption we had to unmount and rescan all the file systems (about 18PB) on NIRD.
    We are currently working on restarting the services on NIRD Toolkit.
  • 2020-01-22 11:11: Software and firmware are now upgraded on NIRD Toolkit.
    Most of the fileset changes have also been carried out. We are currently working on the last bits. We will keep you updated.
  • 2020-01-20 08:58: Maintenance has started. NIRD file systems are unmounted from Fram until maintenance is finished.

Dear NIRD and NIRD Toolkit User,

We will have a three-day scheduled maintenance on NIRD and NIRD Toolkit starting on the 20th of January, 09:00 AM.

During the maintenance we will:

  • carry out software and firmware updates,
  • change geo-locality for some of the projects,
  • replace synchronization mechanisms,
  • expand the storage and quotas, depending on part delivery times from the disk vendor.

Files stored on NIRD will be unavailable during the maintenance, and so will the services. This will of course also affect the NIRD file systems available on Fram.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

NIRD and service platform downtime on Thursday 22nd of August.

Update:

  • 2019-08-26 12:45: NIRD project areas are mounted on Fram login nodes.
  • 2019-08-25 14:15: The Service Platform is up now; you can log in to NIRD and access your files.
    NIRD project areas will be reconnected to Fram tomorrow.
  • 2019-08-23 18:42: The vendor started a forced health check on the system, which is taking more time than expected. We will re-open access to NIRD and the Service Platform as soon as checks and rebuilds are finished.
  • 2019-08-23 08:05: Storage vendor has finished the hardware replacements and installation of new firmware on the storage system.
    We are currently monitoring the storage system together with the vendor.

Dear NIRD and Service Platform users,

We have a planned downtime on the 22nd of August to replace some defective hardware. Systems will be taken offline starting from 08:00 AM.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to take one and a half days.

NIRD projects will still be accessible during the maintenance from login-trd.nird.sigma2.no, but in read-only mode.

We will keep you updated here.

Sorry for the short notice.

Metacenter Operations

NIRD and Service Platform downtime – 26.06.2019

  • 2019-07-02 08:00: All Service Platform services have resumed. Some services may not be working properly and may need to be restarted after the maintenance. If you experience any problem with your service, please do not hesitate to contact us as soon as possible.
  • 2019-06-27 19:58: NIRD file systems are mounted on Fram again.
  • 2019-06-27 19:46: NIRD login nodes have been restarted; you may log in and access your files stored on NIRD.
    The remaining Service Platform services will be started tomorrow morning.
  • 2019-06-27 09:54: We are now restarting and testing the file system.
  • 2019-06-26 22:08: All hardware replacements are now done and the storage system is being monitored for any signs of instability. Restarting the file system is planned for tomorrow morning. We will keep you updated.
  • 2019-06-26 14:05: The vendor is meticulously checking each NIRD storage component and has decided to replace the main controller chassis.
    In the meantime we are applying firmware updates on the Service Platform to improve stability and security.
  • 2019-06-26 08:15: Maintenance has started.

Dear NIRD and Service Platform User,

We have a planned downtime on the 26th of June, Wednesday next week, to replace some defective hardware. Systems will be taken offline starting from 08:00 AM.

An engineer from the storage vendor will assist us from the very first hour.

We expect the maintenance to finish in one day.
We will keep you updated here.

Metacenter Operations

Scheduled downtime – NIRD storage expansion – 2nd of April

Update:

  • 2019-04-03 13:25: NIRD and the Service Platform are back in production.
  • 2019-04-03 10:59: Maintenance work has finished. We are now restarting the file systems and services.
  • 2019-04-03 08:22: Disk expansion and rebalancing is finished. Hardware checks are currently ongoing and should finish in a couple of hours. We will keep you posted.
  • 2019-04-02 09:55: NIRD file systems are unmounted from Fram, and replicated data is available read-only through login-trd.nird.sigma2.no.
  • 2019-04-02 08:06: Maintenance work has started.

Dear NIRD User,

NIRD and the Service Platform will be under maintenance to expand the disk capacity in Tromsø.

The operations for storage expansion and disk pool rebalancing will start on the 2nd of April at 8:00 am CET and will last a maximum of 2 days. During the maintenance, the services running on the NIRD Service Platform and on the NIRD Toolkit will not be available.

During the downtime we plan to make project data mirrored to Trondheim available in read-only mode through a specially built login node. This solution will be tested with real load for the first time during this downtime, so we might encounter some technical difficulties.
To access the remote, mirrored data, please log in to login-trd.nird.sigma2.no.

We apologise for the inconvenience.
Metacenter Operations

NIRD available again

Dear NIRD and NIRD Toolkit User,

After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still being rebuilt and should be finished in a few hours. Until then, you might encounter a slight performance loss.

We will bring the Service Platform back up during the day today.

Thank you for your understanding and patience!
Metacenter Operations

Scheduled downtime on the 12th of February

Update:

  • 2019-02-26 08:57: Fram parts have arrived and were installed yesterday. The vendor will start rebuilding the disk pools on Fram today.
  • 2019-02-22 09:10: Access to NIRD is reopened. We will bring the Service Platform back up during the day today. Some disk pools are still being rebuilt and should be finished in a few hours. Until then, you might encounter a slight performance loss.
  • 2019-02-21 15:40: Most of the disk rebuilds on NIRD are finished. The NIRD file system is started and we will proceed with opening access to NIRD as soon as the remaining network issues on the backbone are sorted out.
    Due to logistics issues, the Fram parts are being held at customs. The vendor sent a second batch of parts through another logistics company, and the ETA is Monday morning.
    We will post a message when access to NIRD is re-opened.
  • 2019-02-19 14:10: Disk rebuilds on NIRD have reached 40%. Current ETA for NIRD is Thursday morning.
    We are still waiting for the Fram parts to arrive in Tromsø.
  • 2019-02-18 16:36: Some disk pools must be rebuilt once again for NIRD, thus delaying the opening once more. We are terribly sorry for this. We will continue updating the log as new information becomes available.
  • 2019-02-18 10:10: NIRD storage is now stabilized and the vendor will make a new attempt to take the system back online today. At this stage it is still uncertain when Fram can be put back into production.
  • 2019-02-15 11:18: We are still experiencing problems with the storage system on Fram. Disks began to mass-fail once again after the system seemed to be stable during the night. We are depending on the vendor to resolve these issues and we are working closely with them.
    Given the new instability, we cannot give an estimate for when the system will be ready for general use again. This is an unfortunate situation and we understand the impact on you, so we are trying all possible solutions to keep your data safe and bring up the system as soon as possible. The OpsLog will be updated with new information when the status of the situation changes.
  • 2019-02-14 13:17: Due to missing parts and the size of the storage, disk recovery is progressing slowly, at approximately 50% reduced performance. Current ETAs are:
    • Fram: 15.02.2019
    • NIRD: 19.02.2019
    • Service Platform: 19.02.2019
  • 2019-02-13 19:07: Communication with the missing storage enclosures was re-established and disk pools are rebuilding at this time. Unfortunately we cannot reopen the machines until the disk pools are stabilized. We will have a new round of checks and risk analysis tomorrow morning. We will keep you updated here.
  • 2019-02-13 11:33: Some of the parts have arrived at the datacenter and we are working with the vendor on replacing them and patching the firmware on Fram. More details to follow as we know more.
  • 2019-02-12 15:38: The NIRD Tromsø and Fram storage systems each have one failed disk enclosure. We are waiting for replacement parts to arrive. After replacement we will have to rebuild disk pools before re-opening the machines for production. Current estimate is tomorrow evening. We will keep you updated.
  • 2019-02-12 12:36: Firmware upgrade on NIRD is finished. We are proceeding to restart NIRD services. We will keep you posted.
  • 2019-02-12 08:17: Maintenance has started.
  • 2019-02-11 13:20: Because the disk problems accelerated during the weekend, we have now changed the maintenance stop reservation so that no new jobs will start until the maintenance is done. Jobs that are already running will not be affected. This has been done to reduce the risk of data loss.

We need to have a scheduled downtime on relatively short notice in order to upgrade the firmware on both the Fram and NIRD (including NIRD Toolkit) storage systems.
This is a critical and mandatory update which will increase stability, performance and reliability of our systems.

The downtime is expected to last no more than a working day.

Fram jobs that cannot finish by the 12th of February are queued and will not start until the maintenance is finished.

Thank you for your understanding!
Metacenter Operations