Betzy access closed, preparing for production

UPDATE:

  • 25.09.2020: We are temporarily reopening the access over the weekend in order to allow further testing on the machine.
    Further work is expected to be done by the vendor sometime next week and as a consequence, jobs will be terminated again and access closed while maintenance will be ongoing.

Dear Betzy pilots,

We are pleased to announce that despite logistics challenges caused by Covid-19, most of the outstanding issues were sorted out. This unusual situation requested a more dynamic approach from everyone involved, while putting pressure on the communication due to uncertainties and quick situation changes. Because of this, setting and advertising a production date proved to be difficult.

We can now start aiming for setting Betzy into production in the beginning of October. Before we can conclude, and proceed with the preparations, we need to re-run several comprehensive tests.

Therefore, we will have to stop all jobs and access to Betzy starting from tomorrow, 17 September 2020 10AM. Access to Betzy will be re-established as soon as all the tests are effectuated. Please be prepared for a more extensive maintenance this time, which might require up to two and half weeks.

The file system on Betzy is not going to be reformatted. That is, your data will not be removed intentionally. However, we can not guarantee data integrity until backups are taken and the machine is placed into production. Therefore, we strongly advise you to take a backup of your important data for the sake of security.

Apologies for the short notice and the inconvenience this is causing to you.

Best regards,

Lorand Szentannai, on behalf of the preparations team

Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room,
due to this, we have to reduce the load on the cluster by reserving the entire cluster, which means no job will run.
We are sorry for the inconvenience, and we will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again, we are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes are crashed, which means it’s possible that  some of the jobs get killed

Update 2019.07.03 10:55:  To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state while 197 nodes are in downstate, we will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55:  Fram is expected to be back to full its full capacity on Wednesday, 2019.07.10.

NIRD and Service Platform downtime – 26.06.2019

  • 2019-07-02 08:00: All Service Platform services resumed. It might be that some of the services are not properly working and need to be restarted after the maintenance. If you experience any problem with your service, please do not hesitate to contact us asap.
  • 2019-06-27 19:58: NIRD filesystems are mounted back to Fram.
  • 2019-06-27 19:46: NIRD login nodes are started back now, you may login and access your files stored on NIRD.
    Remaining Service Platform services will be started tomorrow morning.
  • 2019-06-27 09:54: We are starting back and testing the file system now.
  • 2019-06-26 22:08: All hardware replacements are done now and the storage system is monitored for any signs of instability. Starting back of the filesystem is planned for tomorrow morning. We will keep you updated.
  • 2019-06-26 14:05: Vendor is meticulously checking each NIRD storage component and decided to replace main controller chassis.
    In the mean time we are applying firmware updates on the Service Platform to improve stability and security.
  • 2019-06-26 08:15: Maintenance has started.

Dear NIRD and Service Platform User,

We have a planned downtime on the 26th of June, Wednesday next week, to replace some defective hardware. Systems will taken offline starting from 08:00AM.

Engineer from storage vendor will assist us from the very first hour.

We expect the maintenance to finish in one day.
Will keep you updated here.

Metacenter Operations

NIRD available again

Dear NIRD and NIRD Toolkit User,

After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still under rebuild and should be finished in few hours. Until then, you might encounter slight performance loss.

We will proceed in taking up the Service Platform during today.

Thank you for your understanding and patience!
Metacenter Operations