NIRD and Service Platform downtime

UPDATE

2018-11-12 11:55: The login node and services are back in production.

2018-11-12 10:20: The disk pool RAID sets were rebuilding until Saturday, when a set of drives failed once again. A new rebuild was under way, and today we had to reset an I/O card and power cycle the storage. The storage side is now up and functional, and so is the file system. We are currently switching geo-replication back and expect to reopen access around 12:00 PM today. We will keep you posted.

2018-11-09 13:59: The firmware has now been applied without any problems. However, we still need to wait for a rebuild to finish; the estimated time remaining is 12 hours. We will open the system for regular use as soon as we can.

2018-11-09 12:45: Most of the rebuilds are complete, and we are currently patching the firmware on the disk enclosures. If all goes well, we expect to have NIRD up and functional during the day today. We will keep you updated.

2018-11-08 13:27: The firmware update is running. We have to wait for the rebuild of the broken drives before we can upgrade the enclosures and finish the emergency maintenance. We don’t expect the rebuild to be finished before tomorrow (Friday, November 9th). Hence the system as a whole will not be available before tomorrow.

We are very sorry for any inconvenience this may cause.

Fram $HOME migrated to /cluster file system

Dear Fram User,

Some of you might have experienced sporadic I/O hangs on Fram recently.
In many cases the I/O hangs were caused by an overloaded RPC queue on the NFS-mounted /nird/home file system. This had a negative performance impact on the compute nodes and in some cases led to job crashes.

Therefore we have decided to migrate all Fram users’ $HOME directories from /nird/home/$USER to /cluster/home/$USER, starting with the next scheduled maintenance. Preparations have been made, and some accounts were already synchronized over during the past few weeks.

Because we suddenly lost a large number of disks on NIRD today, we have decided, in order to avoid data loss, to stop all user I/O on NIRD and migrate the remaining user accounts over to Fram.

Starting today, 2018-11-07, /nird/home is unmounted from Fram but is still available on NIRD. Until the next scheduled maintenance, we have created a symbolic link from /nird/home to /cluster/home so that any scripts that refer to the old path can be adjusted.
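Scripts that hard-code the old location can be updated before the symbolic link is removed. The following is only a minimal sketch of one way to do that, not an official migration tool: the helper name rewrite_home_paths is hypothetical, and the script only reports which files would change, so that each one can be reviewed by hand.

```python
import re
import sys

# Old and new $HOME prefixes, as given in the announcement.
OLD_HOME = "/nird/home"
NEW_HOME = "/cluster/home"

# Match the old prefix only as a complete path component, so that
# unrelated paths such as /nird/homework or /data/nird/home are left alone.
_PATTERN = re.compile(r"(?<![\w.-])" + re.escape(OLD_HOME) + r"(?=/|$|[^\w.-])")


def rewrite_home_paths(text: str) -> str:
    """Return text with /nird/home prefixes replaced by /cluster/home."""
    return _PATTERN.sub(NEW_HOME, text)


if __name__ == "__main__":
    # Dry run: list the files given on the command line that would change.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            original = fh.read()
        if rewrite_home_paths(original) != original:
            print(f"{path}: contains references to {OLD_HOME}")
```

Paths that are meant to stay on NIRD should be excluded from any such rewrite.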

As soon as the NIRD disk issues are resolved, nightly backups will be taken from Fram to /nird/home/$USER/backup/fram.

This step makes Fram less dependent on NIRD. From this point on, we will be able to schedule maintenance on NIRD without affecting running jobs.

Thank you for your understanding!
Metacenter Operations

Emergency Stop of Fram, NIRD and the Service Platform

Update:

  • 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
  • 2018-11-07 14:53: User home directories have been migrated over to /cluster/home and Fram is starting up again. We will soon reopen access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.

 

Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data.  This means stopping all jobs and user processes, and logging users out of the systems.

We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today.  (Note that the NIRD project areas will _not_ be available until NIRD is up again.)

We will update this post with more information when we know more.

VASP 5.4.4 with implicit solvation model

VASP 5.4.4 (module VASP/5.4.4-intel-2018a) has been updated with the transition state tools (vTST), the implicit solvation model (VASPsol), and occupation matrix control. Each is available in three variants of the code: unmodified, abfix, and noshear. The binary names should be self-explanatory; please look in bin for all versions.
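To see which binaries the module provides, one option is to scan the search path after loading it. The sketch below assumes that loading VASP/5.4.4-intel-2018a adds the module's bin directory to PATH, which is typical for environment modules but is not confirmed here, and that the binary names contain "vasp". The function name find_executables is hypothetical.

```python
import os
from typing import List, Optional


def find_executables(substring: str, path: Optional[str] = None) -> List[str]:
    """List executables on a PATH-style string whose names contain substring."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = set()
    for directory in path.split(os.pathsep):
        if not os.path.isdir(directory):
            continue
        try:
            names = os.listdir(directory)
        except OSError:
            # Skip directories that cannot be read.
            continue
        for name in names:
            full = os.path.join(directory, name)
            if (substring in name.lower()
                    and os.path.isfile(full)
                    and os.access(full, os.X_OK)):
                found.add(full)
    return sorted(found)


if __name__ == "__main__":
    # Assumption: the binary names contain "vasp"; adjust if they do not.
    for exe in find_executables("vasp"):
        print(exe)
```

Run it in the same shell session in which the module was loaded, since the module only changes the environment of that session.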

New NIRD login node

We have put in place a second NIRD login node.

This node is accessible at

login1.nird.sigma2.no

Report problems to support@metacenter.no.
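Before reporting a problem, it can help to check whether the node accepts connections at all. The sketch below assumes that the login node serves SSH on the standard port 22, which is not stated above; the function name ssh_port_open is hypothetical. It tests TCP reachability only and does not attempt to log in.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    host = "login1.nird.sigma2.no"
    state = "reachable" if ssh_port_open(host) else "not reachable"
    print(f"{host}: SSH port {state}")
```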