- 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
- 2018-11-07 14:53: User home directories have been migrated to /cluster/home and Fram is starting up again. We will soon reopen access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.
Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data. This means stopping all jobs and user processes, and logging users out of the systems.
We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today. (Note that the NIRD project areas will _not_ be available until NIRD is up again.)
We will update this post with more information when we know more.
- 11:15 Cooling distribution units are functional again and the compute nodes have been started again.
- 10:12 Cooling units failed again and the compute nodes were automatically switched off. We are looking into the problem.
- 09:22 Cooling is functional again and the Fram compute nodes are being started. The machine should be fully operational shortly.
We had trouble with one of the cooling units in the Fram server room today around 06:30.
Safety mechanisms switched off most of the Fram compute nodes.
Thank you for your understanding!
Update 2018-08-02 09:35 Most of the compute nodes are up and we are working to fix the remaining few. Jobs are running again.
Compute nodes went down due to a power spike on 1 August around 7 PM. We are restarting the system and will update this post as soon as it is functional again.
Update 2018-06-27 10:57 /cluster file system is up again on Fram.
The shared file system on Fram (/cluster) is currently down. We are investigating and trying to get it up again as soon as possible. We will update this post when we know more.
550 nodes went down at 00:00 Monday morning. We are investigating the issue and will bring the nodes back online as soon as possible.
Some of the Lustre object storage servers crashed during the night, making parts of the /cluster file system inaccessible. We are working on the problem and will keep you updated.
Update 15:50: NIRD login node is up again and user access reopened.
We have to urgently reboot the NIRD login node.
This post will be updated when login to NIRD is possible again.