The node mentioned above has to be rebooted because it is unresponsive. We are sorry for any inconvenience.
The login-1-1 node hung and had to be rebooted. It is up and running again now. Have a nice weekend!
Fram login-1-2 was rebooted around 15:10 today due to a Lustre filesystem glitch.
- 2019-08-26 12:45: NIRD project areas are mounted on Fram login nodes.
- 2019-08-25 14:15: The Service Platform is up again; you can now log in to NIRD and access your files.
NIRD project areas will be reconnected to Fram tomorrow.
- 2019-08-23 18:42: The vendor started a forced health check on the system, which is taking more time than expected. We will re-open access to NIRD and the Service Platform as soon as the checks and rebuilds are finished.
- 2019-08-23 08:05: The storage vendor has finished the hardware replacements and the installation of new firmware on the storage system.
We are currently monitoring the storage system together with the vendor.
Dear NIRD and Service Platform users,
We have a planned downtime on the 22nd of August to replace some defective hardware. Systems will be taken offline starting at 08:00.
An engineer from the storage vendor will assist us from the very first hour.
We expect the maintenance to finish in one and a half days.
NIRD projects will still be accessible during the maintenance
from login-trd.nird.sigma2.no but in read-only mode.
We will keep you updated here.
Sorry for the short notice.
Dear Fram cluster users:
We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.
Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.
Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.
Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in the maintenance state, while 197 nodes are in the down state. We will monitor the power consumption in the machine room and release more nodes accordingly.
Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.
login-1-1 on Fram became unresponsive and we had to reboot the node.
The login node should be operational again shortly.
Dear NIRD and NIRD Toolkit Users,
After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still rebuilding and should be finished in a few hours. Until then, you might encounter a slight performance loss.
We will proceed with bringing up the Service Platform during the day today.
Thank you for your understanding and patience!
The InfiniBand error was due to a controller module with a bad connection. This has been corrected.
The queueing system is back online. Also: 19 additional nodes have been recovered.
Three jobs were lost. We apologize for the inconvenience.
We are currently experiencing InfiniBand problems on Vilje. The queueing system is unavailable until further notice.
Some jobs may have been lost.
Update 16-11-18 09:00: NIRD and the Service Platform are up again.
Due to disk failures on NIRD, we have to shut down NIRD and the Service Platform immediately to avoid losing user data.
Sorry for the inconvenience.