The Vilje queueing system was unavailable from Sunday 5th 15:30 until Monday 6th 08:30 due to a faulty InfiniBand cable.
We apologize for the inconvenience.
The NIRD storage system crashed and was unavailable for a short period of time.
Due to this crash, users logged in to NIRD and Fram experienced problems.
The problem is resolved and the NIRD storage system is online now.
Please contact us if you still encounter problems.
Note: The export of NIRD to Fram is currently not working.
2019-11-13-16:15 Fram is up and running again.
One of the cooling units stopped, causing the other to stop as well, and all compute nodes went down.
Dear Fram User,
Fram is currently down, likely due to issues with the cooling distribution unit.
We are investigating the issue and working to bring Fram back into production.
Apologies for the inconvenience!
Metacenter Operations
Dear Fram cluster users:
login-1-2 will be reinstalled, and will be removed from DNS temporarily. It will be added back to DNS when reinstallation is over.
Update 15:12: login-1-2 has been reinstalled and added back to the DNS configuration.
The node mentioned above has to be rebooted due to its unresponsiveness. We are sorry for any inconvenience.
The login-1-1 node hung and had to be rebooted. It is up and running again now. Have a nice weekend!
Fram login-1-2 was rebooted at around 15:10 today due to a Lustre filesystem glitch.
Update:
Dear NIRD and Service Platform users,
We have a planned downtime on the 22nd of August to replace some defective hardware. Systems will be taken offline starting at 08:00.
An engineer from the storage vendor will assist us from the very first hour.
We expect the maintenance to take one and a half days.
NIRD projects will still be accessible during the maintenance
from login-trd.nird.sigma2.no, but in read-only mode.
We will keep you updated here.
Sorry for the short notice.
Metacenter Operations
Dear Fram cluster users:
We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.
Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.
Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.
Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state while 197 nodes are in down state. We will monitor the power consumption in the machine room and release more nodes accordingly.
Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.
login-1-1 on Fram became unresponsive and we had to reboot the node.
The login node should be operational again shortly.
Metacenter Operations