One of the file servers had a problem that made some of the folders under /cluster unavailable.
The problem is resolved now.
The NIRD storage system crashed and was unavailable for a short period of time.
Due to this crash, users logged in to NIRD and Fram experienced problems.
The problem is resolved and the NIRD storage system is online again.
Please contact us if you still encounter problems.
Note: The export of NIRD to FRAM is currently not working.
Dear Vilje cluster users:
The /work filesystem on Vilje is partially down. We are working on it.
At the moment it is very difficult to determine when we can bring the /work filesystem fully back online.
We will keep you posted.
Dear Fram cluster users:
login-1-2 will be reinstalled and temporarily removed from DNS. It will be added back to DNS when the reinstallation is complete.
Update 15:12: login-1-2 has been reinstalled and added back to the DNS configuration.
Fram login-1-2 was rebooted around 15:10 today due to a Lustre filesystem glitch.
Dear Fram cluster users:
We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load on the cluster by reserving the entire cluster, which means no jobs will run.
We are sorry for the inconvenience, and we will keep you updated.
Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.
Update 14:55: Some of the nodes have crashed, so some jobs may have been killed.
Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state, while 197 nodes are in down state. We will monitor the power consumption in the machine room and release more nodes accordingly.
Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.
Dear Fram User,
The Fram metadata server has crashed once again, which has likely affected running jobs.
We are in contact with the storage vendor about patching the file system.
Apologies for the inconvenience!
Missing mounts for Project NS2345K on NIRD
The NFS server that exports /tos-project1/NS2345K/FRAM to NIRD/FRAM crashed yesterday around 16:00. We have recovered the NFS server, and /tos-project1/NS2345K/FRAM is exported again. We are currently working on mounting the filesystem in the login containers and are investigating the cause in the meantime. We are sorry for the inconvenience.
The main MDS server for the Lustre filesystem crashed between 14:00 and 14:30, and the secondary MDS server took over and restored the filesystem around 14:40. Some of the jobs running on Fram might have been affected. We are investigating the root cause of the incident.
Early on Saturday we had trouble with the MDS. Some of the jobs running on Fram might have been affected. We are investigating the root cause of the incident.