2019-03-19 17:55 The secondary MDS server for the Lustre filesystem crashed between 17:15 and 17:45, and the primary MDS server took over and restored the filesystem around 17:45. Some of the jobs running on Fram might be affected. We are investigating the root cause of the incident.
The main MDS server for the Lustre filesystem crashed between 14:00 and 14:30, and the secondary MDS server took over and restored the filesystem around 14:40. Some of the jobs running on Fram might be affected. We are investigating the root cause of the incident.
Early Saturday we had trouble with the MDS; some of the jobs running on Fram might be affected. We are investigating the root cause of the incident.
The servers running the Service Platform and the NIRD login nodes have some issues with the remote filesystems.
The problem has already been identified and is being taken care of, but you might experience short hiccups until it is fixed on all the affected nodes.
You will have to log in again to the NIRD login nodes.
We expect to be ready within one hour at most.
Mar 4 14:45 CET 2019 One of the login nodes has rebooted. This was caused by a software bug in the Intel suite. The problem was fixed and should not occur again.
2019-02-28: 21:30 Finally Fram is up and running again and ready for general use. We are very sorry for the long downtime. Today the storage vendor has been onsite and fixed the last parts.
Dear Fram users,
after more than 16 days of downtime, losing more than 12 million CPU hours, we were finally ready to return the system to service tonight.
This downtime has been a very unpleasant experience for us, and we fully understand that it has been annoying and distressing for our users who depend on the service.
The main reason for the downtime has been severe problems with the global file system on Fram, forcing us to halt the system and escalate towards the file system vendor until their engineers were able to analyse and repair the different issues experienced.
Sincerely, Jørn Amundsen, UNINETT Sigma2 AS
Dear NIRD and NIRD Toolkit users,
After a prolonged downtime due to system failures beyond our control and field of responsibility, access to NIRD is finally reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still being rebuilt, which should be finished in a few hours. Until then, you might encounter a slight performance loss.
We will proceed with bringing up the Service Platform during the day today.
Thank you for your understanding and patience!
2019-02-13 10:00 Stallo is up and running and available for general use again.