We are going to perform some routine maintenance on one of the file system controllers of Fram. This should have no significant impact on production, but users might experience slightly degraded Lustre (file system) performance.
The operation is scheduled for today – 11 a.m. …
Update 8.07: There were also performance issues with the login nodes. Both this and the controller maintenance are now finished.
We are experiencing some trouble with the Fram machine. Yesterday morning (Sunday 04.07.2021), many compute nodes went down unexpectedly. We are investigating the issue.
Update 05.07.2021 – 10:54: The shutdown was caused by a power outage in the data center. We are bringing all nodes back up and monitoring their behavior.
Apologies for the inconvenience this may have caused!
Dear Saga and Fram users,
The VNC service is not working smoothly at the moment and we are investigating the issue.
We are sorry for the trouble this might cause.
login-1-3 on Fram had runaway processes that consumed all memory and swap, so we unfortunately had to reboot it.
Update: The file system servers have now been fixed, and we are back online again. Thank you for your patience.
We have an ongoing performance issue with the Fram file system. We need to shut down the file servers to fix this, and therefore need three hours of downtime:
On Wednesday 20th January, between 12:00 and 15:00, Fram will be unavailable.
Dear Fram users,
We have a problem with the Fram compute nodes: about 870 nodes are down for an unknown reason. We are working on the issue and will keep you updated.
Update 2020-12-22, 20:05: Most of the compute nodes have now been brought back online. There are still a few nodes that need more checking before being made available for jobs.
Update 2020-12-22, 18:04: The cooling system has been stable for the last hour after making some adjustments together with the vendor. We are slowly bringing up the nodes.
Update 2020-12-22, 16:01: In order to keep the cooling as stable as possible, we have decided to take down all high memory nodes. This way we can keep some of the normal compute nodes up for the time being. We are also working together with the vendor to make adjustments on the cooling system to ensure continued stability.
We are very sorry about the inconvenience.
Update 2020-12-22, 13:41: We have identified the cooling system as the cause and are working on mitigating the issue. Unfortunately, most of the compute nodes must remain down while we do so.
Update 2020-12-24 10:30: The compute nodes were shut down again due to electrical problems in the machine room. According to the machine room service department, the problem has been resolved, and we are working to bring all nodes back up.
Update 2020-12-24 12:10: Most of the compute nodes on Fram are back online.
On the 1st of December 2020, between 07:45 and 16:00, there will be a power outage affecting the Fram compute nodes due to scheduled maintenance on the UPS and backup power equipment.
As of today, Wednesday 4th November at 08:00, Fram is down for maintenance. We will do the same exercise as on NIRD-TOS, namely replace all internal cables on the storage system.
17:20 NIRD-TOS and services are now up.
We will have downtime the following week to try again to replace all internal cables in NIRD-TOS and Fram storage systems.
NIRD-TOS (including the toolkit) will be down from 08:00 Monday 2nd November until 12:00 Wednesday 4th November.
Fram will be down from 08:00 Wednesday 4th November until 12:00 Friday 6th November.
There is still a chance that the downtime will not happen; proper notification will be given in the opslog. Unfortunately, the current situation with Covid-19 makes detailed planning difficult.
We apologize for any inconvenience.
The downtime for NIRD-TOS from 26th October until 29th October is cancelled, as is the downtime for Fram from 28th October until 29th October.
New dates for the downtime will be announced Monday 26th or Tuesday 27th.
During the downtime we will replace all internal cables between the disk controllers and disk enclosures. The firmware upgrade two weeks ago helped a lot, but we are still seeing communication errors, so the decision is to remove all cables and replace them.