Betzy is experiencing issues with srun and InfiniBand that intermittently affect users. The srun problem first appeared last week; the InfiniBand problem has been present longer.
Symptoms of the srun problem are error messages related to send/recv operations. If you see such error messages, you are probably affected by the problem. A possible workaround is to keep rerunning the srun job until it succeeds.
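The rerun workaround for the srun problem can be automated. The following is only a minimal sketch, not an official recommendation: the function name `run_with_retries`, the attempt limit, the delay, and the placeholder program `./my_program` are all assumptions made for illustration. It retries on any non-zero exit status, not only on the send/recv errors, so it is only suitable for job steps that are safe to run more than once.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, delay_seconds=30.0, runner=subprocess.run):
    """Run cmd until it exits with status 0 or the attempts run out.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed. All names and defaults here are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


# Hypothetical usage; "./my_program" is a placeholder for your own executable.
# run_with_retries(["srun", "./my_program"])
```

A bounded number of attempts is used so that a job that fails for an unrelated reason does not loop forever and consume its whole allocation.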
Symptoms of the InfiniBand problem are messages of the type "Transport retry count exceeded". At the moment there is no workaround for this problem.
The expert team is working with the hardware vendors to solve the problems. We are very sorry for the inconvenience, and can assure you that the team is working hard on this.
Access to Stallo has now been closed temporarily due to power testing tomorrow morning, 07:00-09:00, in the building that houses Stallo. We will try to put it back into service as soon as possible after the tests have been run.
Note that the probable end of life for the old monster is Jan. 4th, 2021.
We have a problem with the Fram compute nodes: about 870 nodes are down for an unknown reason. We are working on the issue and will keep you updated.
Update 2020-12-22, 20:05: Most of the compute nodes have now been brought back online. There are still a few nodes that need more checking before being made available for jobs.
Update 2020-12-22, 18:04: The cooling system has been stable for the last hour after making some adjustments together with the vendor. We are slowly bringing up the nodes.
Update 2020-12-22, 16:01: In order to keep the cooling as stable as possible, we have decided to take down all high memory nodes. This way we can keep some of the normal compute nodes up for the time being. We are also working together with the vendor to make adjustments on the cooling system to ensure continued stability.
We are very sorry about the inconvenience.
Update 2020-12-22, 13:41: We have identified the cause to be the cooling system and are working on mitigating the issues. Most of the compute nodes must remain down while doing so, unfortunately.
Update 2020-12-24 10:30: The compute nodes shut down again due to electrical problems in the machine room. According to the machine room service department, the problem has been resolved, and we are working to bring all nodes back up.
Update 2020-12-24 12:10: Most of the compute nodes on Fram are back online.
The Stallo downtime on Dec 16th is cancelled. Instead, there will be downtime on Dec 29th due to work on the power infrastructure. So you can run your calculations until after Christmas without us getting in the way. Regard this as a Xmas present from all of us to all of you.
We need to move the oldest limbs of the faltering monster. In case of trouble, we have chosen to close the machine to users during the maintenance window, since we may have to power down the switches on the internal network. Downtime is scheduled for 08:00-15:00.
UPDATE – 27.11/16:20 – We have opened the machine for you guys, but there may be some instabilities on the global file system, as we have also lost one object storage server. The issue is being investigated and we are waiting for some spare parts.
We have some major problems with the Lustre file system at the moment. One of the main storage coolers is down. We are kicking out all users now and hope to get the machine back to an operational state ASAP.