A sudden pressure spike in the cooling system at around 13:00 took one of the cooling units down. Restarting it affected the other unit as well.
This triggered a safety stop for some of the compute nodes, which caused some of the running jobs to crash prematurely.
Affected jobs have been re-queued.
We apologize for the inconvenience this has caused.
Fram has been in production for half a year now, and we have gathered enough data to identify possible improvements to the defaults. One such improvement concerns how jobs are placed with regard to the island topology on Fram. Because of the way Fram is built, the network bandwidth within an island is far better than between islands. For certain types of jobs spanning many compute nodes, being spread over multiple islands can degrade performance.
To limit this effect, we have now changed the default setup so that each job will run within one island, provided that this does not delay the job too much, as described here:
Note that this may lead to longer waiting times in the queue, in particular for larger jobs. If your job does not depend on high network throughput, the above-mentioned document also describes how to override the new default.
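As an illustration only, the sketch below shows how a placement preference of this kind can be expressed in a job script using Slurm's standard `--switches=<count>[@<max-time>]` option for `sbatch`. It is an assumption that Fram's islands correspond to switches in Slurm's topology configuration and that the new default is implemented through this option; the document referred to above remains the authoritative description. The job name, account, node count, and program name are hypothetical placeholders.

```shell
#!/bin/bash
# Hypothetical job script: all names and values below are placeholders.
#SBATCH --job-name=island-example
#SBATCH --account=nnXXXXk
#SBATCH --nodes=32
#SBATCH --time=02:00:00
# Assumption: one island corresponds to one switch in Slurm's topology.
# Ask for at most 1 switch, but wait no longer than 60 minutes for such
# an allocation before accepting a placement that spans more switches.
#SBATCH --switches=1@60

# Print the allocated nodes (falls back to a note when run outside Slurm).
echo "Allocated nodes: ${SLURM_JOB_NODELIST:-none (not running under Slurm)}"

# Hypothetical application; replace with your own program.
# srun ./my_program
```

Under the same assumption, a job that does not depend on high network throughput could request a larger switch count in the same directive, which relaxes the placement constraint and may shorten the time spent in the queue.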
The SLURM queue system hung on Fram.
The problem has been resolved, and the queue system has been functional again since approximately 09:55.