Dear Fram users, we need to postpone the maintenance of the Fram cooling system until the second of February due to unexpected internal complications at the service vendor. We apologize for the inconvenience this may cause you.
While the maintenance is ongoing there will be reduced capacity on available compute nodes.
We have now updated the job statistics that are printed at the end of the slurm-NNN.out files. The output has been updated on Saga and Fram, and will be updated on Betzy shortly.
We hope the new output is more readable, understandable, and usable.
It is possible to get the same output on the terminal with the command jobstats -j <jobid> (note: this only works for jobs that have finished). The jobstats command also has a --verbose switch which will produce more detailed output, hints, and comments (this will be expanded over time).
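For example, on a login node (replace <jobid> with the numeric ID of a finished job):

```shell
# Summary statistics for a finished job,
# the same output that appears at the end of slurm-NNN.out:
jobstats -j <jobid>

# More detailed output, hints, and comments:
jobstats --verbose -j <jobid>
```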
We have tested the changes on all clusters, but errors can happen, so if you spot any errors and/or missing output in your jobs, please let us know.
X2Go will replace the current VNC solution on Fram. X2Go is already available on login-2 and login-3, while TigerVNC remains available on login-1 until 01.03.23, when it will be shut down.
All VNC users must therefore switch to X2Go before this date. Documentation can be found here: https://documentation.sigma2.no/getting_started/remote-desktop.html
[UPDATE, 2022-12-19 12:50] The change has been implemented on Saga too now.
We have made a small change to the configuration of the queue system on Betzy and Fram. With this change, if one of the processes started by “srun” in a job fails (for instance due to a segmentation fault), “srun” will now kill the remaining processes of that job step (just like “mpirun” does). Previously, the remaining processes were left running, possibly until the job timed out. This should resolve many of the cases where failing jobs are not terminated, but continue until they time out.
The same change will be applied to Saga in about two weeks.
The new behaviour is especially useful when combined with having “set -e” or “set -o errexit” earlier in the job script, because then Slurm will terminate the whole job when an “srun” exits due to one of its processes failing.
If one wants the old behaviour of “srun”, one can override the configuration by using “srun --kill-on-bad-exit=0” instead of just “srun”.
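As a sketch, a minimal job script combining the new “srun” behaviour with errexit might look like this (the account, resource values, and program name are placeholders, not real settings):

```shell
#!/bin/bash
#SBATCH --account=nnXXXXk   # placeholder: your project account
#SBATCH --job-name=demo     # placeholder job name
#SBATCH --ntasks=4          # placeholder: number of MPI ranks
#SBATCH --time=00:10:00     # placeholder time limit

# Abort the whole job script as soon as any command exits non-zero:
set -o errexit

# With the new configuration, if one process crashes, srun kills the
# remaining processes of the step and exits non-zero; errexit then
# terminates the job instead of letting it run until the time limit.
srun ./my_mpi_program       # placeholder program

# To keep the old behaviour (leave surviving processes running), use:
#   srun --kill-on-bad-exit=0 ./my_mpi_program
```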
Update, 2022-12-07 12:45: The problem with the /cluster storage system has been identified and fixed, and the file system should work as normal again.
Update, 2022-12-07 09:30: Compute nodes and login nodes are up, and Fram is running jobs, but we are experiencing problems with the /cluster storage system. This shows up as occasional hangs (up to minutes) and/or Input/Output errors. It appears to affect all nodes (login or compute).
Update 2022-12-06 11:32: We have found the cause and restored most services. Still looking into some potential file system issues.
[2022-12-06 09:30] Fram is currently down. We are still investigating, but currently it looks like a gateway router has gone down. We will update this post as we know more.
Most of the NRIS staff is busy with an NRIS all-hands meeting this week, so we will have less capacity to handle support issues. But we will try our best to answer questions.
Because the file system used to store Fram and Betzy project area backups on Saga is full, we have had to stop any further backup of Fram and Betzy project areas. (I.e., /cluster/projects/nnXXXXk on Fram and Betzy).
Already backed up files are still stored, but no new or changed files will be backed up.
The backup will be re-enabled when enough data has been migrated from Saga to the new NIRD storage.
There is a problem with the cooling system on Fram, which leads to many compute nodes automatically shutting down. We are investigating and working on the problem.
Update 19.10.22 – 09:29
Fram is back in production. The problems we experienced yesterday were caused by a small power outage in Tromsø.
Sorry for the inconvenience this may have caused.
– Infrastructure Team
We are going to conduct a file system check on the Fram file system. This may lead to degraded performance while the scan is ongoing.
Some users may find that their access to project directories on the HPC systems has been revoked.
We are aware of this issue and know its root cause. We are currently working on restoring access for the affected projects.
We are sorry about the inconvenience.