Updated job statistics in slurm-NNN.out

[2023-01-10 15:40]

We have now updated the job statistics that are printed at the end of the slurm-NNN.out files. The new output is in place on Saga and Fram, and will be rolled out on Betzy shortly.

We hope the new output is easier to read, understand and use.

It is possible to get the same output in the terminal with the command jobstats -j <jobid> (note: this only works for jobs that have finished). The jobstats command also has a --verbose switch which produces more detailed output, hints and comments (this will be expanded over time).
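As a quick sketch of the terminal usage described above (the job ID 123456 is just a placeholder; substitute one of your own finished jobs):

```shell
# Print the end-of-job statistics for a finished job:
jobstats -j 123456

# Same, but with more detailed output, hints and comments:
jobstats -j 123456 --verbose
```

Both commands are run on a login node of the cluster where the job ran.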

We have tested the changes on all clusters, but errors can happen, so if you spot any errors and/or missing output in your jobs, please let us know.

Small config change in queue system

[UPDATE, 2022-12-19 12:50] The change has now been implemented on Saga as well.

2022-12-07 12:50

We have made a small change to the configuration of the queue system on Betzy and Fram. The effect is that if one of the processes started by “srun” in a job fails (for instance due to a segmentation fault), “srun” will now kill the remaining processes of that job step (just like “mpirun” does). Previously, the remaining processes were left running, possibly until the job timed out. This should resolve many of the cases where failing jobs did not terminate, but kept running until they timed out.

The same change will be applied to Saga in about two weeks.

The new behaviour is especially useful when combined with having “set -e” or “set -o errexit” earlier in the job script, because then Slurm will terminate the whole job when an “srun” exits due to one of its processes failing.

If one wants the old behaviour of “srun”, one can override the configuration by using “srun --kill-on-bad-exit=0” instead of just “srun”.

[SOLVED] Fram is down

Update, 2022-12-07 12:45: The problem with the /cluster storage system has been identified and fixed, and the file system should work as normal again.

Update, 2022-12-07 09:30: Compute nodes and login nodes are up, and Fram is running jobs, but we are experiencing problems with the /cluster storage system. This shows up as occasional hangs (up to minutes) and/or Input/Output errors. It appears to affect all nodes (login or compute).

Update 2022-12-06 11:32: We have found the cause and restored most services. Still looking into some potential file system issues.

[2022-12-06 09:30] Fram is currently down. We are still investigating, but currently it looks like a gateway router has gone down. We will update this post as we know more.

Backup of Fram and Betzy project areas stopped


Because the file system on Saga that stores backups of the Fram and Betzy project areas is full, we have had to stop any further backup of the Fram and Betzy project areas (i.e., /cluster/projects/nnXXXXk on Fram and Betzy).

Files that have already been backed up are still stored, but no new or changed files will be backed up.

The backup will be re-enabled when enough data has been migrated from Saga to the new NIRD storage.

Saga maintenance stop 2022-10-24

[UPDATE, 2022-10-26 19:30: The maintenance is now over. The login nodes are open again, and jobs are running again.]

[UPDATE, 2022-10-24 08:05: The maintenance has now started]

There will be a maintenance stop on Saga starting Monday 2022-10-24 at 08:00. We expect the stop to last three days.

We have set up maintenance reservations on all nodes, so jobs that would otherwise run into the reservation will remain pending in the job queue until after the maintenance stop.

Cooling problems on Fram

There is a problem with the cooling system on Fram, which leads to many compute nodes automatically shutting down. We are investigating and working on the problem.

Update, 2022-10-19 09:29

Fram is back in production. The problems we experienced yesterday were caused by a small power outage in Tromsø.

Sorry for the inconvenience this may have caused.

– Infrastructure Team