[UPDATE, 2022-12-19 12:50] The change has been implemented on Saga too now.
We have made a small change to the configuration of the queue system on Betzy and Fram. With this change, if one of the processes started by “srun” in a job fails (for instance due to a segmentation fault), “srun” will kill the remaining processes of that job step (just like “mpirun” does). Previously, the remaining processes were left running, possibly until the job timed out. This should solve many of the cases where failed jobs are not terminated but keep running until they time out.
The same change will be applied to Saga in about two weeks.
The new behaviour is especially useful in combination with “set -e” or “set -o errexit” earlier in the job script: the script then exits as soon as an “srun” fails because one of its processes failed, which ends the whole job.
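The effect of “set -e” can be tried without Slurm. The sketch below is only an illustration: “fake_srun” is a hypothetical stand-in for an “srun” step that failed, and “run_job_body” is a made-up helper that runs the same two commands with and without errexit.

```bash
#!/bin/bash
# Sketch only: "fake_srun" is a hypothetical stand-in for an "srun" step that
# exits non-zero because one of its processes failed. No Slurm is needed.
fake_srun() {
    echo "step '$*' failed" >&2
    return 1
}

# Run the same small "job script" body in a subshell, with or without errexit.
run_job_body() {
    local use_errexit=$1
    (
        if [ "$use_errexit" = "yes" ]; then
            set -e                  # same as: set -o errexit
        fi
        fake_srun ./my_program      # fails, like srun after a process crashes
        echo "post-processing"      # only reached when errexit is off
    )
}

# Capture what each variant prints, plus the exit status of the body.
without_errexit=$(run_job_body no 2>/dev/null; echo "status=$?")
with_errexit=$(run_job_body yes 2>/dev/null; echo "status=$?")

echo "without set -e: ${without_errexit}"
echo "with set -e:    ${with_errexit}"
```

Without errexit the body carries on to the next command and finishes with status 0, so the failure goes unnoticed. With errexit it stops at the failing step and returns a non-zero status.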
If one wants the old behaviour of “srun”, one can override the configuration by using “srun --kill-on-bad-exit=0” instead of just “srun”.
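As a rough sketch, a job script that uses the new default for one step and the old behaviour for another could look like this. The job name and program names are hypothetical placeholders, and the usual account, time and resource options are left out.

```bash
#!/bin/bash
#SBATCH --job-name=example      # hypothetical; add your usual account, time and resource options

set -o errexit                  # exit the job script as soon as a command fails

# New default: if one process fails, srun kills the rest of this job step.
srun ./main_program             # hypothetical program name

# Old behaviour for this step only: remaining processes keep running
# after one of them fails.
srun --kill-on-bad-exit=0 ./tolerant_program   # hypothetical program name
```

The override applies only to the “srun” it is given to, so other steps in the same job keep the new default.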