[Update, 2022-04-30 11:10] The Fram and Saga maintenance is now over, and jobs are running again.
[Update, 2022-04-29 08:00] The Fram and Saga maintenances have now started.
[Update, 2022-04-28 12:56] The Betzy maintenance is now over, and jobs are starting again.
[Update, 2022-04-28 08:00] The Betzy maintenance has now started.
There will unfortunately be maintenance stops on all NRIS clusters next week, for an important security update. The maintenance stops will be:
- Betzy: Thursday, April 28 at 08:00
- Fram and Saga: Friday, April 29 at 08:00
We expect the stops will last a couple of hours. We have set up maintenance reservations on all nodes on the clusters, so jobs that would have run into the reservation will be left pending in the job queue until after the maintenance stop.
We are sorry for the inconvenience this creates. We had hoped to be able to apply the security update with jobs running, but that turned out not to be possible.
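If you want to check whether one of your jobs is being held back by the maintenance reservation, a sketch like the following should work (the output format string is just one possible choice; jobs blocked by a reservation typically show a pending reason such as ReqNodeNotAvail):

```shell
# List your jobs with job ID, state, and the scheduler's pending reason.
# A job held for the maintenance reservation will usually be PENDING with
# a reason like "ReqNodeNotAvail, Reserved for maintenance".
squeue -u $USER -o "%.10i %.10T %.40r"
```

Jobs with a time limit short enough to finish before the stop starts will be scheduled as usual.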
The queue system configuration of the GPU nodes on Betzy had an error: the number of CPUs was set to 128 instead of 64. Most jobs were probably not affected by this, but it is possible that some jobs got sub-optimal CPU pinnings.
This has now been fixed, and the documentation updated. Users do not need to change their job scripts (unless they asked for more than 64 CPUs per node).
There will be a short maintenance stop of Saga and Betzy on Thursday, February 3 at 15:00 CET, due to work on the cooling system in the data hall. The downtime is planned to last for three hours.
During the downtime, no jobs will run, but the login nodes and the /cluster file system will be up. Jobs that cannot finish before 15:00 on February 3 will be left pending in the queue until after the stop.
The login node login-1 on Fram crashed and is currently being rebooted. It should hopefully be back in a few minutes.
We just had to reboot login-2 on Fram due to a runaway user process that used up all memory and was not killable. The machine became unresponsive and had to be restarted.
The /cluster file system on Fram is currently slow. We are investigating the cause, and will update here once we know more.
Due to changes in recent versions of Slurm, we have changed the recommended way to run interactive jobs on Saga and Fram (but not yet on Betzy) to using salloc instead of srun. See updated documentation here.
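As a minimal sketch, an interactive job with salloc could look like the following; the project account nnXXXXk and the resource values are placeholders, so substitute your own:

```shell
# Request an interactive allocation (placeholder account and resources).
salloc --account=nnXXXXk --ntasks=1 --mem-per-cpu=4G --time=00:30:00

# Once the allocation is granted, salloc starts a shell on the login node
# with the allocation active; launch work on the compute node with srun:
#   srun hostname

# Exit the shell (or let the time limit expire) to release the allocation.
```

Unlike the old srun-based recipe, the salloc shell itself does not consume one of the allocated tasks.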
Update, 2021-10-11 08:15: The maintenance is now finished, and the compute nodes are in production again. (There are still some nodes down; they will be fixed and returned to production. Also, the VNC service is not up yet. We are looking into it.)
Update, 2021-10-08 15:40: We have now opened the login nodes for users again. The work on the cooling system is taking longer than we hoped, so the compute nodes will not be available until Monday morning.
Update: The maintenance stop has now started.
Update, October 4:
Login and file system services will be available during Friday or earlier, but running jobs will not be possible until Monday morning.
There will be a maintenance stop on Fram starting Wednesday, October 6 at 12:00 and ending Friday, October 8 in the afternoon. All of Fram will be down and unavailable during that time. Jobs that would not finish before the maintenance starts will be left pending until after the maintenance.
The main reason for the maintenance is replacements of some parts of the cooling system. During the stop, the OS of compute and login nodes will be updated from CentOS 7.7 to 7.9, and Slurm will be upgraded to 20.11.8 (the same version as on Saga).
[2021-06-23 14:26] The issue is now solved and the jobs have now started to run. Please report if you experience any further issues to email@example.com
[2021-06-23 09:20] We are again experiencing problems on Betzy. We will update here when we’ve solved the issue.
[2021-06-22 11:15] The problem has been located and fixed, and Betzy should work as normal again.
[2021-06-22 09:30] We are currently experiencing network problems on Betzy. We don’t know the full extent of it, but it is at least affecting the queue system, so all Slurm-related commands are hanging.
We are investigating, and will update when we know more.
[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the dusage command output to only show one set of quotas (pool 1).
[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.
[2021-06-23 12:00] The maintenance has now started.
[UPDATE: The correct dates are June 23–24, not July]
There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.
During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.
All compute nodes and login nodes will be shut down during this time, and no jobs will run during this period. Submitted jobs estimated to run into the downtime reservation will be held in the queue until after the stop.