The downtime has started and will continue until the evening of Wednesday 8th December, or until the upgrades are done.
There will be a scheduled downtime for Betzy lasting three days starting on Monday 6th December at 08:00. Downtime will last until Thursday 9th, 20:00.
During the downtime we will conduct:
- Full upgrade of the Lustre filesystem (both servers and clients)
- Full upgrade of the InfiniBand firmware
- Full upgrade of the Mellanox InfiniBand drivers
- Minor updates to other parts of the system (Slurm, configs, etc.)
Please be aware that this also affects the storage services recently moved from NIRD to Betzy.
We apologize for the inconvenience.
Update 08.12.2021 18:00: The Betzy downtime is over, and the system is open for users. All planned updates have been performed.
[Update: 2021-11-19 09:50] The maintenance work is now done, and Saga is back in full production and running jobs as normal.
[Update, 2021-11-18 12:40] The login nodes are ready for users, who can now access and work with their data. The compute nodes are still under maintenance, so running jobs is not yet possible.
[Update, 2021-11-17 12:00] The maintenance has now started.
We will conduct firmware updates and maintenance on all of Saga next week, starting on Wednesday 17th at 12:00.
The downtime will last until 15:00 on Friday 19th, but we will bring back access to the login nodes and file system as soon as the upgrade is done on the vital parts of the system. Compute nodes will be brought back sequentially as they are updated.
Uninett will conduct a firmware upgrade of one of the routers in Tromsø this Friday, 5th November, between 11:00 and 15:00. This will not affect internal networks on Fram, NIRD or NIRD Toolkit, or any production on the systems, but external network connections may briefly disconnect or stall.
If the upgrade is successful, the other router will be upgraded next week.
Update 2021-09-23: The maintenance is now finished on both sites. Services should be back in production.
We’ll have scheduled maintenance on the NIRD Service platform on 22 and 23 September in order to perform upgrades on the clusters.
In addition to project deployments running on the service platform, the following services are affected during the maintenance:
- NIRD Toolkit
- NIRD Archive
The service platform consists of two sites, one in Tromsø and the other in Trondheim. This maintenance will be performed on one site at a time, planned as follows:
22 September: Tromsø
Services running on TOS-SP will be offline. NIRD will be accessible from login-trd.nird.sigma2.no.
23 September: Trondheim
Services running on TRD-SP will be offline. NIRD will be accessible from login-tos.nird.sigma2.no.
To check which site your project is running on, log in to the NIRD login nodes (ssh login.nird.sigma2.no) and run the following command:
readlink /projects/<project number>
Make sure to write the project number in all uppercase.
This will then output the full path to the volume, starting with either “trd” for Trondheim or “tos” for Tromsø.
[user@login0-nird-trd ~]$ readlink /projects/NS9999K
The output indicates that this project has its primary site in Tromsø (tos-project).
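The site check described above can be wrapped in a small shell function. This is only a sketch: the function name nird_site is hypothetical, and it assumes the path printed by readlink begins with "trd" or "tos", optionally after a leading slash, as the description suggests.

```shell
# Classify a NIRD project volume path by its site prefix.
# Assumption: the path starts with "trd" (Trondheim) or "tos" (Tromsø),
# possibly preceded by a single "/".
nird_site() {
    path="${1#/}"              # drop one leading slash, if present
    case "$path" in
        trd*) echo "trd" ;;    # primary site: Trondheim
        tos*) echo "tos" ;;    # primary site: Tromsø
        *)    echo "unknown" ;;
    esac
}

# Usage on a NIRD login node (NS9999K is the example project used above):
#   nird_site "$(readlink /projects/NS9999K)"
```

If the function prints "unknown", the path did not match the assumed layout, and the raw readlink output should be inspected by hand.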
If you have any questions, please do not hesitate to contact us.
Update, 2021-10-11 08:15: The maintenance is now finished, and the compute nodes are in production again. (Some nodes are still down; they will be fixed and returned to production. Also, the VNC service is not up yet; we are looking into it.)
Update, 2021-10-08 15:40: We have now opened the login nodes for users again. The work on the cooling system is taking longer than we hoped, so the compute nodes will not be available until Monday morning.
Update: The maintenance stop has now started.
UPDATE OCTOBER 4TH:
Login and file system services will be available on Friday or earlier, but running jobs will not be possible until Monday morning.
There will be a maintenance stop on Fram starting Wednesday October 6 at 12:00 and ending Friday October 8 in the afternoon. All of Fram will be down and unavailable during that time. Jobs that would not finish before the maintenance starts will be left pending until after the maintenance.
The main reason for the maintenance is the replacement of some parts of the cooling system. During the stop, the OS of the compute and login nodes will be updated from CentOS 7.7 to 7.9, and Slurm will be upgraded to 20.11.8 (the same version as on Saga).
[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the dusage command output to only show one set of quotas (pool 1).
[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.
[2021-06-23 12:00] The maintenance has now started.
[UPDATE: The correct dates are June 23–24, not July]
There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.
During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.
All compute nodes and login nodes will be shut down during this time, and no jobs will be running during this period. Submitted jobs estimated to run into the downtime reservation will be held in queue.
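Since jobs estimated to run into the downtime are held in the queue, it can help to check whether a requested time limit still fits before the stop. The following is a rough sketch with a hypothetical helper name (fits_before_maintenance); it only compares times, while the scheduler's actual decision also depends on free resources and job priority.

```shell
# Succeed (exit status 0) if a job with the given time limit, started now,
# would end no later than the start of the maintenance.
#   $1 = current time, in seconds since the epoch
#   $2 = maintenance start, in seconds since the epoch
#   $3 = requested job time limit, in minutes
fits_before_maintenance() {
    now=$1
    start=$2
    limit_min=$3
    [ $(( now + limit_min * 60 )) -le "$start" ]
}

# Example using GNU date (the date below is illustrative):
#   now=$(date +%s)
#   start=$(date -d '2021-06-23 12:00' +%s)
#   fits_before_maintenance "$now" "$start" 120 && echo "a 2-hour job could start now"
```

A job that does not fit can still be submitted; it will simply wait until the maintenance is over.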
[UPDATE, 2021-06-08 08:00] Betzy is now up and in production again.
[UPDATE] Unfortunately, the downtime is taking longer than anticipated, and will not be finished tonight. We plan on getting Betzy up again at around 08:00 tomorrow morning.
Campusservice at NTNU will conduct maintenance on the high-voltage circuits for non-redundant power on 7th June 2021, between 15:00 and 20:00. All compute nodes and login nodes will be shut down during this time, and no jobs will be running during this period. Submitted jobs estimated to run into the downtime reservation will be held in queue.
Update: The file system servers have now been fixed, and we are back online again. Thank you for your patience.
We have an ongoing performance issue with the Fram file system. We need to shut down the file servers to fix it, and therefore need three hours of downtime:
Wednesday 20th January between 12:00 and 15:00, Fram will be unavailable.
We are going to expand the storage on Saga. This will happen during week 50, between 7th and 11th December. Hopefully this will give us a few extra petabytes and enough storage for the lifetime of the system.