[DONE] Saga Maintenance Stop 23–24 June

[2021-06-25 08:45] The maintenance stop is now over, and Saga is back in full production. There is a new version of Slurm (20.11.7), and storage on /cluster has been reorganised. This should be largely invisible, except that we will simplify the dusage command output to only show one set of quotas (pool 1).

[2021-06-25 08:15] Part of the file system reorganisation took longer than anticipated, but we will start putting Saga back into production now.

[2021-06-23 12:00] The maintenance has now started.

[UPDATE: The correct dates are June 23–24, not July]

There will be a maintenance stop of Saga starting June 23 at 12:00. The stop is planned to last until late June 24.

During the stop, the queue system Slurm will be upgraded to the latest version, and the /cluster file system storage will be reorganised so all user files will be in one storage pool. This will simplify disk quotas.

All compute nodes and login nodes will be shut down during the stop, so no jobs will be running. Submitted jobs whose estimated run time would extend into the downtime reservation will be held in the queue.

New compute nodes on Saga

Today, Saga has been extended with 120 new compute nodes, increasing the total number of CPUs on the cluster from 9824 to 16064.

The new nodes have been added to the normal partition. They are identical to the old compute nodes in the partition, except that they have 52 CPU cores instead of 40.
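As a quick sanity check, the figures above are consistent with each other. The sketch below only redoes the arithmetic from the announcement; the variable names are illustrative.

```python
# Figures taken from the announcement above.
OLD_TOTAL_CPUS = 9824
NEW_NODES = 120
CORES_PER_NEW_NODE = 52

# CPUs contributed by the new nodes, and the resulting cluster total.
added_cpus = NEW_NODES * CORES_PER_NEW_NODE
new_total_cpus = OLD_TOTAL_CPUS + added_cpus

print(added_cpus, new_total_cpus)  # 6240 16064
```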

We hope this extension will reduce the wait time for normal jobs on Saga.

Fram MDS patched

Dear Fram User,

This morning around 09:05, the Fram metadata server crashed once again, which likely affected running jobs.

A mitigating patch was delivered by the vendor yesterday, and we used this opportunity to apply it to our metadata servers.

We will keep the system closely monitored and cooperate with the vendor on further stabilizing the system.

Apologies for any inconvenience this may have caused!

Scheduled downtime – NIRD storage expansion – 2nd of April

Update:

  • 2019-04-03 13:25: NIRD and the Service Platform are back in production.
  • 2019-04-03 10:59: Maintenance work has finished. We are now restarting the file systems and services.
  • 2019-04-03 08:22: Disk expansion and rebalancing are finished. Hardware checks are currently ongoing and should finish in a couple of hours. We will keep you posted.
  • 2019-04-02 09:55: NIRD file systems are unmounted from Fram, and replicated data is available read-only through login-trd.nird.sigma2.no
  • 2019-04-02 08:06: Maintenance work has started.
  • 2019-04-02 08:06: Maintenance work has started.

Dear NIRD User,

NIRD and the Service Platform will be under maintenance to expand the disk capacity in Tromsø.

The storage expansion and disk pool rebalancing will start on the 2nd of April at 8:00 am CET and will last for a maximum of 2 days. During the maintenance, the services running on the NIRD Service Platform and on the NIRD Toolkit will not be available.

During the downtime, we plan to make project data mirrored to Trondheim available in read-only mode through a specially built login node. This will be the first time the solution is tested under real load, so we might encounter some technical difficulties.
To access the remote, mirrored data, please log in to login-trd.nird.sigma2.no.

We apologise for the inconvenience.
Metacenter Operations

Slurm Upgrade on Fram

Update: The upgrade is now done. All seems to have gone well, but it can be a good idea to check your jobs.

We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.
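Scripts that call queue system commands during such a short window may want to retry rather than fail outright. The following is a minimal sketch, assuming that a command such as squeue exits with a non-zero status while the queue system is unavailable; the helper name `run_with_retry` and its default values are hypothetical.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay_s=120.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run `cmd` (a list of strings), retrying while it exits non-zero.

    Returns the first successful result and raises RuntimeError if every
    attempt fails. `runner` and `sleep` can be replaced for testing.
    Hypothetical helper; not part of Slurm.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            sleep(delay_s)  # wait before the next attempt
    raise RuntimeError(f"{cmd!r} still failing after {attempts} attempts")
```

With the defaults, a call such as `run_with_retry(["squeue", "-u", "someuser"])` would keep trying for roughly eight minutes, which is in the range of the expected upgrade time; the user name is a placeholder.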

2-day downtime starting on the 25th of April

Update:

  • 2018-04-30 14:46 The file system issues on Fram are now solved and access is reopened. Jobs are temporarily on hold due to some trouble with the cooling system in the server room. As soon as that is sorted out, jobs will be permitted again.
  • 2018-04-30 10:15 We are still struggling with the /cluster file system. The problem has been escalated to the vendor. At the moment we do not have a time estimate for when Fram will be back online, but work is in progress to fix this as soon as possible, hopefully during the day.
  • 2018-04-27 18:44 Unfortunately there are still problems bringing up the Lustre file system on Fram. The issue is caused by an incompatibility affecting routing between IB networks/fabrics on the Lustre object storage servers. The vendor is now planning and working to carry out an emergency update on the system. We are sorry for the trouble.
  • 2018-04-27 16:49 Access to NIRD is now reopened.
  • 2018-04-26 22:50 We are having problems bringing up the Lustre file system on Fram. The issue has been reported to the vendor. Additionally, there are some minor issues that must be addressed on NIRD before opening it for production, but we expect to reopen access to both Fram and NIRD tomorrow.

 

Dear Fram and NIRD user,

A two-day downtime is scheduled for week 17. The scheduled maintenance will start on Wednesday, 25th of April, at 09:00 AM and will affect Fram, NIRD and the Service Platform.

During this time we will:
1. Extend NIRD storage space with ~1.1PB.
– The new hardware will be coupled to NIRD and extra disks loaded to the system during these two days.
– Please note that the storage mentioned above will not be available all at once. Storage space will be added gradually as the loaded disks are formatted and made available to the file system.
– One of our top priorities is to address the inode shortage on $HOME areas.
2. Address file system related bugs on NIRD by upgrading the relevant software and tuning some parameters on the servers.
3. Fix broken hardware on Fram.
4. Apply any outstanding patches to both Fram and NIRD.
5. Carry out maintenance work on the cooling system for Fram.

There is a job reservation in place on Fram starting at 08:45 AM on the 25th of April. Jobs that cannot complete before that time will be left pending in the queue with a Reason “ReqNodeNotAvail” and an estimated start time of 2154. They will be started when the maintenance is over.
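The effect of the reservation can be pictured with a small sketch. It assumes a simplified model in which a job can start only if its start time plus its requested time limit ends at or before the reservation begins. The function name `fits_before_reservation` and the example job times are hypothetical, and Slurm's real scheduling takes more factors into account.

```python
from datetime import datetime, timedelta

# Reservation start taken from the announcement (08:45 on 25 April 2018).
RESERVATION_START = datetime(2018, 4, 25, 8, 45)


def fits_before_reservation(start, time_limit,
                            reservation_start=RESERVATION_START):
    """Return True if a job starting at `start` with the requested
    `time_limit` would end at or before the reservation begins.

    Simplified, hypothetical model; not Slurm's actual implementation.
    """
    return start + time_limit <= reservation_start


# Hypothetical jobs, both assumed able to start at 08:00 on 24 April.
could_start = datetime(2018, 4, 24, 8, 0)
short_job = fits_before_reservation(could_start, timedelta(hours=24))
long_job = fits_before_reservation(could_start, timedelta(hours=48))

print(short_job, long_job)  # True False
```

In this model the 24-hour job would run, while the 48-hour job would stay pending until the maintenance is over. Requesting a shorter time limit, where the job allows it, is one way to get work through before a downtime.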

We will keep you updated via OpsLog/Twitter.

Thank you for your consideration!
Metacenter Operations