Slurm Upgrade on Fram

Update: The upgrade is now done. Everything seems to have gone well, but it is a good idea to check your jobs.

We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.
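One way to check your jobs after the upgrade is to list them with squeue and count them by state. The following is a minimal sketch, not an official tool: the helper names (parse_squeue, summarize, my_jobs) are made up for illustration, and it assumes squeue is on your PATH and that none of the output fields contain the "|" separator.

```python
import subprocess
from collections import Counter

# squeue output format: job id, job state (long form) and reason, "|"-separated.
SQUEUE_FORMAT = "%i|%T|%r"


def parse_squeue(text):
    """Parse the output of `squeue -h -o "%i|%T|%r"` into (jobid, state, reason) tuples."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        jobid, state, reason = line.split("|", 2)
        jobs.append((jobid, state, reason))
    return jobs


def summarize(jobs):
    """Count jobs per state, e.g. RUNNING or PENDING."""
    return Counter(state for _, state, _ in jobs)


def my_jobs(user):
    """Run squeue for one user (hypothetical helper; needs a working Slurm client)."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", SQUEUE_FORMAT],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_squeue(result.stdout)
```

On a login node, summarize(my_jobs("yourusername")) would then give a per-state count; a job you expected to be running that no longer appears in the list is worth a closer look.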

HW problems on Fram and NIRD storages

Updates:

  • 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on new firmware for our systems.
  • 2019-01-07 11:30: The disk system on Fram should hopefully be back to normal soon, but further disks in the NIRD filesystem failed during the weekend and need to be replaced.
  • 2019-01-04 15:29: Raidsets are now rebuilding and we expect them to be finished within 24 hours.

We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to disk losses, performance is also slightly degraded at this point.

To mitigate these issues we will have to reseat I/O modules on the controllers, and this might cause I/O to hang. We will keep you updated.

Stallo downtime January 10th 10:00 – January 11th 16:00

2019-01-11 14:37 Update: Stallo is back online and in production again!

Due to work on the electrical power infrastructure in the building housing Stallo, we need to power off the machine during the given period. All jobs whose walltime extends beyond the start of the power-off will be held pending in the queue until the system is up and running again.
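To estimate whether a job can still run before the power-off, add its requested walltime to the expected start time and compare the result with the start of the downtime. The sketch below is illustrative only: the function names are hypothetical, the year 2019 is assumed, only the "D-HH:MM:SS" and "HH:MM:SS" walltime formats are handled, and the scheduler's exact cutoff behaviour may differ.

```python
from datetime import datetime, timedelta

# Start of the power-off as announced above (the year 2019 is an assumption).
DOWNTIME_START = datetime(2019, 1, 10, 10, 0)


def parse_walltime(spec):
    """Parse a walltime given as 'D-HH:MM:SS' or 'HH:MM:SS' into a timedelta."""
    days = 0
    if "-" in spec:
        day_part, spec = spec.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in spec.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_before_downtime(start, walltime, downtime_start=DOWNTIME_START):
    """Return True if a job starting at `start` would end by `downtime_start`."""
    return start + parse_walltime(walltime) <= downtime_start
```

For example, fits_before_downtime(datetime(2019, 1, 10, 8, 0), "02:30:00") is False, so a job like that would be expected to stay pending until the machine is back.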

HW failures on Fram storage

Update 2018-12-21: HW is replaced and /cluster file system should be 100% functional again.

Some hardware on the Fram storage is failing and needs urgent replacement. To replace it, we have to fail over the disks served by two Lustre servers to other Lustre server nodes.
Some slowdown and short I/O hangs may occur on the /cluster file system during the maintenance.

We apologize for the inconvenience.

Login problem on NIRD

The login nodes are behind a load balancer which resets all connections every 4 hours. We are working on changes that will make connections more stable over time.

If you have problems logging in to NIRD, remove the login.nird entry from the authorized_keys files.

Queue system on Fram

Dear Fram User,

We are working on improving the queue system on Fram for the best resource usage and user experience.
Work is ongoing to test new features in the latest versions of our queue system and to apply them in production as soon as we are sure they will not have a negative impact on jobs.

To give all users a more even chance to get their jobs started, we have now limited the number of jobs per user that the system will try to backfill.

We will keep you updated with new features as we implement them.

Metacenter Operations

Vilje is online

The InfiniBand error was due to a controller module with a bad connection. This has been corrected.

The queueing system is back online. Also: 19 additional nodes have been recovered.

Three jobs were lost. We apologize for the inconvenience.

Fram up and running again.

2018-11-23 15:24: Fram is now available for use again. The maintenance was a success, and we were able to complete everything on our list. As previously announced via email, NIRD and the NIRD Service Platform are already up and running.