The electricians did not manage to finish their work on the electrical power infrastructure in the previous time slot (January 10-11).
We therefore need to power off the machine again on January 31st. All jobs with a walltime extending beyond the start time of the power-off will be held pending in the queue until the system is up and running again.
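The hold behavior comes down to a time comparison: the scheduler starts a job only if its requested walltime lets it finish before the power-off begins. A minimal sketch of that check (the power-off hour, current time, and walltime are illustrative assumptions, not the actual schedule):

```python
from datetime import datetime, timedelta

# Illustrative values -- the 08:00 power-off hour and the job's times are assumptions.
poweroff_start = datetime(2019, 1, 31, 8, 0)   # assumed start of the power-off
now = datetime(2019, 1, 30, 12, 0)             # pretend current time
walltime = timedelta(hours=48)                 # job's requested walltime

# The job can start now only if it would finish before the power-off begins.
can_start_now = now + walltime <= poweroff_start
print(can_start_now)  # a 48-hour job submitted here would run past the power-off, so it is held
```

Jobs with short enough walltimes can still be scheduled into the remaining window, which is why trimming a job's walltime request can let it run before the downtime.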
We have lost some of the data in project NS2345K. We are in the process of recovering the lost data; affected files will hopefully be restored gradually.
Update 11:50: We have locked the project to avoid interference with the restoration process.
Update 15:25: The recovery process is still scanning through the missing inodes from the snapshots and dispatching the operations to the related nodes. Once this is done, files will start to be restored.
We are currently experiencing a high load on the home file system.
This can also prevent users from accessing the login nodes.
Update: The upgrade is now done. Everything seems to have gone well, but it is a good idea to check your jobs.
We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.
- 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on a new firmware for our systems.
- 2019-01-07 11:30: The disk system on Fram will hopefully be back to normal soon, but further disks in the NIRD file system failed during the weekend and need to be replaced.
- 2019-01-04 15:29: RAID sets are now rebuilding and we expect them to finish within 24 hours.
We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to the disk losses, performance is also slightly degraded at this point.
To mitigate these issues we will have to reseat I/O modules on the controllers, which might cause I/O hangs. We will keep you updated.
2019-11-01 14:37 Update: Stallo is back online and in production again!
Due to work on the electrical power infrastructure in the building housing Stallo, we need to power off the machine in the given period. All jobs with a walltime extending beyond the start time of the power-off will be held pending in the queue until the system is up and running again.
Update 2018-12-21: The hardware has been replaced and the /cluster file system should be 100% functional again.
We have some failing hardware on the Fram storage that needs urgent replacement. For this we have to fail over disks served by two Lustre servers to other Lustre server nodes.
Some slowdowns and short I/O hangs might be encountered on the /cluster file system during the maintenance.
We apologize for the inconvenience.
Login nodes are behind a load balancer which resets all connections every 4 hours. We are working on implementing changes that will make connections more stable over time.
If you have problems logging into NIRD, remove the login.nird entry from your authorized_keys file.
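If you want to script the cleanup, the removal can be sketched as below. The path is a scratch copy purely for illustration (the real file is `~/.ssh/authorized_keys`), and the key contents are made up:

```python
from pathlib import Path
import shutil

# Demo on a scratch copy; the real file lives at ~/.ssh/authorized_keys.
auth = Path("/tmp/nird_demo/authorized_keys")
auth.parent.mkdir(parents=True, exist_ok=True)
auth.write_text(
    "ssh-ed25519 AAAA... user@laptop\n"
    "ssh-rsa BBBB... login.nird\n"      # the entry causing login trouble
)

shutil.copy(auth, auth.with_suffix(".bak"))   # keep a backup before editing
kept = [line for line in auth.read_text().splitlines()
        if "login.nird" not in line]
auth.write_text("\n".join(kept) + "\n")
print(auth.read_text())  # only the laptop key remains
```

Keeping the `.bak` copy means the original file can be restored if the wrong line is removed.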
Dear Fram User,
We are working on improving the queue system on Fram for best resource usage and user experience.
There is ongoing work to test new features in the latest versions of our queue system and apply them in production as soon as we are sure they will not have a negative impact on jobs.
To give all users a more even chance to get their jobs started, we have now limited the number of jobs per user that the system will try to backfill.
We will keep you updated with new features as we implement them.
The InfiniBand error was due to a controller module with a bad connection. This has been corrected.
The queueing system is back online. In addition, 19 nodes have been recovered.
Three jobs were lost. We apologize for the inconvenience.