We need to schedule a downtime at relatively short notice in order to upgrade the firmware on both the Fram and NIRD (including NIRD Toolkit) storage systems.
This is a critical and mandatory update which will increase stability, performance and reliability of our systems.
The downtime is expected to last no more than a working day.
Fram jobs that cannot finish before the 12th of February will remain queued and will not start until the maintenance is finished.
Thank you for your understanding!
As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.
This has enabled us to replace the limit on how many jobs per user the system will try to backfill with a more flexible setting.
Jobs are started and backfilled in priority order, and the priority of pending jobs increases over time. With the new setting, only a fixed number of each user’s jobs within a project accrue priority. The number is currently 10, but will probably be adjusted over time. This setting makes it easier for all users and projects to get jobs started when one user has submitted a large number of jobs over a short time, while at the same time not preventing jobs from starting when there are free resources.
Note that the setting is per user and per project, so if more than one user submits jobs in the same project, each of them will get 10 jobs with increasing priority; similarly, if one user submits jobs to several projects, the user will get 10 jobs with increasing priority in each project.
The setting is documented here.
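For readers curious about the mechanics: Slurm 18.08 introduced priority-accrue limits, which is one way such a per-user limit can be expressed. A minimal sketch, assuming the limit is attached to a QOS named `normal` (the actual Fram configuration may differ):

```shell
# Sketch only: let at most 10 pending jobs per user accrue age priority
# at a time. Requires PriorityType=priority/multifactor in slurm.conf.
sacctmgr modify qos normal set MaxJobsAccruePerUser=10
```

Jobs beyond the first 10 per user are still eligible to run when resources are free; they simply do not gain age-based priority while waiting.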
- 2019-01-11 08:30: We are starting to rebuild the remaining degraded storage pools. The storage vendor is analyzing further logs and working on a new firmware for our systems.
- 2019-01-07 11:30: The disk system on Fram should hopefully be back to normal soon, but more disks in the NIRD file system failed during the weekend and need to be replaced.
- 2019-01-04 15:29: RAID sets are now rebuilding and we expect them to finish within 24 hours.
We are experiencing hardware failures on both the Fram and NIRD storage systems. Due to the disk losses, performance is also slightly degraded at this point.
To mitigate these issues we will have to reseat I/O modules on the controllers, which might cause I/O hangs. We will keep you updated.
Update 2018-12-21: Hardware has been replaced and the /cluster file system should be fully functional again.
Some hardware on the Fram storage is failing and needs urgent replacement. For this we have to fail over disks served by two Lustre servers to other Lustre server nodes.
Some slowdowns and short I/O hangs may be encountered on the /cluster file system during the maintenance.
We apologize for the inconvenience.
Dear Fram User,
We are working on improving the queue system on Fram for best resource usage and user experience.
Work is ongoing to test new features in the latest versions of our queue system and to apply them in production as soon as we are sure they will not have a negative impact on jobs.
To give all users a more even chance to get their jobs started, we have now limited the number of jobs per user that the system will try to backfill.
We will keep you updated with new features as we implement them.
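For reference, a backfill limit of this kind is typically set via `SchedulerParameters` in `slurm.conf`; a hypothetical fragment (the actual value used on Fram is not stated here):

```shell
# slurm.conf fragment (sketch): consider at most 20 jobs per user
# in each backfill scheduling cycle.
SchedulerParameters=bf_max_job_user=20,bf_continue
```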
- 2018-11-23 15:24: Fram is now available again for use. The maintenance was a success and we were able to complete everything on our list, as previously announced via email. NIRD and the NIRD Service Platform are already up and running.
- 2018-11-23 09:45:
- We plan to re-open access to Fram later today. We will keep you updated.
- We are currently running benchmarks and tests on the upgraded system.
- Cooling units are stress-tested now to pinpoint any outstanding issues.
- OpenMPI has now been upgraded.
- 2018-11-22 22:48:
- Lustre servers on Fram have been upgraded, and several tests and benchmarks were run to fine-tune parameters.
- The first step of the cooling distribution unit (CDU) maintenance is now being carried out.
- 2018-11-22 11:33:
- NIRD Service Platform is up again.
- Access to NIRD is reopened. Please note that we now have four login nodes and that the SSH fingerprint has changed.
- 2018-11-21 18:41:
- The needed hardware replacement for NIRD was carried out, and all firmware upgrades have been finalized on the storage systems at both the Tromsø and Trondheim sites.
- The NIRD file systems are now being started again, and we plan to reopen access tomorrow before noon.
- The firmware on the Fram storage system has now been updated.
- Several other updates, including the OpenFabrics stack and Lustre, were done in parallel.
- 2018-11-21 08:00: Maintenance has started.
Dear Fram, NIRD and Service Platform User,
On the 21st and 22nd of November we will have a scheduled maintenance on Fram, NIRD and Service Platform.
This will be a comprehensive maintenance of the national HPC and research data infrastructure, ongoing on multiple levels and sites. Due to its complexity and the amount of work involved, some parts of the infrastructure might require an extension of the downtime into the 23rd of November, too.
The work will include, but will not be limited to:
– firmware upgrades on disks, enclosures, chassis, etc.
– operating system upgrades
– queue system upgrade
– file system upgrades
– kernel upgrades
– upgrade of OpenMPI
– upgrades on the OpenFabrics stack
– maintenance on the cooling system units
Our aim is to enhance the stability and security of the infrastructure, eliminate bugs and improve performance, while keeping the downtime as short as possible.
We understand that system unavailability has a big impact on your daily work, and as such we try to bring our systems back to a functional state as soon as possible.
Thank you for your consideration!
Dear Fram User,
Some of you might have experienced sporadic I/O hangs on Fram recently.
In many cases the I/O hangs were caused by overloading the RPC queue on the NFS-mounted /nird/home file system. This had a negative performance impact on the compute nodes, and in some cases led to job crashes.
Therefore we have decided to migrate all Fram users’ $HOME directories from /nird/home/$USER to /cluster/home/$USER, starting with the next scheduled maintenance. Preparations have been made, and some accounts have already been synchronized over during the past few weeks.
Since we suddenly lost a large number of disks on NIRD today, we have decided, to avoid data loss, to stop all user I/O on NIRD and migrate the remaining user accounts over to Fram.
Starting from today – 2018-11-07 – /nird/home is unmounted from Fram, but will still be available on NIRD. Until the next scheduled maintenance we have created a symbolic link from /nird/home to /cluster/home, so that any scripts referring to the old path can be adjusted.
As soon as NIRD disk issues are remediated, nightly backups will be taken from Fram to /nird/home/$USER/backup/fram.
This step has made Fram less dependent on NIRD, so from this point on we will be able to schedule maintenance on NIRD without impacting running jobs.
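While the symbolic link is in place, job scripts that hard-code the old path keep working, but they should be updated. A minimal sketch of such an update (the file name `job.sh` and the user name `myuser` are hypothetical):

```shell
# Illustration only: a job script that still uses the old home prefix
printf '#!/bin/bash\ncd /nird/home/myuser/project\n' > job.sh
# Rewrite the old prefix to the new one in place (GNU sed)
sed -i 's#/nird/home#/cluster/home#g' job.sh
cat job.sh   # the cd line now points at /cluster/home/myuser/project
```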
Thank you for your understanding!
- 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and will reopen access to NIRD and the Service Platform as soon as the emergency maintenance is finished.
- 2018-11-07 14:53: User home directories have been migrated over to /cluster/home and Fram is starting up again. We will soon re-open access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.
Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data. This means stopping all jobs and user processes, and logging users out of the systems.
We will try to copy the home directories from NIRD to Fram, so that Fram can be started without mounting NIRD. If this is successful, Fram will hopefully be back up later today. (Note that the NIRD project areas will _not_ be available until NIRD is up again.)
We will update this post with more information when we know more.
VASP 5.4.4, in the module named VASP/5.4.4-intel-2018a, has now been updated with transition state tools (vTST), the implicit solvation model (VASPsol) and occupation matrix control, all available in unmodified, abfix and noshear modifications of the code. Binary names should be self-explanatory; please look in bin for all versions.
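As a usage sketch, assuming the module is built with EasyBuild (which exposes the `$EBROOTVASP` root variable; if not, locate the bin directory via `module show`):

```shell
# Load the updated module; each build variant ships as a separate binary
module load VASP/5.4.4-intel-2018a
# List the installed binaries to find the variant you need
ls "$EBROOTVASP/bin"
```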