2018-11-23 15:24: Fram is available for use again. The maintenance was a success and we were able to complete everything on our list. As previously announced via email, NIRD and the NIRD Service Platform are already up and running.
- 2018-11-23 09:45:
- We plan to reopen access to Fram later today. We will keep you updated.
- We are currently running benchmarks and tests on the upgraded system.
- Cooling units are now being stress-tested to pinpoint any outstanding issues.
- OpenMPI has been upgraded.
- 2018-11-22 22:48:
- The Lustre servers on Fram have been upgraded, and several tests and benchmarks were run to fine-tune parameters.
- First step of the CDU maintenance is carried out now.
- 2018-11-22 11:33:
- NIRD Service Platform is up again.
- Access to NIRD is reopened. Please note that we now have four login nodes and that the SSH fingerprint has changed.
- 2018-11-21 18:41:
- The necessary hardware replacement for NIRD was carried out, and all firmware upgrades are finalized on the storage systems at both the Tromsø and Trondheim sites.
- The NIRD file systems have been started again, and we plan to reopen access tomorrow before noon.
- Firmware has been updated on the Fram storage system.
- Several other updates, including the OpenFabrics stack and Lustre, were done in parallel.
- 2018-11-21 08:00: Maintenance has started.
Dear Fram, NIRD and Service Platform User,
On the 21st and 22nd of November we will have a scheduled maintenance on Fram, NIRD and Service Platform.
This will be a comprehensive maintenance on the national HPC and research data infrastructure, ongoing on multiple levels and sites. Due to its complexity and the amount of work involved, some parts of the infrastructure might require the downtime to be extended to the 23rd of November as well.
The work will include, but will not be limited to:
– firmware upgrades on disks, enclosures, chassis, etc.
– operating system upgrades
– queue system upgrade
– file system upgrades
– kernel upgrades
– upgrade of OpenMPI
– upgrades on the OpenFabrics stack
– maintenance on the cooling system units
Our aim is to enhance the stability and security of the infrastructure, eliminate bugs and enhance performance, while having the shortest downtime possible.
We understand that system unavailability has a big impact on your daily work, and as such we will try to bring our systems back into operation as soon as possible.
Thank you for your consideration!
Dear Fram User,
Some of you might have experienced sporadic I/O hangs on Fram recently.
In many cases the I/O hangs were caused by overloading the RPC queue on the NFS-mounted /nird/home file system. This had a negative performance impact on the compute nodes and in some cases led to job crashes.
Therefore we have decided to migrate all Fram users’ $HOME directories from /nird/home/$USER to /cluster/home/$USER, starting with the next scheduled maintenance. Preparations have been made and some accounts were already synchronized during the past few weeks.
Because we suddenly lost a large number of disks on NIRD today, we have decided, in order to avoid data loss, to stop all user I/O on NIRD and migrate the remaining user accounts over to Fram.
Starting from today – 2018-11-07 – /nird/home is unmounted from Fram, but will still be available on NIRD. Until the next scheduled maintenance, a symbolic link from /nird/home to /cluster/home is in place so that any scripts that reference the old path can be adjusted.
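For users who want to check their job scripts for the old home path, the following is a minimal sketch in Python, not an official tool. It assumes that scripts refer to the home directory through the literal prefix /nird/home; the function names and the sample script are hypothetical and only illustrate the idea.

```python
import re

# Paths taken from the announcement above.
OLD_HOME_PREFIX = "/nird/home"
NEW_HOME_PREFIX = "/cluster/home"

# Match the old prefix only as a complete path component, so that
# unrelated paths such as /nird/homes or /data/nird/home are left alone.
_OLD_PREFIX_RE = re.compile(
    r"(?<![\w.-])" + re.escape(OLD_HOME_PREFIX) + r"(?![\w.-])"
)


def find_old_home_references(text):
    """Return (line_number, line) pairs that still mention the old prefix."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if _OLD_PREFIX_RE.search(line)
    ]


def rewrite_home_paths(text):
    """Return a copy of text with the old home prefix replaced by the new one."""
    return _OLD_PREFIX_RE.sub(NEW_HOME_PREFIX, text)


if __name__ == "__main__":
    # Hypothetical job script content, used only for demonstration.
    sample = "#!/bin/bash\ncd /nird/home/$USER/run\n./simulate\n"
    for number, line in find_old_home_references(sample):
        print(f"line {number}: {line}")
    print(rewrite_home_paths(sample))
```

Reviewing the reported lines before writing anything back to disk is the safer order, since some references to /nird/home may be intentional.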
As soon as NIRD disk issues are remediated, nightly backups will be taken from Fram to /nird/home/$USER/backup/fram.
This step makes Fram less dependent on NIRD, so from this point on we will be able to schedule maintenance on NIRD without affecting running jobs.
Thank you for your understanding!
- 2018-11-07 16:36: We will have to upgrade firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
- 2018-11-07 14:53: User home directories were migrated over to /cluster/home and Fram is starting up again. We will soon reopen access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.
Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data. This means stopping all jobs and user processes, and logging users out of the systems.
We will try to copy the home directories from NIRD to Fram to be able to start up Fram again without needing to mount NIRD. If this is successful, we will be able to start up Fram again, hopefully later today. (Note that the NIRD project areas will _not_ be available until NIRD is up again.)
We will update this post with more information when we know more.
VASP 5.4.4, module VASP/5.4.4-intel-2018a, is now updated with transition state tools (vTST), the implicit solvation model (VASPsol) and occupation matrix control, each available with the unmodified, abfix and noshear modifications of the code. Binary names should be self-explanatory; please look in bin for all versions.
Gaussian 16, minor release B.01 is now installed on Fram.
Reduced capacity might be experienced due to planned maintenance of the cooling system. Almost 300 nodes will be excluded from production in order to reduce the general load and allow maintenance on the fly.
2018-09-03 13:19: We just lost cooling on all compute nodes in Fram. We are working to get the system back online.
We are currently experiencing some trouble with the file system on Fram. We are working to find the cause of the problem and fix it.
UPDATE, 13:21: The file system is stable again.
Some users may experience login issues and issues when loading modules.
- 09:50: login-1-2 has network issues and we unfortunately have to reboot it to resolve the issues.