11:30 15-09-2020 [Update 7]: Quick heads-up: We are trying to put one of the storage servers back into production. This could result in some users/jobs experiencing short hangs. If you are in doubt about the behaviour of your jobs, please do not hesitate to contact us at firstname.lastname@example.org.
14:30 14-09-2020 [Update 6]: Most compute nodes are now running with the old Lustre client. So, as far as the most recent issues are concerned, it should be safe to submit jobs. Unfortunately, this also means that the «hung io-wait issue» may happen again. Just contact us via email@example.com if you continue to have file system issues.
12:15 14-09-2020 [Update 5]: We found the reason for the behaviour many users have reported (problems with the module system, crashes, etc.). It seems the new file system client causes this. So, the only immediate “solution” is to go back to the old version of the client. This may cause other issues; however, they are less severe than what we see now. We will announce here when it is safe to submit jobs.
10:30 14-09-2020 [Update 4]: Over the weekend, the Lustre client for the parallel file system was updated on the majority of compute nodes. However, users are still reporting issues, particularly when loading modules. It seems that the module system is not configured correctly on the updated nodes. We are looking into fixing the issue and will keep you up to date here.
Sorry for the inconvenience!
15:00 11-09-2020 [Update 3]: We are currently upgrading Lustre file system clients to mitigate a «hung io-wait issue». We are also at reduced capacity performance-wise, as one of eight io-servers is down. Full production is expected from Monday morning. A small hang is expected when the io-server is phased in. We expect the hung io-wait to go away over the next two weeks as clients are upgraded.
20:50 10-09-2020 [Update 2]: We are sorry to inform you that we are still having some issues, and the vendor has been contacted.
13:15 10-09-2020 [Update 1]: The file system is partially back in operation. This means you may use Fram, but the performance will be sub-optimal. Some jobs may be affected when we try to bring back an object storage later today.
08:15 10-09-2020: We are experiencing some issues with the Fram file system and are working on a fix. Sorry for the inconvenience.