Dear NIRD users,
We are experiencing issues with NIRD-TRD: we discovered failing disks and degraded pools. The TRD file system is up and running, but without redundancy. We are working on stabilising the storage, but the system might become unavailable.
[UPDATE 2022-02-16 21:00] Services are back in production.
[UPDATE 2022-02-16 14:30] We have received some reports about issues related to nodes, stale file handles, etc. We are working on a fix and will update here once the issue has been resolved.
We are sorry for the inconvenience.
Dear NIRD users,
We will have to do some minor maintenance on the file system controllers in Trondheim today.
We don’t expect any major outage, but a slight drop in file system performance might occur.
Thanks for your understanding.
This is our current understanding of the Bus errors:
Some jobs tend to allocate more memory on one chosen rank, e.g., rank 0, or on one rank per compute node – often the rank that runs on CPU 0. This sometimes results in memory exhaustion on the first NUMA node. If the memory is exhausted on one of the NUMA nodes, MPI communication calls ultimately result in a Bus error. Why that happens is still unclear; it is most likely related to some kernel-space drivers not being able to allocate memory. We are in the process of diagnosing this issue and have submitted a report to the vendor. However, from our experience, that will take a long time to resolve, so you are better off finding a workable solution at the application level. That would include checking the memory allocation profile of your application and making it more even among the NUMA nodes. You can check the occupation of the NUMA memory nodes by running (e.g., on the login nodes)
clush -w <nodelist> 'numactl -H | grep free'
If you see a big imbalance, with memory exhausted on one of the NUMA nodes, you can expect to get the Bus error.
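As an illustration of the check described above (not an official tool), here is a small Python sketch that parses `numactl -H` output and flags a free-memory imbalance between NUMA nodes. The sample output and the 50% threshold are assumptions for demonstration; feed in real output collected from your nodes and choose a threshold that fits your workload.

```python
# Sketch: parse `numactl -H` output and flag NUMA free-memory imbalance.
# The sample output below is illustrative; on a real system, capture the
# output of `numactl -H` on the nodes you care about.
import re

def numa_free_mb(numactl_output):
    """Return {node_id: free_MB} parsed from `numactl -H` style output."""
    free = {}
    for line in numactl_output.splitlines():
        m = re.match(r"node (\d+) free: (\d+) MB", line.strip())
        if m:
            free[int(m.group(1))] = int(m.group(2))
    return free

def imbalanced(free, threshold=0.5):
    """Heuristic: flag imbalance if the node with the least free memory
    has less than `threshold` times the free memory of the largest."""
    if len(free) < 2:
        return False
    lo, hi = min(free.values()), max(free.values())
    return lo < threshold * hi

# Hypothetical output: node 0 nearly exhausted, node 1 mostly free.
sample = """\
node 0 free: 1024 MB
node 1 free: 96000 MB
"""
free = numa_free_mb(sample)
print(free)              # {0: 1024, 1: 96000}
print(imbalanced(free))  # True: node 0 is close to exhaustion
```

If the check reports an imbalance like the one above, a job that touches most of its memory from rank 0 is a likely candidate for the Bus error described earlier.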
Update 15:00 21.01.2022: We are working on a new distribution for the Betzy compute nodes, which looks promising and should eliminate the Bus error. The distribution will be tested during the weekend and, if successful, put into production next week.
Update 15:30 24.01.2022: The Bus error is eliminated after we updated the Lustre and MOFED versions in our distribution. Please contact us if you still encounter Bus errors.
Dear Betzy users,
We will update the InfiniBand drivers on the Betzy Slurm controller server starting at 10:00 tomorrow (05.01.2022).
During this time users can NOT submit new jobs or monitor queue status, but already running jobs will continue to run.
We appreciate your understanding.
There is currently an issue with the scheduler on Betzy, which results in Slurm commands not working, new jobs not being started, or long response times to commands.
We are currently investigating the issue and will update once we know what caused it and how it can be resolved.
Update 2021-05-10: The UiB MATLAB license server is now up and running again.
We have a problem with the UiB MATLAB license server: it is unstable and crashes from time to time. Users running MATLAB from the different clusters may have problems contacting the UiB MATLAB license server.
We are working on this issue and will keep you updated.
We apologise for any inconvenience caused.
UPDATE: The maintenance stop went well, and Betzy is back in production again.
Dear Betzy users,
We will have a planned downtime on 11.05.2021, from 09:00 to 15:00. During this time we will expand the storage system on Betzy. All compute nodes are reserved; submitted jobs that will not be able to finish before the downtime will not start.
Please contact us if you have any questions.
Dear Fram users,
We have a problem with the Fram compute nodes: about 870 nodes are down for an unknown reason. We are working on the issue and will keep you updated.
Update 2020-12-22, 20:05: Most of the compute nodes have now been brought back online. There are still a few nodes that need more checking before being made available for jobs.
Update 2020-12-22, 18:04: The cooling system has been stable for the last hour after making some adjustments together with the vendor. We are slowly bringing up the nodes.
Update 2020-12-22, 16:01: In order to keep the cooling as stable as possible, we have decided to take down all high memory nodes. This way we can keep some of the normal compute nodes up for the time being. We are also working together with the vendor to make adjustments on the cooling system to ensure continued stability.
We are very sorry about the inconvenience.
Update 2020-12-22, 13:41: We have identified the cause to be the cooling system and are working on mitigating the issues. Most of the compute nodes must remain down while doing so, unfortunately.
Update 2020-12-24 10:30: The compute nodes were shut down again due to electrical problems in the machine room. According to the machine room service department, the problem has been resolved, and we are working to bring all nodes back up.
Update 2020-12-24 12:10: Most of the compute nodes on Fram are back online.
Dear users on Saga,
currently, usage on Saga’s parallel file system (everything under /cluster) is at about 93%. Some of the file system servers are already not accepting new data. If usage increases further, the performance of the parallel file system may drop significantly, some users may experience data loss, and eventually the whole cluster may come to a complete halt.
Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all storage locations where you store data, that is, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the line with $HOME in the example below) to reduce their usage as soon as possible. The quota for $HOME, if set, is 20 GiB.
[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system   User/Group    Usage     SoftLimit   HardLimit
saerda_g      $HOME         6.9 TiB   0 Bytes     0 Bytes
saerda        saerda (u)    2.8 GiB   0 Bytes     0 Bytes
In parallel, we are trying to help users reduce their usage and to increase the capacity of the file system, but these measures usually take time.
Many thanks in advance!
Dear Fram users,
We have to do emergency maintenance on the Fram storage system: one of the controllers has to be rebooted to eliminate errors. During the maintenance, /cluster file system performance will be degraded. We will post updates here.
11:50 Maintenance is over and the controller has been rebooted. File system performance is back to normal.