Dear NIRD users,
We are experiencing issues with NIRD-TRD: we discovered failing disks and degraded pools. The TRD file system is up and running, but without redundancy. We are working on stabilising the storage, but the system might become unavailable.
[UPDATE 2022-02-16 21:00] Services are back in production.
[UPDATE 2022-02-16 14:30] We have received some reports about issues related to nodes, stale file handles, etc. We are working on a fix and will update here once the issue has been resolved.
We are sorry for the inconvenience.
Dear NIRD users,
We will have to do some minor maintenance on the file system controllers in Trondheim today.
We don’t expect any major outage, but a slight drop in file system performance might occur.
Thanks for your understanding.
This is our current understanding of the Bus errors:
Some jobs tend to allocate more memory on one chosen rank, e.g., rank 0, or on one rank per compute node – often the rank that runs on CPU 0. This sometimes results in memory exhaustion on the first NUMA node. If the memory is exhausted on one of the NUMA nodes, MPI communication calls ultimately result in a Bus error. Why that happens is still unclear; it is most likely related to some kernel-space drivers not being able to allocate memory. We are in the process of diagnosing this issue and have submitted a report to the vendor. However, from our experience, that will take a long time to resolve, so you are better off finding a workable solution at the application level. That would include checking the memory allocation profile of your application and making it more even among the NUMA nodes. You can check the occupation of the NUMA memory nodes by running (e.g., on the login nodes)
clush -w <nodelist> 'numactl -H | grep free'
If you see a big imbalance, with memory exhausted on one of the NUMA nodes, you can expect to get the Bus error.
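As an illustration of the check described above (not an official tool), here is a small Python sketch that parses `numactl -H` output and flags a free-memory imbalance between NUMA nodes. The sample output and the 50% threshold are assumptions for demonstration; feed in real output collected from your nodes and choose a threshold that fits your workload.

```python
# Sketch: parse `numactl -H` output and flag NUMA free-memory imbalance.
# The sample output below is illustrative; on a real system, capture the
# output of `numactl -H` on the nodes you care about.
import re

def numa_free_mb(numactl_output):
    """Return {node_id: free_MB} parsed from `numactl -H` style output."""
    free = {}
    for line in numactl_output.splitlines():
        m = re.match(r"node (\d+) free: (\d+) MB", line.strip())
        if m:
            free[int(m.group(1))] = int(m.group(2))
    return free

def imbalanced(free, threshold=0.5):
    """Heuristic: flag imbalance if the node with the least free memory
    has less than `threshold` times the free memory of the largest."""
    if len(free) < 2:
        return False
    lo, hi = min(free.values()), max(free.values())
    return lo < threshold * hi

# Hypothetical output: node 0 nearly exhausted, node 1 mostly free.
sample = """\
node 0 free: 1024 MB
node 1 free: 96000 MB
"""
free = numa_free_mb(sample)
print(free)              # {0: 1024, 1: 96000}
print(imbalanced(free))  # True: node 0 is close to exhaustion
```

If the check reports an imbalance like the one above, a job that touches most of its memory from rank 0 is a likely candidate for the Bus error described earlier.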
Update 15:00 21.01.2022: We are working on a new distribution for the Betzy compute nodes, which looks promising and should eliminate the Bus error. The distribution will be tested during the weekend and, if successful, put into production next week.
Update 15:30 24.01.2022: The Bus error is eliminated after we updated the Lustre and MOFED versions in our distribution. Please contact us if you still encounter Bus errors.
Dear Betzy users,
We will update the InfiniBand drivers on the Betzy Slurm controller server starting at 10:00 tomorrow (05.01.2022).
During this time users can NOT submit new jobs or monitor queue status, but already running jobs will continue to run.
We appreciate your understanding.
There is currently an issue with the scheduler on Betzy, which results in Slurm commands not working, new jobs not being started, or long response times to commands.
We are currently investigating the issue and will update once we know what caused it and how it can be resolved.
Update 2021-05-10: The UiB MATLAB license server is now up and running again.
We have a problem with the UiB MATLAB license server: it is unstable and crashes from time to time. Users running MATLAB from the different clusters may have problems contacting the UiB MATLAB license server.
We are working on this issue and will keep you updated.
We apologise for any inconvenience caused.
UPDATE: The maintenance stop went well, and Betzy is back in production again.
Dear Betzy users,
We will have a planned downtime on 11.05.2021, from 09:00 to 15:00. During this time we will expand the storage system on Betzy. All compute nodes are reserved; submitted jobs that will not be able to finish before the downtime will not start.
Please contact us if you have any questions.
Dear Fram users,
We have a problem with the Fram compute nodes: about 870 nodes are down for an unknown reason. We are working on the issue and will keep you updated.
Update 2020-12-22, 20:05: Most of the compute nodes have now been brought back online. There are still a few nodes that need more checking before being made available for jobs.
Update 2020-12-22, 18:04: The cooling system has been stable for the last hour after making some adjustments together with the vendor. We are slowly bringing up the nodes.
Update 2020-12-22, 16:01: In order to keep the cooling as stable as possible, we have decided to take down all high memory nodes. This way we can keep some of the normal compute nodes up for the time being. We are also working together with the vendor to make adjustments on the cooling system to ensure continued stability.
We are very sorry about the inconvenience.
Update 2020-12-22, 13:41: We have identified the cause to be the cooling system and are working on mitigating the issues. Most of the compute nodes must remain down while doing so, unfortunately.
Update 2020-12-24 10:30: The compute nodes were shut down again due to electrical problems in the machine room. According to the machine room service department, the problem has been resolved, and we are working to bring all nodes back up.
Update 2020-12-24 12:10: Most of the compute nodes on Fram are back online.
Dear users on Saga,
currently, usage on Saga’s parallel file system (everything under /cluster) is at about 93%. Some of the file system servers are already not accepting new data. If usage increases further, the performance of the parallel file system may drop significantly, some users may experience data loss, and eventually the whole cluster may come to a complete halt.
Therefore, we kindly ask all users with large usage (check with the command dusage) to clean up unneeded data. Please check all storage locations where you store data, that is, $USERWORK, project folders (/cluster/projects/...) and shared folders (/cluster/shared/...). In particular, we ask users whose $HOME quota is not (yet) enforced (see the line with $HOME in the example below) to reduce their usage as soon as possible. The quota for $HOME, if set, is 20 GiB.
[saerda@login-3.SAGA ~]$ dusage -u saerda
Block quota usage on: SAGA
File system   User/Group    Usage     SoftLimit   HardLimit
saerda_g      $HOME         6.9 TiB   0 Bytes     0 Bytes
saerda        saerda (u)    2.8 GiB   0 Bytes     0 Bytes
In parallel, we are trying to help users reduce their usage and to increase the capacity of the file system, but these measures usually take time.
Many thanks in advance!
Dear Fram users,
We have to do emergency maintenance on the Fram storage system: one of the controllers has to be rebooted to eliminate errors. During the maintenance, /cluster file system performance will be degraded. We will post updates here.
11:50 Maintenance is over and the controller has been rebooted. File system performance is back to normal.