Fram: Slurm crashed

The Slurm controller on Fram has crashed. We are investigating.

Update 08.07.2022 14:00: The Fram workload manager node (Slurm) crashed again, and all running jobs have died.

We have discovered a hardware issue. We are in contact with the vendor and are running tests. We will keep you updated.

Access to the login nodes will remain open until the planned Fram downtime.

Betzy filesystem issue

Dear Betzy cluster users:

We are experiencing I/O problems on the Betzy filesystem. Overall filesystem usage is around 73%, but some filesystem servers are more than 90% full. This causes slow I/O, inconsistent performance, and Lustre disconnects. The following directories are affected: /cluster/shared, /cluster/projects, and /cluster/work. To keep usage down and improve I/O performance and stability on Betzy, we ask all users to remove unneeded data from those directories.
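As a starting point for finding cleanup candidates, the following is a minimal sketch and not an official tool. The function name find_stale_files and its default age and size limits are assumptions made for illustration. It only lists files, it deletes nothing, and it assumes a find that accepts size suffixes such as M (GNU find does).

```shell
# List regular files under a directory that are larger than a given size
# and have not been modified for more than a given number of days.
# The function name and the defaults (90 days, 100M) are illustrative
# assumptions, not site policy. Nothing is deleted.
find_stale_files() {
    dir=$1
    days=${2:-90}
    min_size=${3:-100M}
    find "$dir" -type f -mtime +"$days" -size +"$min_size" -print
}
```

For example, find_stale_files /cluster/work 90 1G would list files under /cluster/work that are larger than 1 GiB and untouched for more than 90 days. Review the list before removing anything, and note that scanning a large tree puts load on the filesystem itself, so point it at your own directories.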

We are working on moving data between different disk pools, which we hope will fix the I/O issues. The challenge is that the data being moved must be guaranteed unused (no open files) during the process. We are looking into doing this while in production. If this is not possible, we will have to call an emergency maintenance stop.

We will keep you updated.


Update 04.07.2022 12:30: We have managed to balance filesystem usage over the last two weeks, so the I/O problems should be resolved. Please contact us if you encounter any I/O problems on Betzy.

[UPDATED] NIRD Trondheim – controller maintenance

[UPDATE 2022-02-16 21:00] Services are back in production.

[UPDATE 2022-02-16 14:30] We have received some reports of issues related to nodes, stale file handles, etc. We are working on a fix and will post an update here when the issue has been resolved.

We are sorry for the inconvenience.

Dear NIRD users,

We will have to do some minor maintenance on the file system controllers in Trondheim today.

We don’t expect any major outage, but a slight drop in file system performance might occur.

Thanks for your understanding.

Betzy: The Bus error

This is our current understanding of the Bus errors:

Some jobs tend to allocate more memory on one chosen rank, e.g., rank 0, or one rank per compute node, often the rank that runs on CPU 0. This sometimes results in memory exhaustion on the first NUMA node. If the memory is exhausted on one of the NUMA nodes, calling MPI communication routines ultimately results in a Bus error. Why that happens is still unclear; it is most likely related to some kernel-space drivers not being able to allocate memory. We are diagnosing this issue and have submitted a report to the vendor. However, in our experience that will take a very long time to resolve, so you are better off finding a workable solution at the application level. That means checking the memory allocation profile of your application and making it more even among the NUMA nodes. You can check the occupancy of the NUMA memory nodes by running the following (e.g., from the login nodes), where <nodes> is a placeholder for the compute nodes your job is running on:

clush -w <nodes> numactl -H | grep free

If you see a big imbalance, with memory exhausted on one of the NUMA nodes, you can expect to get the Bus error.
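To make such an imbalance easier to spot, the sketch below summarizes the size and free lines that numactl -H prints for each NUMA node. The function name numa_free_report and the 10% default threshold are assumptions chosen for illustration. The threshold is not a documented limit at which the Bus error occurs.

```shell
# Summarize free memory per NUMA node from `numactl -H` output on stdin.
# Expects lines of the form "node 0 size: 128000 MB" and
# "node 0 free: 1200 MB". The function name and the default threshold
# (10 percent free) are illustrative assumptions.
numa_free_report() {
    awk -v threshold="${1:-10}" '
        /^node [0-9]+ size:/ { size[$2] = $4; order[n++] = $2 }
        /^node [0-9]+ free:/ { free[$2] = $4 }
        END {
            # Report nodes in the order they appeared in the input.
            for (i = 0; i < n; i++) {
                id = order[i]
                pct = (size[id] > 0) ? 100 * free[id] / size[id] : 0
                flag = (pct < threshold) ? " LOW" : ""
                printf "node %s: %d of %d MB free (%.1f%%)%s\n", id, free[id], size[id], pct, flag
            }
        }'
}
```

On a single node this can be run as numactl -H | numa_free_report, optionally with a different threshold as the first argument. A node flagged LOW while the others have plenty of free memory is the kind of imbalance described above.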

Update 15:00 21.01.2022: We are working on a new distro for the Betzy compute nodes. It looks promising and should eliminate the Bus error. The distro will be tested during the weekend and is then planned to go into production next week.

Update 15:30 24.01.2022: The Bus error was eliminated after we updated the Lustre and MOFED versions in our distro. Please contact us if you still encounter the Bus error.

[Resolved] UiB MATLAB License server is down

Update 2021-05-10: The UiB MATLAB license server is now up and running again.

Dear users,
We have a problem with the UiB MATLAB license server: it is unstable and crashes from time to time. Users running MATLAB on the different clusters will have problems contacting the UiB MATLAB license server.

We are working on this issue and will keep you updated.

We apologise for any inconvenience caused.

Best Regards

[SOLVED] Betzy downtime May 11, 2021

UPDATE: The maintenance stop went well, and Betzy is back in production again.

Dear Betzy users,
We will have a planned downtime on 11.05.2021, from 09:00 to 15:00. During this time we will expand the storage system on Betzy. All compute nodes are reserved; submitted jobs that cannot finish before the downtime will not start.

Please contact us if you have any questions.

Best Regards

Support team.