Fram off-line: File system issues

Dear Fram Users,

The ongoing problems on FRAM, first reported on July 1st, cause the error message “No space left on device” for various file operations.

The problems are being investigated, and we will keep you updated on the progress.

UPDATE 2020-07-08 14:50: hugemem on Fram is now operating as normal.

UPDATE 2020-07-08 10:35: The file system issues have been resolved and we are operating as normal, with the exception of hugemem, which is still unavailable. Please let us know if you’re still experiencing problems. Again, we apologize for the inconvenience.

UPDATE 2020-07-08 09:00: Our vendor has corrected the file system bug and we should be operating as normal soon. At the moment we are running some tests, which will slow down jobs currently running on Fram.

UPDATE 2020-07-07 15:35: The problem on Fram is caused by a bug in the Lustre filesystem. Our vendor is taking over the case to fix the issue. Thank you for your patience, we apologize for the inconvenience.

UPDATE 2020-07-07 09:50: We are still experiencing file system errors on FRAM, and are working to resolve the issue as soon as possible. Watch this space for updates.

UPDATE 2020-07-06 12:30: FRAM has been reopened.

UPDATE 2020-07-06 09:50: The file system is up and running and appears to be stable; this has also been verified by the vendor. It should be possible to use FRAM within a couple of hours.

UPDATE 2020-07-03 17:10: The file system is up and running, but we have decided to keep the machine closed during the weekend to make sure everything works as it should on Monday. Many of the recent FRAM downtimes have been caused by storage hardware faults. We are investigating the issue together with the storage vendor.

UPDATE 2020-07-02 13:20: FRAM is off-line and we are investigating the issues. The machine will probably stay off-line until tomorrow.

UPDATE 2020-07-02 12:10: The whole file system is still very unstable and we will most likely have to take FRAM down. A Slurm reservation has been created, and all users might be logged out soon.

UPDATE 2020-07-02 11:15: The whole file system is still very unstable and we are trying to fix the problem.

Metacenter Operations

Reminder: Auto cleanup of Stallo

Dear Stallo users,

From today (25.05.2020) we will enforce the auto cleanup of /global/work. All files with an access date older than 21 days will, in a first step, be set to read-only and at a later point moved to a trash folder.

Please move all files you want to keep to your home folder or to other storage solutions like NIRD.


See also https://hpc-uit.readthedocs.io/en/latest/storage/storage.html#work-scratch-areas
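
If you are unsure which of your files fall under the 21-day rule, the minimal sketch below lists candidates by access time. It is an unofficial example, not a supported tool; the /global/work/$USER layout is an assumption, so point WORK_DIR at your actual work area.

    # Minimal sketch (not an official tool): list files under your work area
    # whose access time is older than 21 days and would therefore be
    # picked up by the auto cleanup.
    import os
    import time

    # Assumed layout; adjust to your actual directory under /global/work.
    WORK_DIR = os.path.join("/global/work", os.environ.get("USER", ""))
    CUTOFF = time.time() - 21 * 24 * 3600  # 21 days ago, in seconds

    for root, dirs, files in os.walk(WORK_DIR):
        for name in files:
            path = os.path.join(root, name)
            try:
                atime = os.stat(path).st_atime  # last access time
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < CUTOFF:
                print(path)  # cleanup candidate; copy it elsewhere if needed

Anything listed this way should be copied to your home folder or to NIRD before the cleanup runs.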


If you have questions or need help, please contact us at migration@metacenter.no

Thank you for your understanding.

Metacenter Operations

Stallo Shutdown

Dear Stallo Users,


Stallo is getting old and will be shut down this year. Due to its age, hardware failures are causing more and more nodes to fail. The system will stay in production and continue service until at least 1 October 2020, the end of the current billing period (2020.1).
We will help you find alternatives for your computational and storage needs and move your workflows and data to one of our other machines, such as Betzy, Saga and Fram. News, updated information and how-tos will be published in the Stallo documentation as we move closer to the shutdown.
If you have questions, special needs or problems, please contact us at migration@metacenter.no

Thank you for your understanding.

UiT HPC staff

Stallo problems / urgent maintenance

Dear Stallo Users,

Due to yesterday’s time travel on the Slurm / Stallo master node, extensive machine maintenance is needed today. Work is currently ongoing to fix the reported issues on Stallo. Please pay attention to our info channels and hold new support request emails until we are on top of the manual labour on site.

Please accept our apologies for the inconvenience these troubles are causing.

UiT HPC staff

Stallo – RAM upgrade

Dear Stallo users,

The Stallo Slurm master node will have a short downtime for a memory upgrade today, Monday 27.4., from 13:00 until 15:00. No Slurm jobs will be able to start during that time. This is done in order to avoid future Slurm problems.
We apologize for the short notice. Have a nice day.

UiT HPC staff

Stallo – slurm problem

UPDATE 2020-04-16 14:55: The system should be stable and up and running again.

Dear Stallo users,

Stallo is experiencing some problems with the Slurm daemon. It is therefore currently not possible to start new jobs on Stallo. Running jobs should not be affected.
We are currently working on fixing the situation.

Thank you for your patience and understanding.

HPC staff

FRAM – critical storage issue

UPDATE:

  • 2020-03-12 10:45: Maintenance is now finished and the faulty components have been replaced. We continue to monitor the storage system.
    Thank you for your understanding.
  • 2020-03-11 10:16: We have to replace one hardware module on the Fram storage system. The maintenance will be carried out with the system kept online. However, there will be some short hiccups, up to 5 minutes, while we fail over components on the redundant path, possibly causing some jobs to crash.
  • 2020-03-05 20:30: Maintenance is over and Fram is online. Jobs that were running before the maintenance may have been re-queued. It is also possible that some jobs were killed; we are sorry for that. If this is the case, you will have to resubmit your job.

Dear FRAM users,

We are facing a major issue with FRAM’s storage system. The necessary tasks to mitigate the issue are being performed, and we will have to take the whole machine offline in order to carry them out.