Stallo Shutdown

Dear Stallo Users,


Stallo is getting old and will be shut down this year. Because of the system's age, hardware failures are taking more and more nodes out of service. The system will stay in production and continue service until at least 1 October 2020, the end of the current billing period (2020.1).
We will help you find alternatives for your computational and storage needs and move your workflows and data to one of our other machines, such as Betzy, Saga and Fram. News, updated information and how-tos will be published in the Stallo documentation as we move closer to the shutdown.
If you have questions, special needs or problems, please contact us at migration@metacenter.no.

Thank you for your understanding

UiT HPC staff

Stallo problems / urgent maintenance

Dear Stallo Users,

Due to yesterday's time travel on the Slurm / Stallo master node, extensive machine maintenance is needed today. Work to fix the reported issues on Stallo is ongoing. Please follow our info channels and hold off on sending new support requests until we are on top of the manual work on site.

Please accept our apologies for the inconvenience these troubles are causing.

UiT HPC staff

Stallo – RAM upgrade

Dear Stallo users,

The Stallo Slurm master node will have a short downtime for a memory upgrade today, Monday 27.4., from 13:00 to 15:00. No Slurm jobs will be able to start during that time. The upgrade is intended to prevent future Slurm problems.
We apologize for the short notice. Have a nice day.

UiT HPC staff

Stallo – slurm problem

UPDATE 16-04-2020 14:55: The system should be stable and up and running again.

Dear Stallo user,

Stallo is experiencing problems with the Slurm daemon, so it is currently not possible to start new jobs on Stallo. Running jobs should not be affected.
We are working on fixing the situation.

Thank you for your patience and understanding

HPC staff

FRAM – critical storage issue

UPDATE:

  • 2020-03-12 10:45: Maintenance is finished now and faulty components were replaced. We continue to monitor the storage system.
    Thank you for your understanding.
  • 2020-03-11 10:16: We have to replace one hardware module on the Fram storage system. The maintenance will be carried out while keeping the system online. However, there will be a short hiccup of up to 5 minutes while we fail over components on the redundant path, which may cause some jobs to crash.
  • 2020-03-05 20:30: Maintenance is over and Fram is online. Jobs that were running before the maintenance may have been re-queued. It is also possible that some jobs were killed; we are sorry for that. If this is the case, you will have to resubmit your job.

Dear FRAM users,

We are facing a major issue with FRAM’s storage system. The necessary tasks are being performed to mitigate the issue. We will have to take the whole machine offline to perform these tasks.

FRAM – file system issue

Dear FRAM user,
We are facing some minor issue(s) with FRAM’s file system. The necessary tasks are being performed to mitigate the issue.

This work should not cause any downtime.

UPDATE: 28.02. / 13:20: We lost the file system for a couple of minutes. Please check your jobs and get back to us if you find any problems.

Thank you for your patience and understanding