Maintenance on NIRD, NIRD Toolkit, Fram and Saga, 20th April - 24th April

Dear NIRD, NIRD Toolkit, Fram and Saga User,

We will have a four-day scheduled maintenance on NIRD, the NIRD Toolkit, Fram and Saga, starting on the 20th of April at 09:00 AM.

During the maintenance we will:

  • carry out software and firmware updates on all systems

Files stored on NIRD will be unavailable for the duration of the maintenance, and therefore so will the services. This will of course affect the NIRD file systems mounted on Fram and Saga as well.

Login services to NIRD, the NIRD Toolkit, Fram and Saga will be disabled during the maintenance.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

FRAM – critical storage issue

UPDATE:

  • 2020-03-12 10:45: Maintenance is now finished and the faulty components have been replaced. We will continue to monitor the storage system.
    Thank you for your understanding.
  • 2020-03-11 10:16: We have to replace one hardware module on the Fram storage system. The maintenance will be carried out while keeping the system online. However, there may be short hiccups of up to 5 minutes while we fail over components on the redundant path, possibly causing some jobs to crash.
  • 2020-03-05 20:30: Maintenance is over, Fram is online. Jobs that were running before the maintenance may have been re-queued. It is also possible that some jobs were killed; we are sorry for that. If this is the case, you will have to resubmit your job (see the sketch below).
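
A minimal sketch of how you could check which of your jobs failed during the maintenance and resubmit them, assuming a standard Slurm setup; the date and the job script name job.sh are only illustrative:

# list your jobs and their states since the start of the maintenance
sacct -u $USER -S 2020-03-05 -o JobID,JobName,State,ExitCode
# resubmit a job that ended up in the FAILED or CANCELLED state
sbatch job.sh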

Dear FRAM users,

We are facing a major issue with FRAM’s storage system. The necessary tasks are being performed to mitigate the issue. We will have to take the whole machine offline to be able to perform these tasks.

FRAM – file system issue

Dear FRAM user,
We are facing some minor issue(s) with FRAM’s file system. The necessary tasks are being performed to mitigate the issue.

The above-mentioned work should not cause any downtime.

UPDATE: 28.02, 13:20: we lost the file system for a couple of minutes. Please check your jobs and get back to us in case of any problems.
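
If you want to verify your jobs after the hiccup, a minimal example using standard Slurm commands:

# show your currently running and pending jobs
squeue -u $USER
# show the state and exit code of your jobs from today
sacct -u $USER -o JobID,JobName,State,ExitCode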

Thank you for your patience and understanding.

Fram: Interconnect network manager crashed

Dear Fram users:

The Fram interconnect network manager crashed yesterday at 15:34, which left all compute nodes with degraded routing information. This can cause Slurm jobs to crash with a communication error.
The interconnect network manager is running again, all compute nodes have the latest routing information, and communication between the compute nodes is restored.
We apologize for the inconvenience. If you have any questions, please don’t hesitate to contact support.

UPDATED: Problem with Access to Projects

Quite a few users lost access to their project(s) on NIRD and on all clusters during the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.

We will update this page when access has been restored.

Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at sigma@uninett.no

Update: This applies to all systems, not only Fram and Saga.

Backup policy changes

Dear Fram and Saga User,

As you may remember, quotas were enforced on $HOME areas in October last year. This was carried out only for users having less than 20GiB in their Fram or Saga home folders.

Because of repeated space-related issues on the NIRD home folders, caused by the backups taken from Fram and Saga, we had to change our backup policies and exclude from backup users using more than 20GiB in their Fram or Saga $HOME areas.

If you manage to clean up your $HOME on Fram and/or Saga and decrease your usage below 20GiB, we can then enforce quotas on it and re-enable the backup (see the example below).
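
A small sketch, assuming standard GNU coreutils are available on the login nodes, of how you might check your current $HOME usage and find the largest directories before cleaning up:

# total size of your home directory
du -sh $HOME
# the largest top-level directories in your home, sorted by size
du -sh $HOME/* | sort -h | tail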

Thank you for your understanding!

Metacenter Operations

Quota enforcement on Fram

Dear Fram User,

We have fixed the broken quota indexes on the Fram /cluster file system.
Due to heavy disk usage on both Fram and NIRD, we need to enforce quotas on all relevant areas:

  • /cluster/home
  • /cluster/projects
  • /cluster/shared

To be able to control disk usage in the areas mentioned above, group ownerships are enforced nightly.

To avoid job crashes, please make sure before starting jobs that the Unix users and groups you are a member of have enough free quota, be it block or inode quota.

This Wikipedia page gives a good explanation of the block and inode quota types.

To check quotas on Fram, you may use the dusage command, e.g.

# list all user and group quotas
dusage -a
# for help use
dusage -h
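
To see which group quotas in the dusage output apply to you, you can list the Unix groups your user is a member of, for example:

# list the groups your user belongs to
id -Gn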

Thank you for your understanding!
Metacenter Operations

Planned maintenance on Fram on 28.11.2019

Update:

  • 2019-11-28 15:00: Fram is open for production and scheduled jobs have been resumed.
  • 2019-11-28 14:33: For more information about enforced file system quota accounting, please see this announcement.
  • 2019-11-28 13:26: Lustre quota indexes are fixed now. Running some additional cleanup and tests.
  • 2019-11-28 08:03: Maintenance has started.

Dear Fram User,

The slave quota indexes on the /cluster file system got corrupted, and as a result the quota accounting has become unpredictable. For the same reason, some of you may even have experienced job crashes.

To fix this, we will have a two-day planned downtime starting at 08:00 AM on the 28th of November.

Fram jobs which cannot finish by the 28th of November are queued up and will not start until the maintenance is finished (see the example below).
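
If you want to check whether a pending job is expected to start before or after the maintenance, a minimal example using a standard Slurm command:

# show the expected start time of your pending jobs
squeue --start -u $USER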

Thank you for your consideration!

Metacenter Operations

Fram down


2019-11-13 16:15: Fram is up and running again.

One of the cooling units stopped, which caused the other one to stop as well, and all compute nodes went down.


Dear Fram User,

Fram is currently down, likely due to issues with the cooling distribution unit.
We are currently investigating the issue and working on bringing Fram back into production.

Apologies for the inconvenience!

Metacenter Operations