Quota enforcement on Fram

Dear Fram User,

We have fixed the broken quota indexes on the Fram /cluster file system.
Due to heavy disk usage on both Fram and NIRD, we need to enforce quotas on all relevant areas:

  • /cluster/home
  • /cluster/projects
  • /cluster/shared

To be able to control disk usage in the areas mentioned above, group ownership is enforced nightly.

To avoid job crashes, please make sure before starting jobs that your Unix user and the groups you are a member of have enough free quota, both block and inode.

This Wikipedia page gives a good explanation of the block and inode quota types.

To check quotas on Fram, you may use the dusage command, e.g.:

# list all user and group quotas
dusage -a
# for help use
dusage -h
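
To get a rough feel for the two quota types without dusage, the following sketch estimates both figures for a directory using only standard tools. The helper name usage_summary is hypothetical and is not part of dusage. The numbers are an approximation and may differ from what the file system quota accounting reports (for example, hard-linked files are counted once per name).

```shell
# Sketch: estimate block and inode usage of a directory with standard tools.
# usage_summary is a hypothetical helper, not part of dusage.
usage_summary() {
    dir=${1:-.}
    # Block usage: total disk space in KiB (-k), one summary line (-s).
    blocks_kib=$(du -sk "$dir" | awk '{print $1}')
    # Inode usage: each file, directory and symlink uses one inode,
    # so count every entry that find prints (including the directory itself).
    inodes=$(find "$dir" | wc -l | tr -d ' ')
    echo "blocks_kib=$blocks_kib inodes=$inodes"
}

# Example call (not run here): usage_summary "$HOME"
```

A directory with many small files can exhaust the inode quota long before the block quota, which is why both figures are worth checking.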

Thank you for your understanding!
Metacenter Operations

Planned maintenance on Fram on 28.11.2019

Update:

  • 2019-11-28 15:00: Fram is open for production and scheduled jobs have resumed.
  • 2019-11-28 14:33: For more information about enforced file system quota accounting, please see this announcement.
  • 2019-11-28 13:26: Lustre quota indexes are fixed now. Running some additional cleanup and tests.
  • 2019-11-28 08:03: Maintenance has started.

Dear Fram User,

The slave quota indexes on the /cluster file system broke, and as a result quota accounting has become unpredictable. For the same reason, some of you may even have experienced job crashes.

To fix this, we will have a planned two-day downtime starting at 08:00 on the 28th of November.

Fram jobs that cannot finish by the 28th of November will remain queued and will not start until the maintenance is finished.

Thank you for your consideration!

Metacenter Operations

Fram down

 

2019-11-13 16:15: Fram is up and running again.

One of the cooling units stopped, which caused the other to stop as well, and all compute nodes went down.

Dear Fram User,

Fram is currently down, likely due to issues with the cooling distribution unit.
We are currently investigating the issue and working on placing Fram back into production.

Apologies for the inconvenience!

Metacenter Operations

Enforcing standard quota on $HOME

Dear Fram and Saga user,

As you may know, we have a standard 20GB block quota on $HOME on the Fram and Saga HPC resources. Until now this quota has not been enforced, but due to frequent overuse and backup limitations we are compelled to enforce it, starting on 04.11.2019.

Any project-related data shall be moved to the /cluster/projects area, and unneeded data shall be removed.

We have also implemented a new backup policy: any files placed under $HOME/nobackup or $HOME/tmp will be excluded from backups.

For more information, please check the documentation pages at https://documentation.sigma2.no/.
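
As a quick self-check before the quota takes effect, a sketch like the following compares the size of a directory with a limit. The helper name over_limit is hypothetical, and the sketch assumes that 20GB means 20 GiB as measured by du; the actual quota accounting on the file system may count differently.

```shell
# Sketch: report whether a directory exceeds a size limit given in GiB.
# over_limit is a hypothetical helper; it assumes 20GB means 20 GiB,
# which may not match the real quota accounting exactly.
over_limit() {
    dir=$1
    limit_gib=${2:-20}
    # Disk usage of the directory tree in KiB.
    used_kib=$(du -sk "$dir" | awk '{print $1}')
    limit_kib=$((limit_gib * 1024 * 1024))
    if [ "$used_kib" -gt "$limit_kib" ]; then
        echo "over: $used_kib KiB used, limit $limit_kib KiB"
        return 1
    fi
    echo "ok: $used_kib KiB used, limit $limit_kib KiB"
}

# Example call (not run here): over_limit "$HOME" 20
```

If the check reports "over", listing the largest entries with du -sk "$HOME"/* | sort -rn | head can help decide what to move to /cluster/projects or remove.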

Thank you for your understanding!
Metacenter Operations

Planned maintenance on Fram on 16.10.2019

Update:

  • 2019-10-18 14:36 We have finished the reinstallation, configuration checks, QA and tests. Access to the machine has been reopened and queued jobs are running again.
  • 2019-10-18 06:12 Reinstallation of the compute nodes is much slower than anticipated, and thus re-opening of the machine is delayed. We are doing our best to finish the maintenance as soon as possible. In parallel we are conducting tests and benchmarks.
    We will keep you updated.
  • 2019-10-17 08:25 File system servers and infrastructure switches were patched yesterday.
    We are proceeding now with the upgrade of the service and the login nodes.
  • 2019-10-16 08:07 Maintenance has started.

Dear Fram User,

We will have a planned two-day downtime starting at 08:00 on the 16th of October for maintenance on the storage and the file system.

During this time we will, together with the vendor, upgrade the storage firmware, upgrade the software on the /cluster file system servers and upgrade the operating system on Fram.

This upgrade is necessary to fix the frequent issues with the metadata servers and enhance stability and security of the system.

Fram jobs that cannot finish by the 16th of October will remain queued and will not start until the maintenance is finished.

Thank you for your consideration!

Metacenter Operations

Fram machine room cooling problem

Dear Fram cluster users:

We have a problem with the cooling system in the Fram machine room.
Because of this, we have to reduce the load by reserving the entire cluster, which means that no jobs will run.
We are sorry for the inconvenience and will keep you updated.

Update 2019.07.11 08:00: Fram should be fully operational again. We are monitoring the machine and releasing compute nodes back to production.

Update 14:55: Some of the nodes have crashed, which means that some jobs may have been killed.

Update 2019.07.03 10:55: To keep the machine room temperature reasonably low with only one working CDU, we have kept 495 nodes in maintenance state, while 197 nodes are in down state. We will monitor the power consumption in the machine room and release more nodes accordingly.

Update 2019.07.08 11:55: Fram is expected to be back to its full capacity on Wednesday, 2019.07.10.

Fram development queue

Dear Fram User,

As of today we have adjusted the queue system policies to facilitate code development and testing on Fram, while limiting possible misuse of the devel queue.

devel is now adjusted to allow:

  • max 4 node jobs
  • max 30 minutes wall time
  • max 1 job per user

We have additionally introduced a short queue with the following settings:

  • max 10 node jobs
  • max 120 minutes wall time
  • max 2 jobs per user
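
The node and wall time limits above can be checked before submitting with a small sketch like the one below. The helper name fits_queue is hypothetical and is not a Fram or scheduler command; it only encodes the limits listed in this announcement and does not check the per-user job limit, which depends on what is already queued.

```shell
# Sketch: check a job request against the devel and short limits listed above.
# fits_queue is a hypothetical helper, not part of the queue system.
# Usage: fits_queue QUEUE NODES MINUTES
fits_queue() {
    queue=$1
    nodes=$2
    minutes=$3
    case "$queue" in
        devel) max_nodes=4;  max_minutes=30  ;;
        short) max_nodes=10; max_minutes=120 ;;
        *) echo "unknown queue: $queue" >&2; return 2 ;;
    esac
    # Succeeds only if both the node count and the wall time fit.
    [ "$nodes" -le "$max_nodes" ] && [ "$minutes" -le "$max_minutes" ]
}

# Example: a 6-node, 45-minute test does not fit devel but does fit short.
if fits_queue devel 6 45; then echo "devel"; elif fits_queue short 6 45; then echo "short"; fi
```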

We will continue to monitor and improve the queue system. Please stay tuned.
You may find more information here.

Metacenter Operations