NIRD crash.

The NIRD storage system crashed and was unavailable for a short period of time.
As a result, users logged in to NIRD and Fram experienced problems.
The problem has been resolved and the NIRD storage system is online again.

Please contact us if you still encounter problems.

Note: The export of NIRD to Fram is currently not working.

Vilje filesystem is back

The Vilje filesystem has been fixed with good help from DDN, and we are now open for business.

Please be aware that some files may have been lost.
Always back up your files.

Quota enforcement on Fram

Dear Fram User,

We have fixed the broken quota indexes on the Fram /cluster file system.
Due to heavy disk usage on both Fram and NIRD, we need to enforce quotas on all relevant areas:

  • /cluster/home
  • /cluster/projects
  • /cluster/shared

To be able to control disk usage on the areas mentioned above, group ownerships are enforced nightly.

To avoid job crashes, please make sure before starting jobs that your Unix user and the groups you are a member of have enough free quota, both block and inode.

This Wikipedia page gives a good explanation of the block and inode quota types.

To check quotas on Fram, you can use the dusage command, for example:

# list all user and groups quotas
dusage -a
# for help use
dusage -h
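
The two quota types count different things: block quota limits the disk space used, while inode quota limits the number of files and directories. As a rough illustration only, the sketch below uses standard tools (du, find, wc) to show what a single directory contributes to each. The function name usage_summary is hypothetical and not part of dusage, and the real quotas on /cluster are accounted per user and group ownership rather than per directory, so dusage remains the authoritative source.

```shell
# Hypothetical helper (not part of dusage): report what one directory
# contributes to each quota type. The body runs in a subshell so that
# its variables do not leak into the calling shell.
usage_summary() (
    dir=$1
    # Block usage: disk space consumed under $dir, in KiB.
    kib=$(du -sk "$dir" | cut -f1)
    # Inode usage: one entry per file or directory, including $dir itself.
    # Assumes no file names contain newlines; a hard-linked file is
    # counted once per name, so this can overestimate.
    inodes=$(find "$dir" | wc -l | tr -d ' ')
    printf '%s KiB, %s inodes\n' "$kib" "$inodes"
)
```

Running usage_summary on a directory that holds many very small files shows why both numbers matter: such a directory can exhaust an inode quota while staying far below the block quota.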

Thank you for your understanding!
Metacenter Operations

Planned maintenance on Fram on 28.11.2019

Update:

  • 2019-11-28 15:00: Fram is opened for production and scheduled jobs were resumed.
  • 2019-11-28 14:33: For more information about enforced file system quota accounting, please see this announcement.
  • 2019-11-28 13:26: Lustre quota indexes are fixed now. Running some additional cleanup and tests.
  • 2019-11-28 08:03: Maintenance has started.

Dear Fram User,

The slave quota indexes on the /cluster file system have been broken, and as a result quota accounting has become unpredictable. For the same reason, some of you may have experienced job crashes.

To fix this, we will have a planned two-day downtime starting at 08:00 on the 28th of November.

Fram jobs that cannot finish before the start of the maintenance on the 28th of November are queued and will not start until the maintenance is finished.
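
The rule behind this is simple arithmetic: a job may start only if its start time plus its requested wall-time limit falls no later than the start of the maintenance window. The sketch below illustrates that check; the function finishes_before is a hypothetical helper written for this note, not the scheduler's actual implementation.

```shell
# Hypothetical helper: succeeds if a job that starts at $1 (epoch seconds)
# with a wall-time limit of $2 seconds ends no later than the deadline $3
# (epoch seconds, here the start of the maintenance window).
finishes_before() {
    [ $(( $1 + $2 )) -le "$3" ]
}

# Example with GNU date (an assumption; other date implementations differ):
#   maintenance=$(date -d '2019-11-28 08:00' +%s)
#   finishes_before "$(date +%s)" $((24 * 3600)) "$maintenance" \
#       && echo "can start now" || echo "stays queued until after maintenance"
```

In practice this means that requesting a shorter wall-time limit can allow a job to run before the downtime instead of waiting for it to end.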

Thank you for your consideration!

Metacenter Operations

Saga – BeeGFS problem

Dear Saga User,

There was a problem with the /cluster file system on Saga. The issue has now been resolved. Please check your jobs and verify that they are running smoothly or exited cleanly.

Apologies for the inconvenience!

Metacenter Operations

FRAM down/up again

Dear Fram User,

Fram compute nodes went down due to an issue with the cooling system. We are currently bringing all the compute nodes up again and trying to find a permanent solution.

Apologies for the inconvenience!

Metacenter Operations

Vilje: /work filesystem is partially down.

Dear Vilje cluster users:

The /work filesystem on Vilje is partially down. We are working on it.

At the moment it’s very difficult to determine when we can bring the /work filesystem fully back online.
We will keep you posted.

Best Regards

Fram down

2019-11-13 16:15: Fram is up and running again.

One of the cooling units stopped, which caused the other to stop as well, and all compute nodes went down.

Dear Fram User,

Fram is currently down, likely due to issues with the cooling distribution unit.
We are investigating the issue and working on bringing Fram back into production.

Apologies for the inconvenience!

Metacenter Operations

Slurm upgraded on Saga

Slurm was upgraded to the latest version (19.05.3-2) on Saga today. This includes a fix for the problem with using “srun” for running interactive jobs.

Please let us know if you notice anything that has gone wrong after the upgrade.