Fram development queue

Dear Fram User,

As of today, we have adjusted the queue system policies to facilitate code development and testing on Fram while limiting possible misuse of the devel queue.

The devel queue now allows:

  • max 4 nodes per job
  • max 30 minutes wall time
  • max 1 job per user

We have additionally introduced a short queue with the following settings:

  • max 10 nodes per job
  • max 120 minutes wall time
  • max 2 jobs per user
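
As an illustration of the devel and short limits listed above, here is a minimal Python sketch that checks a planned job against them before submission. The names QueueLimits, QUEUE_LIMITS and check_job are hypothetical helpers, not part of the Fram queue system, and the sketch assumes that "jobs per user" counts the jobs a user already has in that queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    max_nodes: int
    max_walltime_min: int
    max_jobs_per_user: int


# Limits copied from this announcement; the dictionary itself is illustrative.
QUEUE_LIMITS = {
    "devel": QueueLimits(max_nodes=4, max_walltime_min=30, max_jobs_per_user=1),
    "short": QueueLimits(max_nodes=10, max_walltime_min=120, max_jobs_per_user=2),
}


def check_job(queue, nodes, walltime_min, jobs_already_queued=0):
    """Return a list of limit violations; an empty list means the job fits."""
    limits = QUEUE_LIMITS[queue]
    problems = []
    if nodes > limits.max_nodes:
        problems.append(f"{nodes} nodes requested, max is {limits.max_nodes}")
    if walltime_min > limits.max_walltime_min:
        problems.append(
            f"{walltime_min} min wall time requested, max is {limits.max_walltime_min}"
        )
    # Assumption: the per-user limit includes the job being submitted now.
    if jobs_already_queued + 1 > limits.max_jobs_per_user:
        problems.append(
            f"user would have {jobs_already_queued + 1} jobs, "
            f"max is {limits.max_jobs_per_user}"
        )
    return problems
```

For example, check_job("devel", nodes=5, walltime_min=20) reports only the node limit, which suggests that such a job belongs in the short queue instead.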

We will continue to monitor and improve the queue system. Please stay tuned.
You may find more information here.

Metacenter Operations

Fram MDS patched

Dear Fram User,

This morning around 09:05, the Fram metadata server crashed once again, which likely affected running jobs.

A mitigating patch was delivered by the vendor yesterday, and we used this opportunity to apply it to our metadata servers.

We will keep the system closely monitored and cooperate with the vendor on further stabilizing the system.

Apologies for any inconvenience this may have caused!

Fram MDS crashed

Dear Fram User,

The Fram metadata server has crashed once again, which likely affected running jobs.
We are in contact with the storage vendor about patching the file system.

Apologies for the inconvenience!

Clean-up of Fram filesystem needed

Dear Fram users,

The Fram filesystem, most critically /cluster/work and /cluster/home, is running out of inodes: only 8% remain. If we run out of inodes, it will not be possible to create new files. To avoid loss of data and job crashes, we kindly ask all of you to delete, if possible, any files that you no longer need.
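
Since every file and directory consumes one inode, directories holding many small files are the best candidates for clean-up. The following is a minimal Python sketch, not an official tool, that counts entries under each immediate child of a directory. The function name count_entries is hypothetical, and the count only approximates inode usage (hard links, for instance, are counted once per name).

```python
import os
from collections import Counter


def count_entries(root):
    """Count files and directories under each immediate child of root."""
    counts = Counter()
    for child in os.scandir(root):
        if child.is_dir(follow_symlinks=False):
            total = 1  # the directory itself occupies one inode
            # os.walk does not follow symbolic links by default.
            for _dirpath, dirnames, filenames in os.walk(child.path):
                total += len(dirnames) + len(filenames)
            counts[child.name] = total
        else:
            counts[child.name] = 1
    return counts
```

Calling count_entries(path).most_common(10) on one of your own directories lists the ten children with the most entries. Walking a tree generates metadata requests of its own, so it is sensible to run this on a limited subtree rather than on a whole filesystem.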

Best regards,

Jon, on behalf of the Operations Team

MDS crash on Fram

2019-03-19 17:55 The secondary MDS server for the Lustre filesystem crashed between 17:15 and 17:45; the primary MDS server took over and restored the filesystem around 17:45. Some jobs running on Fram may have been affected. We are investigating the root cause of the incident.

MDS crash 13.03.2019

The main MDS server for the Lustre filesystem crashed between 14:00 and 14:30; the secondary MDS server took over and restored the filesystem around 14:40. Some jobs running on Fram may have been affected. We are investigating the root cause of the incident.