[RESOLVED] Short file system hang on /nird/home on Fram

Update: There was a second hang at around 09:00. The cause of the hangs has been found and fixed.

There was a short hang on the /nird/home ($HOME) file system on Fram from 08:00 to 08:55 today. The file system is back to normal now. We are investigating the reason for it.

Jobs running in $SCRATCH, $USERWORK or in a project directory have most likely not been affected, but it is probably a good idea to check the status of your jobs.

Shared filesystem on Fram down

Update 2018-06-27 10:57: The /cluster file system is up again on Fram.

The shared filesystem on Fram (/cluster) is currently down. We are investigating it, and are trying to get it up again as soon as possible. We will update here when we know more.

/cluster file system hanging

Some of the Lustre object storage servers crashed during the night, making parts of the /cluster file system inaccessible. We are working on the problem and will keep you updated.

Metacenter Operations

Problems with $HOME on Fram

Some of the compute nodes, as well as the Fram login nodes, lost their connection to the NFS-mounted $HOME.
The login nodes were rebooted to clean up hanging processes and blocked I/O.

We are investigating this issue and working on a solution.

Thank you for your understanding!
Metacenter Operations

$HOME file system availability issues on Fram – FIXED

We are experiencing availability issues with the $HOME file system on Fram. The problem is currently under investigation and we are actively working on solving it.
Update 09:30:
The problem is now fixed.
One of the file servers exporting $HOME went down, and the failover did not work as intended.

Thank you for your understanding!
Metacenter Operations

Issues on /cluster file system

We have identified a bug on the /cluster file system which can lead to random job crashes.

The bug is triggered on the Lustre file system when running Fortran code compiled with Intel MPI.

A bug report has been filed with the storage vendor.

We will keep you updated!

Update 06-04-2018: We have found and fixed a problem on the file servers. With the tests we ran, we can no longer reproduce the problem.

Thank you for your consideration!
Metacenter Operations

Home directory file permissions

In accordance with the Data handling and Storage policy we will shortly enable automatic enforcement of file permissions on your home directories. We expect this to take place after the next maintenance stop.

This means that you may no longer grant other users/groups read or write access to your home directory. Any sharing of data between users must be done through project or work directories.
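
If you want to see whether your home directory currently grants access to other users or groups, a check along the following lines may help. This is only a minimal sketch, not the enforcement mechanism the Metacenter will use: the function names `shared_access_bits` and `make_private` are made up for illustration, and the sketch only looks at the classic Unix permission bits, not at ACLs.

```python
import os
import stat

# Hypothetical helper names; they are not part of any Metacenter tool.

def shared_access_bits(path):
    """Return the group/other permission bits set on path (0 means private)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO)

def make_private(path):
    """Clear all group/other permission bits on path, keeping the owner bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IRWXG | stat.S_IRWXO))
```

For example, `shared_access_bits(os.path.expanduser("~"))` returns a non-zero value if your home directory is readable, writable or searchable by anyone other than you.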

We take this opportunity to remind you that the Metacenter staff treat the contents of your home directory as private data. They will not be shared with other users, not even your supervisor or project leader, without your prior written consent. Should you be unable to give consent, requests will be handled in accordance with applicable laws and regulations.

Please remember to share necessary data as required before changing jobs, taking a leave of absence, and so on.

Best regards,

the Metacenter security team

Issues with $HOME file system – resolved

We are experiencing trouble with the $HOME (/nird/home) file system.
We are working on the problem and are trying to fix it as soon as possible. We will get back with further information later.

Update:
A large number of files were generated on the $HOME file system by some of the users, using up all the available inodes.
The problem was remediated at around 09:50 in the morning.
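
To get a rough idea of how many inodes a file system has left, and how many entries one of your own directories contributes, something like the following can be used. It is a sketch under assumptions: the function names `inode_usage` and `count_entries` are hypothetical, the inode figures come from the standard `os.statvfs` call (some file systems report zero or approximate totals), and counting directory entries only approximates inode consumption, since hard links share an inode.

```python
import os

# Hypothetical helper names, for illustration only.

def inode_usage(path):
    """Return (used, total, fraction_used) inodes for the file system holding path."""
    st = os.statvfs(path)
    total = st.f_files            # total inodes reported by the file system
    used = total - st.f_ffree     # total minus free inodes
    fraction = used / total if total else 0.0
    return used, total, fraction

def count_entries(root):
    """Count files and directories below root; each typically consumes one inode."""
    count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        count += len(dirnames) + len(filenames)
    return count
```

Running `count_entries` on a job output directory is a quick way to spot workflows that create very many small files.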