Issues on /cluster file system

We have identified a bug on the /cluster file system that can lead to random job crashes.

The bug is triggered on the Lustre file system when running Fortran code compiled with Intel MPI.

A bug report has been filed with the storage vendor.

We will keep you updated!

Update 06-04-2018: We have found and fixed a problem on the file servers, and we can no longer reproduce the problem in the tests we ran.

Thank you for your consideration!
Metacenter Operations

Change in defaults for job placement on islands

Fram has been in production for half a year now, and we’ve gathered enough data to see possible improvements to the defaults. One such improvement concerns how jobs are placed with regard to the island topology on Fram. The way Fram is built, the network bandwidth within an island is far better than between islands. For certain types of jobs spanning many compute nodes, being spread over multiple islands can hurt performance.

To limit this effect we have now changed the default setup so that each job will run within one island, if that does not delay the job too much, as described here: 

https://documentation.sigma2.no/jobs/framjobplacement.html 

Note that this may lead to longer waiting times in the queue, in particular for larger jobs. If your job does not depend on high network throughput, the above-mentioned document also describes how to override the new default.

Best regards,

Metacenter Operations

Home directory file permissions

In accordance with the Data handling and Storage policy we will shortly enable automatic enforcement of file permissions on your home directories. We expect this to take place after the next maintenance stop.

This means that you may no longer grant other users/groups read or write access to your home directory. Any sharing of data between users must be done through project or work directories.
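
To check ahead of the change whether a home directory currently grants access to group or others, a minimal sketch along the following lines could be used. It assumes that plain POSIX permission bits are what matters (ACLs are not inspected), and the function name `group_other_bits` is made up for illustration; it is not a Metacenter tool.

```python
import os
import stat
from pathlib import Path


def group_other_bits(path):
    """Return the group/other permission bits set on path (0 if none)."""
    mode = os.stat(path).st_mode
    # Keep only the permission bits, then mask out the owner bits.
    return stat.S_IMODE(mode) & (stat.S_IRWXG | stat.S_IRWXO)


if __name__ == "__main__":
    home = Path.home()
    bits = group_other_bits(home)
    if bits:
        print(f"{home} grants group/other access (bits {oct(bits)})")
    else:
        print(f"{home} is private to its owner")
```

A non-zero result means that group or other users still have some access to the directory; data that others need should then be moved to a project or work directory.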

We take this opportunity to remind you that your home directory contents are treated as private data by the Metacenter staff and will not be shared with other users, not even with your supervisor or project leader, without your prior written consent. Should you be unable to give consent, requests will be handled in accordance with applicable laws and regulations.

Please remember to share necessary data as required before changing jobs, taking a leave of absence, and so on.

Best regards,

the Metacenter security team

Issues with $HOME file system – resolved

We are experiencing trouble with the $HOME (/nird/home) file system.
We are working on the problem and will try to fix it as soon as possible. We will get back with further information later.

Update:
Some users generated a large number of files on the $HOME file system, using up all the available inodes.
The problem was remediated around 09:50 in the morning.
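
As an illustration of how inode exhaustion can be spotted, the sketch below reads the inode counters of the file system that holds a given path. It is a minimal example, assuming a POSIX system where `os.statvfs` is available; the helper name `inode_usage` is hypothetical, and the numbers are file-system-wide, not per-user quota figures.

```python
import os


def inode_usage(path):
    """Return (used, total) inode counts for the file system holding path."""
    st = os.statvfs(path)
    total = st.f_files          # total number of inodes
    used = total - st.f_ffree   # inodes that are not free
    return used, total


if __name__ == "__main__":
    home = os.path.expanduser("~")
    used, total = inode_usage(home)
    if total:
        print(f"{home}: {used} of {total} inodes used ({100 * used / total:.1f}%)")
    else:
        # Some file systems do not report inode counts at all.
        print(f"{home}: file system does not report inode counts")
```

When the used count approaches the total, new files can no longer be created even if free disk space remains, which matches the symptom described above.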

Fram Downtime February 8th 2018 07:00 – POSTPONED

Due to a delay in the work done by the power company, the power outage will be postponed. No new time is currently scheduled. We are sorry for this and any trouble this may cause for you. We estimate the downtime to be no longer than 4 hours.

A new system reservation will be made when the new outage is planned. We will need to re-queue any running jobs that are not finished by the time of the outage.

on behalf of the Fram HPC staff

Steinar Trædal-Henden

Fram Downtime, February the 7th 2018 08:45 – POSTPONED

We need to schedule a downtime of Fram, due to work on the electrical circuits powering the datacenter. We are sorry for this very short notice, and any trouble this may cause for you. We estimate the downtime to be no longer than 4 hours.

A system reservation is registered in the queue system to avoid starting jobs which cannot finish by 08:15, 7th of February. Jobs with a longer walltime that have already started and are not finished by the start of the downtime will be re-queued.

on behalf of the Fram HPC staff

Steinar Trædal-Henden