We have identified a bug on the /cluster file system which can lead to random job crashes.
The bug is triggered on the Lustre file system when running Fortran code compiled with Intel MPI.
A bug report has been filed with the storage vendor.
We will keep you updated!
Update 06-04-2018: We have found and fixed a problem on the file servers, and in the tests we ran we can no longer reproduce the problem.
Thank you for your consideration!
Fram has been in production for half a year now, and we’ve gathered enough data to see possible improvements to the defaults. One such improvement is related to how jobs are placed with regard to the island topology on Fram. The way Fram is built, the network bandwidth within an island is far better than between islands. For certain types of jobs spanning many compute nodes, being spread over multiple islands can have a negative impact on performance.
To limit this effect we have now changed the default setup so that each job will run within one island, if that does not delay the job too much, as described here:
Note that this may lead to longer waiting times in the queue, in particular for larger jobs. If your job does not depend on high network throughput, the above-mentioned document also describes how to override the new default.
In accordance with the Data handling and Storage policy we will shortly enable automatic enforcement of file permissions on your home directories. We expect this to take place after the next maintenance stop.
This means that you may no longer grant other users/groups read or write access to your home directory. Any sharing of data between users must be done through project or work directories.
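To see whether a directory is currently open to other users, you can inspect its permission bits. The following is a minimal sketch, not the enforcement mechanism itself: the function name `open_to_others` is hypothetical, and the check only covers classic POSIX mode bits, not ACLs.

```python
import os
import stat
from pathlib import Path


def open_to_others(mode: int) -> bool:
    """Return True if any group or other permission bit is set in mode."""
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


if __name__ == "__main__":
    # Inspect the current user's home directory (assumes a POSIX system).
    home = Path.home()
    mode = os.stat(home).st_mode
    if open_to_others(mode):
        print(f"{home} is accessible to group or others: {stat.filemode(mode)}")
    else:
        print(f"{home} is private: {stat.filemode(mode)}")
```

A mode of 0700 on the home directory corresponds to the private state described above.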
We take this opportunity to remind you that your home directory contents are treated as private data by the Metacenter staff and will not be shared with other users, not even with your supervisor or project leader, without your prior written consent. Should you be unable to give consent, requests will be handled in accordance with applicable laws and regulations.
Please remember to share necessary data as required before changing jobs, going on leave of absence, and so on.
the Metacenter security team
There is a need to remove some excess air in the primary water-cooling loop for Fram. The downtime is estimated to last 2h15m.
- 28.02.2018 11:59: Downtime is extended by one hour, until 13:00.
- 28.02.2018 13:10: Downtime is finished.
We are experiencing trouble with the $HOME (/nird/home) file system.
We are working on the problem and will try to fix it as soon as possible. We will get back with further information later.
Some users generated a large number of files on the $HOME file system, using up all the available inodes.
The problem was remediated around 09:50 in the morning.
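Inode exhaustion like this can be spotted before it causes trouble. The sketch below is an illustration only, assuming a Unix system where `os.statvfs` is available; the helper names `inode_usage` and `count_entries` are hypothetical. It reports inode usage for a file system and counts the entries below a directory, each of which typically consumes one inode.

```python
import os
from typing import Tuple


def inode_usage(path: str) -> Tuple[int, int, float]:
    """Return (total inodes, free inodes, fraction used) for the file system holding path."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    # Some file systems report 0 total inodes because they have no fixed inode table.
    used_fraction = (total - free) / total if total else 0.0
    return total, free, used_fraction


def count_entries(root: str) -> int:
    """Count the files and directories below root (root itself is not counted)."""
    return sum(len(dirs) + len(files) for _, dirs, files in os.walk(root))


if __name__ == "__main__":
    home = os.path.expanduser("~")
    total, free, used = inode_usage(home)
    print(f"inodes: {total} total, {free} free, {used:.1%} used")
    print(f"entries below {home}: {count_entries(home)}")
```

Running `count_entries` on individual subdirectories is one way to find where a large number of small files has accumulated.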
Due to a delay in the work done by the power company, the power outage will be postponed. No new time is currently scheduled. We are sorry for this and any trouble this may cause for you. We estimate the downtime to be no longer than 4 hours.
A new system reservation will be made when the new outage is planned. We will need to re-queue any running jobs that are not finished by the time of the outage.
on behalf of the Fram HPC staff
We need to schedule a downtime of Fram, due to work on the electricity circuits powering the datacenter. We are sorry for this very short notice, and any trouble this may cause for you. We estimate the downtime to be no longer than 4 hours.
A system reservation is registered in the queue system to avoid starting jobs which cannot finish by 08:15, 7th of February. Jobs with longer walltime that have already started and are not finished by the start of the downtime will be re-queued.
on behalf of the Fram HPC staff
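The reservation rule in the announcement above comes down to simple date arithmetic: a job can start only if its start time plus the requested walltime falls before the reservation begins. The sketch below illustrates that check under stated assumptions; it is not the scheduler's actual logic, the function name `finishes_before` is hypothetical, and the year 2018 and the start time are assumed for the example.

```python
from datetime import datetime, timedelta


def finishes_before(start: datetime, walltime: timedelta, reservation_start: datetime) -> bool:
    """Return True if a job started at `start` with this walltime ends by the reservation."""
    return start + walltime <= reservation_start


# Assumed year; the announcement only gives "08:15, 7th of February".
reservation = datetime(2018, 2, 7, 8, 15)
# Hypothetical start time on the evening before the downtime.
start = datetime(2018, 2, 6, 20, 0)

print(finishes_before(start, timedelta(hours=12), reservation))  # ends 08:00 -> True
print(finishes_before(start, timedelta(hours=13), reservation))  # ends 09:00 -> False
```

Requesting a walltime no longer than your job actually needs makes it more likely that the job can be started ahead of a reservation.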
The SLURM queue system hung on Fram.
The problem has been remediated and the queue system has been functional again since approximately 09:55.
This is the Metacenter operations log with announcements for the national e-infrastructure resources.
Please bookmark this page or subscribe to our RSS feed or our Twitter channel @MetacenterOps.