We apologize for the long downtime and appreciate your patience.
- 2018-04-30 14:46 File system issues on Fram are solved now, and access has been reopened. Jobs are temporarily on hold due to problems with the cooling system in the server room. As soon as that is sorted out, jobs will be permitted again.
- 2018-04-30 10:15 We are still struggling with the /cluster file system. The problem has been escalated to the vendor. At the moment we do not have an estimate for when Fram will be back online, but work is in progress to fix this as soon as possible, hopefully during the day.
- 2018-04-27 18:44 Unfortunately, there are still problems bringing up the Lustre file system on Fram. The issue is caused by an incompatibility affecting routing between InfiniBand networks/fabrics on the Lustre object storage servers. The vendor is planning an emergency update of the system. We are sorry for the trouble.
- 2018-04-27 16:49 Access to NIRD is reopened now.
- 2018-04-26 22:50 We are having problems bringing up the Lustre file system on Fram. The issue has been reported to the vendor. Additionally, some minor issues must be addressed on NIRD before opening it for production, but we expect to reopen access to both Fram and NIRD tomorrow.
Dear Fram and NIRD user,
A two-day downtime is scheduled for week 17. The scheduled maintenance will start on Wednesday, the 25th of April, at 09:00 AM and will affect Fram, NIRD and the Service Platform.
During this time we will:
1. Extend NIRD storage space with ~1.1PB.
– The new hardware will be coupled to NIRD and extra disks loaded to the system during these two days.
– Please note that the above advertised storage will not be available at once. Storage space is gradually added as soon as loaded disks are formatted and available to the file system.
– One of our top priorities is to address the inode shortage on $HOME areas.
2. Address file system related bugs on NIRD by upgrading the relevant software and tuning some parameters on the servers.
3. Fix broken hardware on Fram.
4. Apply any outstanding patches to both Fram and NIRD.
5. Carry out maintenance work on the cooling system for Fram.
There is a job reservation in place on Fram starting at 08:45 AM on the 25th of April. Jobs that cannot complete before that time will be left pending in the queue with reason “ReqNodeNotAvail” and an estimated start time in the year 2154. They will be started when the maintenance is over.
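To see whether a pending job is held by the reservation, you can inspect the job's reason field. A minimal sketch of filtering for that reason (the `printf` lines below stand in for real `squeue` output so the example is self-contained; the job IDs and times are made up):

```shell
# Sketch: pick out job IDs pending with reason "ReqNodeNotAvail".
# On Fram you would pipe real queue output instead, e.g.:
#   squeue -u "$USER" -o '%i %r %S' | awk '$2 == "ReqNodeNotAvail" { print $1 }'
printf '%s\n' \
  '123 ReqNodeNotAvail 2154-01-01T00:00:00' \
  '124 Resources 2018-04-25T07:00:00' |
awk '$2 == "ReqNodeNotAvail" { print $1 }'   # prints: 123
```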
We will keep you updated via OpsLog/Twitter.
Thank you for your consideration!
We are experiencing availability issues for $HOME file system on Fram. The problem is currently under investigation and we are actively working on solving it.
The problem is fixed now. One of the file servers exporting $HOME went down, and the failover did not work as intended.
Thank you for your understanding!
We have identified a bug on the /cluster file system which can lead to random job crashes.
The bug is triggered on the Lustre file system by Fortran code compiled with Intel MPI.
A bug report has been filed with the storage vendor.
We will keep you updated!
Update 06-04-2018: We have found and fixed a problem on the file servers, and with the tests we ran we can no longer reproduce the problem.
Thank you for your consideration!
Fram has been in production for half a year now, and we have gathered enough data to see possible improvements to the defaults. One such improvement relates to how jobs are placed with regard to the island topology on Fram. The way Fram is built, the network bandwidth within an island is far better than between islands. For certain types of jobs spanning many compute nodes, being spread over multiple islands can have a negative impact on performance.
To limit this effect we have now changed the default setup so that each job will run within one island, if that does not delay the job too much, as described here:
Note that this may lead to longer waiting times in the queue, in particular for larger jobs. If your job does not depend on high network throughput, the document mentioned above also describes how to override the new default.
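As a hedged illustration only, a job script overriding the one-island default might look like the following, assuming Slurm's generic `--switches=<count>[@max-time]` option (the job name, node count, program and exact value to use are hypothetical; consult the document mentioned above for the values that apply on Fram):

```shell
#!/bin/bash
# Sketch of a job script that opts out of the one-island placement default.
#SBATCH --job-name=wide-job
#SBATCH --nodes=64
#SBATCH --time=01:00:00
# Allow the scheduler to spread the job over up to two islands (switches)
# instead of waiting for a single island to become free:
#SBATCH --switches=2
srun ./my_program
```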
In accordance with the Data handling and Storage policy we will shortly enable automatic enforcement of file permissions on your home directories. We expect this to take place after the next maintenance stop.
This means that you may no longer grant other users/groups read or write access to your home directory. Any sharing of data between users must be done through project or work directories.
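As an illustration of what the enforced permissions amount to, the following shell sketch makes a directory owner-only (the temporary directory and mode here are illustrative; on Fram the policy applies to your $HOME):

```shell
# Illustrative sketch: make a directory private (owner-only access),
# matching the kind of permissions the policy enforces on home directories.
# A temporary directory is used so the example is self-contained.
d=$(mktemp -d)
chmod 700 "$d"          # rwx for the owner, no access for group/others
stat -c '%a' "$d"       # prints: 700
rmdir "$d"
```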
We take this opportunity to remind you that your home directory contents are treated as private data by the Metacenter staff and will not be shared with other users, even with your supervisor or project leader without your prior, written consent. Should you be unable to give consent, requests will be handled in accordance with applicable laws and regulations.
Please remember to share necessary data as required before changing jobs, leaves of absence and so on.
the Metacenter security team
There is a need to remove some excess air in the primary water-cooling loop for Fram. The downtime is estimated to last 2h15m.
- 28.02.2018 11:59: The downtime is extended by one hour, until 13:00.
- 28.02.2018 13:10: Downtime is finished.
We are experiencing troubles with the $HOME (/nird/home) file system.
We are working on the problem and will try to fix it as soon as possible. We will get back with further information later.
A large number of files were generated on the $HOME file system by some users, using up all the available inodes.
The problem was remediated around 09:50 in the morning.
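When hunting down inode exhaustion like this, a useful first step is counting files per directory to find where they accumulate. A minimal sketch, illustrated on a temporary directory so it is self-contained (on the real system you would point it at a $HOME area):

```shell
# Sketch: count files under each subdirectory, largest first, to locate
# the source of inode exhaustion. The directory tree here is made up.
d=$(mktemp -d)
mkdir "$d/a" "$d/b"
touch "$d/a/f1" "$d/a/f2" "$d/a/f3" "$d/b/f1"
for sub in "$d"/*/; do
  printf '%s %s\n' "$(find "$sub" -type f | wc -l)" "$sub"
done | sort -rn            # the directory with the most files comes first
rm -rf "$d"
```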
Due to a delay in the work done by the power company, the power outage will be postponed; no new time is currently scheduled. We are sorry for this and any trouble it may cause for you. We estimate the downtime to be no longer than 4 hours.
A new system reservation will be made when the new outage is planned. We will need to re-queue any running jobs that are not finished by the time of the outage.
on behalf of the Fram HPC staff
We need to schedule a downtime of Fram, due to work on the electricity circuits powering the datacenter. We are sorry for this very short notice, and any trouble this may cause for you. We estimate the downtime to be no longer than 4 hours.
A system reservation is registered in the queue system to avoid starting jobs which cannot finish by 08:15 on the 7th of February. Jobs with longer walltimes that have already started and will not finish before the start of the downtime will be re-queued.
on behalf of the Fram HPC staff