Dear NIRD and NIRD Toolkit User,
After a prolonged downtime caused by system failures beyond our control and outside our area of responsibility, access to NIRD has finally been reopened.
The vendor has replaced the failing hardware and we are finally back online. Some disk pools are still being rebuilt and should be finished in a few hours. Until then, you may encounter reduced performance.
We will proceed with bringing up the Service Platform during the day.
Thank you for your understanding and patience!
2019-02-13 10:00: Stallo is up and running and available for general use again.
Stallo will be shut down at 14:00 due to electrical work in the building housing Stallo.
- 2019-02-26 08:57: Fram parts have arrived and were installed yesterday. The vendor will start rebuilding the disk pools on Fram today.
- 2019-02-22 09:10: Access to NIRD is reopened. We will proceed with bringing up the Service Platform during the day. Some disk pools are still being rebuilt and should be finished in a few hours. Until then, you may encounter reduced performance.
- 2019-02-21 15:40: Most of the disk rebuilds on NIRD are finished. The NIRD file system has been started, and we will proceed with opening access to NIRD as soon as the remaining network issues on the backbone are sorted out.
Due to logistics issues, the Fram parts are held up at customs. The vendor has sent a second batch of parts through another logistics company; the ETA is Monday morning.
We will post a message when access to NIRD is re-opened.
- 2019-02-19 14:10: Disk rebuilds on NIRD have reached 40%. Current ETA for NIRD is Thursday morning.
We are still waiting for the Fram parts to arrive in Tromsø.
- 2019-02-18 16:36: Some disk pools on NIRD must be rebuilt once again, delaying the reopening once more. We are terribly sorry for this. We will continue updating this log as soon as new information is available.
- 2019-02-18 10:10:
NIRD storage is now stabilized, and the vendor will make a new attempt to bring the system back online today. At this stage it is still uncertain when Fram can be put back into production.
- 2019-02-15 11:18: We are still experiencing problems with the storage system on Fram. Disks began to fail en masse once again after the system had seemed stable during the night. We depend on the vendor to resolve these issues and are working closely with them.
Given this new instability, we cannot give an estimate for when the system will be ready for general use again. This is an unfortunate situation, and we understand its impact on you; we are therefore trying all possible solutions to keep your data safe and bring the system back up as soon as possible. The OpsLog will be updated with new information when the status of the situation changes.
- 2019-02-14 13:17: Due to missing parts, and the size of the storage, disk recovery is progressing slowly, at approximately 50% reduced performance. Current ETAs are:
- Fram: 15.02.2019
- NIRD: 19.02.2019
- Service Platform: 19.02.2019
- 2019-02-13 19:07: Communication with the missing storage enclosures was re-established, and the disk pools are rebuilding at this time. Unfortunately, we cannot reopen the machines until the disk pools have stabilized. We will do a new round of checks and risk analysis tomorrow morning and will keep you updated here.
- 2019-02-13 11:33: Some of the parts have arrived at the datacenter, and we are working with the vendor on replacing hardware and patching the firmware on Fram. More details to follow as we know more.
- 2019-02-12 15:38: The NIRD Tromsø and Fram storage systems each have one failed disk enclosure. We are waiting for replacement parts to arrive. After the replacement, we will have to rebuild the disk pools before reopening the machines for production. The current estimate is tomorrow evening. We will keep you updated.
- 2019-02-12 12:36: The firmware upgrade on NIRD is finished. We are proceeding to bring NIRD services back up. We will keep you posted.
- 2019-02-12 08:17: Maintenance has started.
- 2019-02-11 13:20: Because the disk problems accelerated during the weekend, we have now changed the maintenance-stop reservation so that no new jobs will start until the maintenance is done. Already running jobs will not be affected. This has been done to reduce the risk of data loss.
We need to have a scheduled downtime on relatively short notice in order to upgrade the firmware on both the Fram and NIRD (including NIRD Toolkit) storage systems.
This is a critical and mandatory update which will increase stability, performance and reliability of our systems.
The downtime is expected to last no more than a working day.
Fram jobs that cannot finish by the 12th of February will be held in the queue and will not start until the maintenance is finished.
Thank you for your understanding!
The electrical work in the building housing Stallo revealed a defective residual-current device (RCD). It has to be replaced; an order for a new device has been placed, and once it arrives the power in the building must be cut to install it.
While we (the HPC staff) are waiting for more information, it has been decided that jobs that cannot finish by February 8th at 08:00 will not start.
Stallo is now up again and available for normal use.
The electrical work has run into unforeseen problems: one generator will not power on. The work will continue further into this evening.
We are experiencing an unstable network connection between NIRD and Fram. On the Fram login nodes, some of the project areas might be missing.
Please check both login nodes (login-1-1, login-1-2) for your project area. Currently, all project areas are mounted on login-1-2.
We are working on this issue and will keep you updated.
We apologize for any inconvenience this may have caused you.
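In the meantime, a simple directory test can tell you whether a given project area is visible on the login node you are on. This is only a sketch: the project path below is hypothetical (replace it with your own), and the demonstration runs on a directory that always exists.

```shell
# Report whether a project area path is visible on this node.
# The path /nird/projects/NSxxxxK is a placeholder -- use your own.
check_mount() {
    if [ -d "$1" ]; then
        echo "mounted"
    else
        echo "missing"
    fi
}

# On each login node (login-1-1, login-1-2) you would run, e.g.:
#   check_mount /nird/projects/NSxxxxK
# Demonstrated here on a directory that always exists:
check_mount /tmp
```

Run this on each login node in turn; if your project area reports "missing" on one node, try the other while the network issue persists.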
As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.
This has enabled us to replace the limit of how many jobs per user the system will try to backfill with a more flexible setting.
Jobs are started and backfilled in priority order, and a job's priority increases the longer it is pending. With the new setting, only a fixed number of each user's jobs within a project will accrue priority. The number is currently 10, but will probably be adjusted over time. This makes it easier for all users and projects to get jobs started when one user has submitted a large number of jobs in a short time, while at the same time not preventing jobs from starting when there are free resources.
Note that the setting is per user and per project, so if more than one user submits jobs in the same project, each of them will get 10 jobs with increasing priority; similarly, if one user submits jobs to several projects, that user will get 10 priority-accruing jobs in each project.
The setting is documented here.
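As an illustration only (this is not Slurm's own code), the effect of the per-user, per-project accrual limit described above can be sketched like this:

```shell
# Illustration of the accrual limit described above (not Slurm internals):
# with a limit of 10, only a user's 10 highest-ranked pending jobs in a
# project keep accruing priority; the rest wait until earlier jobs start.
ACCRUE_LIMIT=10

accrual_status() {
    # $1 = a job's rank among the user's pending jobs in one project,
    #      ordered by current priority (1 = highest)
    if [ "$1" -le "$ACCRUE_LIMIT" ]; then
        echo "accruing"
    else
        echo "frozen"
    fi
}

# Example: one user with 15 pending jobs in a single project
for rank in $(seq 1 15); do
    echo "job rank $rank: $(accrual_status "$rank")"
done
```

To see the real values on Fram, you can inspect your jobs' current priorities with `squeue -u $USER -o "%.10i %.9Q %T"` (the `%Q` field is the priority) or see the per-factor breakdown with `sprio -u $USER`.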
The electricians did not manage to finalize their work on the electrical power infrastructure in the previous time slot (January 10-11th).
So we need to power off the machine again on January 31st. All jobs with a walltime extending beyond the start of the power-off will be held pending in the queue until the system is up and running again.
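The hold rule amounts to a simple time comparison: a job can only start now if its requested walltime ends before the power-off begins. A rough sketch, with illustrative timestamps (requires GNU `date`):

```shell
# Sketch of the hold rule (illustrative timestamps, GNU date required):
# a job can start only if its requested walltime ends before the power-off.
NOW=$(date -u -d "2019-01-30 08:00" +%s)        # hypothetical current time
POWEROFF=$(date -u -d "2019-01-31 08:00" +%s)   # scheduled power-off

can_start() {
    # $1 = requested walltime in hours
    end=$((NOW + $1 * 3600))
    if [ "$end" -le "$POWEROFF" ]; then
        echo "starts"
    else
        echo "held"
    fi
}

can_start 12   # fits before the power-off
can_start 48   # would run past it, so it is held until after maintenance
```

If your job is flexible, requesting a shorter walltime (so it finishes before the power-off) lets it start in the remaining window rather than wait out the maintenance.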