Fram off-line: File system issues

Dear Fram Users,

The ongoing problems on FRAM, reported on July 1st, cause the error message “No space left on device” for various file operations.
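For users who want their job scripts to recognise this particular failure, the following is a minimal Python sketch. It assumes the error surfaces as a standard OSError with errno ENOSPC; the helper names is_no_space_error and try_write are hypothetical and not part of any Fram tooling.

```python
import errno
import os


def is_no_space_error(exc: OSError) -> bool:
    """Return True if exc carries errno ENOSPC ("No space left on device")."""
    return exc.errno == errno.ENOSPC


def try_write(path: str, data: bytes) -> str:
    """Write data to path and classify the outcome.

    Returns "ok" on success and "no-space" if the write fails with ENOSPC.
    Any other OSError is re-raised unchanged.
    """
    try:
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # force the data to the file system
    except OSError as exc:
        if is_no_space_error(exc):
            return "no-space"
        raise
    return "ok"
```

A job could call try_write on a small test file before starting a long computation and stop early if it returns "no-space".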

The problems are being investigated, and we will keep you updated on the progress.

UPDATE 2020-07-08 14:50: hugemem on Fram is now operating as normal.

UPDATE 2020-07-08 10:35: The file system issues have been resolved and we are operating as normal, with the exception of hugemem, which is still unavailable. Please let us know if you’re still experiencing problems. Again, we apologize for the inconvenience.

UPDATE 2020-07-08 09:00: Our vendor has corrected the filesystem bug and we should be operating as normal soon. At the moment we’re running some tests which will slow down current jobs running on Fram.

UPDATE 2020-07-07 15:35: The problem on Fram is caused by a bug in the Lustre filesystem. Our vendor is taking over the case to fix the issue. Thank you for your patience. We apologize for the inconvenience.

UPDATE 2020-07-07 09:50: We are still experiencing file system errors on FRAM, and are working to resolve the issue as soon as possible. Watch this space for updates.

UPDATE 2020-07-06 12:30: FRAM has been opened again.

UPDATE 2020-07-06 09:50: The file system is up and running. It appears to be stable, which the vendor has also verified. It should be possible to use FRAM within a couple of hours.

UPDATE 2020-07-03 17:10: The file system is up and running, but we have decided to keep the machine closed during the weekend so that we can be sure everything works as it should on Monday. Many of the recent FRAM downtimes were caused by storage hardware faults. We are investigating the issue together with the storage vendor.

UPDATE 2020-07-02 13:20: FRAM is off-line while we investigate the issues. The machine will probably stay off-line until tomorrow.

UPDATE 2020-07-02 12:10: The whole file system is still very unstable, and we will most likely have to take FRAM down. A Slurm reservation has been created, and all users may be logged out soon.

UPDATE 2020-07-02 11:15: The whole file system is still very unstable and we are trying to fix the problem.

Metacenter Operations

Fram: Lustre quota problem.

Dear Fram users,
We still have a Lustre quota problem on the Fram cluster, where the “dusage” command may give you inaccurate numbers.
To eliminate this issue we need a downtime of about 4 hours.

The date for the downtime has not been decided. We will give you an update as soon as we have more information.

In the meantime, if you have any quota-related problems on Fram, please contact us.

Downtime 20th – 24th of April is over. Services are back in production

All services on Fram and NIRD are now back in production, except for slurmbrowser and desktop.fram.sigma2.no.

Here is a list of what has been done during the last four days:

  • Firmware upgrade on NIRD in Trondheim and Tromsø
  • Firmware upgrade on NIRD Toolkit
  • Firmware upgrade on Fram storage, Fram nodes, switches, etc.
  • Software/OS upgrade on NIRD Trondheim and Tromsø
  • Software/OS upgrade on NIRD Toolkit
  • Software/OS upgrade on Fram nodes

In total, including vendors, about 15 people were involved in the upgrade.

We thank you for your patience.

tos-project3 on NIRD is read only

Due to underlying hardware issues, the tos-project3 filesystem has been set to READ-ONLY while we investigate the issue.

These are the projects affected:

NN9999K
NS1002K
NS4704K
NS9001K
NS9012K
NS9014K
NS9033K
NS9054K
NS9063K
NS9066K
NS9114K
NS9191K
NS9320K
NS9404K
NS9518K
NS9602K
NS9615K
NS9641K
NS9672K
NS0000K
NS1004K
NS9000K
NS9003K
NS9013K
NS9021K
NS9035K
NS9060K
NS9064K
NS9081K
NS9133K
NS9305K
NS9357K
NS9478K
NS9560K
NS9603K
NS9616K
NS9655K
NS9999K
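
To check from a script whether a given directory currently sits on a read-only file system, something like the sketch below may help. It assumes the read-only state is visible as a mount flag on the node where the script runs, which may not hold for every setup; the function name is_mounted_read_only is hypothetical, and the path you pass in is your own project directory.

```python
import os


def is_mounted_read_only(path: str) -> bool:
    """Return True if the file system containing path reports the read-only flag.

    Uses os.statvfs (Unix only) and tests the ST_RDONLY bit in f_flag.
    """
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)
```

A job can call this on its output directory before starting and skip or postpone work that needs to write there.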