login-2 on Saga is down

login-2.saga.sigma2.no is currently down due to several faulty memory DIMMs. This also affects the use and functionality of the Slurm browser and desktop.saga.sigma2.no.

We hope to have the DIMMs replaced some time during the week.

Maintenance on NIRD, NIRD Toolkit, Fram and Saga, 20th April - 24th April

Dear NIRD, NIRD Toolkit, Fram and Saga User,

We will have a four day long scheduled maintenance on NIRD, NIRD Toolkit, Fram and Saga starting on the 20th of April, 09:00 AM.

During the maintenance we will:

  • carry out software and firmware updates on all systems

Files stored on NIRD will be unavailable during the maintenance, and therefore so will the services relying on them. This will of course also affect the NIRD file systems available on Fram and Saga.

Login services to NIRD, the NIRD Toolkit, Fram and Saga will be disabled during the maintenance.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

Saga – poor file system performance

The parallel file system on Saga is currently under a lot of stress caused by the running jobs.

We are working together with the vendor on optimizing and speeding up the file system.
In the meantime we kindly ask you to follow the guidelines listed on our documentation pages.

As a general rule:

  • file system performance decreases as the file system usage grows
  • the number of I/O operations directly influences the responsiveness of the file system
  • disk operations are roughly a thousand times more expensive than memory operations
  • the higher the number of files, the slower the I/O (see the example after this list)
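
To illustrate the last two points, one common remedy is to pack many small files into a single archive before moving or processing them, which sharply reduces the number of I/O operations hitting the shared file system. A minimal sketch, where the directory and archive names are only examples:

# count the files in a directory; many small files mean slow I/O
find my_results/ -type f | wc -l

# pack them into one archive; a single large file is far cheaper to handle
tar -czf my_results.tar.gz my_results/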

Thank you for your understanding!

Metacenter Operations

FRAM – critical storage issue

UPDATE:

  • 2020-03-12 10:45: Maintenance is finished now and faulty components were replaced. We continue to monitor the storage system.
    Thank you for your understanding.
  • 2020-03-11 10:16: We have to replace one hardware module on the Fram storage system. The maintenance will be carried out with the system kept online. However, there will be a short hiccup of up to 5 minutes while we fail over components on the redundant path, possibly causing some jobs to crash.
  • 2020-03-05 20:30: Maintenance is over, Fram is online. Jobs that were running before the maintenance may have been re-queued. It is also possible that some jobs were killed; we are sorry for that. If this is the case, you will have to resubmit your job.

Dear FRAM users,

We are facing a major issue with FRAM’s storage system. The necessary tasks are being performed to mitigate the issue. We will have to take the whole machine offline to be able to perform these tasks.

FRAM – file system issue

Dear FRAM user,
We are facing some minor issues with FRAM’s file system. The necessary tasks are being performed to mitigate them.

This should not cause any downtime.

UPDATE 28.02, 13:20: We lost the file system for a couple of minutes. Please check your jobs and get back to us in case of any problems.

Thank you for your patience and understanding.

NIRD: file system problems

Dear NIRD user,
We have had serious problems with the GPFS file systems this afternoon and had to stop the storage and all the services.

The NIRD storage and the NIRD Toolkit are now back online.
Please notify the metacenter support if you notice any remaining issues.

We are very sorry for the inconvenience.

Update 10:00 24.02.2020: We still have a problem with the NIRD mount points on Fram. We are working on it and will keep users posted here.

Update 10:45 24.02.2020: The problem with the NIRD mount points on Fram is resolved.

Fram: Interconnect network manager crashed

Dear Fram users:

Fram’s interconnect network manager crashed yesterday at 15:34, which left all compute nodes with degraded routing information. This can cause Slurm jobs to crash with a communication error.
The interconnect network manager is running again, all compute nodes have the latest routing information, and communication between the compute nodes is restored.
We apologize for the inconvenience. If you have any questions, please don’t hesitate to contact support.

$USERWORK auto-cleanup on Saga

Dear Saga User,

The usage of the /cluster file system on Saga has now exceeded 60%. To keep the file system as responsive as possible, we have to periodically decrease the number of files, free up space and enforce automatic deletion of temporary files.

Starting Wednesday, the 19th of February, we are going to activate the automatic cleanup of the $USERWORK (/cluster/work) area as documented here.

The retention period is:

  • 42 days below 70% file system usage
  • 21 days when file system usage reaches 70%.

Files older than the active retention period will be automatically deleted.
You can read more information about the storage areas on HPC clusters here and here.

Please copy all your important data from $USERWORK to your project area to avoid data loss.
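
As a rough sketch of how to check what is at risk and copy it to safety (the directory names and the nnXXXXk project ID below are placeholders; use your own project), you could do something along these lines:

# list files in $USERWORK older than 21 days, i.e. candidates for automatic deletion
find $USERWORK -type f -mtime +21

# copy important results to your project area before they are cleaned up
rsync -av $USERWORK/important_results/ /cluster/projects/nnXXXXk/important_results/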

Thank you for your understanding!

Metacenter Operations

NIRD project file systems mounted on Saga

Dear Saga User,

We are pleased to announce that all technical requirements are now in place and the NIRD project file systems are mounted on the Saga login nodes.

You may find your projects in the

/nird/projects/nird

folder.

Please note that transferring a large number of files is sluggish and has a big impact on the I/O performance. It is always better to transfer one large file than many small files.
As an example, transferring a folder with 70k entries and about 872MB took 18 minutes, while the same files archived into a single 904MB file took 3 seconds.

You can learn more about the tar archiving command in its manual pages. Type

man tar

in your Saga terminal.
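
As a minimal example (the folder and archive names are made up), archiving a folder before transferring it and unpacking it again afterwards could look like this:

# pack the folder into a single archive before the transfer
tar -cf my_project.tar my_project/

# unpack it again at the destination
tar -xf my_project.tar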

Metacenter Operations