Saga – poor file system performance

The parallel file system on Saga is currently under heavy load from the running jobs.

We are working together with the vendor on optimizing and speeding up the file system.
In the meantime we kindly ask you to follow the guidelines listed on our documentation pages.

As a general rule:

  • file system performance decreases as the file system usage grows
  • the number of I/O operations directly influences the responsiveness of the file system
  • disk operations are roughly a thousand times more expensive than memory operations
  • the more files you have, the slower the I/O is (see the example below)
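
For example, a quick way to check how many files a directory tree contains before starting I/O-heavy work is the standard find command (a plain shell sketch; the path below is hypothetical):

# count the files under a hypothetical run directory
find /cluster/work/users/$USER/myrun -type f | wc -l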

Thank you for your understanding!

Metacenter Operations

$USERWORK auto-cleanup on Saga

Dear Saga User,

The usage of the /cluster file system on Saga has now exceeded 60%. To keep the file system as responsive as possible, we have to periodically decrease the number of files, free up space and enforce automatic deletion of temporary files.

Starting on Wednesday, the 19th of February, we are going to activate the automatic cleanup of the $USERWORK (/cluster/work) area as documented here.

The retention period is:

  • 42 days below 70% file system usage
  • 21 days when file system usage reaches 70%

Files older than the active retention period will be automatically deleted.
You can read more about the storage areas on HPC clusters here and here.

Please copy all your important data from $USERWORK to your project area to avoid data loss.
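
A minimal sketch of such a copy, assuming a hypothetical results folder and project area (replace nnXXXXk with your own project):

# copy results from the temporary work area to the project area
rsync -av $USERWORK/results/ /cluster/projects/nnXXXXk/results/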

Thank you for your understanding!

Metacenter Operations

NIRD project file systems mounted on Saga

Dear Saga User,

We are pleased to announce that all technical requirements have now been met and the NIRD project file systems are mounted on the Saga login nodes.

You may find your projects in the

/nird/projects/nird

folder.

Please note that transferring a large number of files is sluggish and has a big impact on I/O performance. It is always better to transfer one larger file than many small files.
As an example, transferring a folder with 70k entries totalling about 872MB took 18 minutes, while transferring the same files archived into a single 904MB file took 3 seconds.
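
A minimal sketch of archiving a folder before transferring it (the folder and archive names are hypothetical):

# pack the folder into a single compressed archive and transfer that instead
tar -czf mydata.tar.gz mydata/
# unpack it again at the destination
tar -xzf mydata.tar.gz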

You can read more about the tar archiving command in its manual pages. Type

man tar

in your Saga terminal.

Metacenter Operations

Reorganized NIRD storage

Dear NIRD User,

During the last maintenance we have reorganized the NIRD storage.

Projects now have a so-called primary site, which is either Tromsø or Trondheim. Previously we had a single primary site, Tromsø. This change had to be introduced to prepare for coupling the NIRD storage with the Saga and the upcoming Betzy HPC clusters.

While we are working on a final, seamless access solution regardless of the primary site for your data, please use the following temporary solution:

To work closest to your data, you have to connect to the login nodes located at the primary site of your project:

  • for Tromsø the address is unchanged and is login.nird.sigma2.no
  • for Trondheim the address is login-trd.nird.sigma2.no

To find out the primary site of your project, log in to a login node and type:

readlink /projects/NSxxxxK

It will print out a path starting either with /tos-project or /trd-project.
If it starts with “tos” then use login.nird.sigma2.no.
If it starts with “trd” then use login-trd.nird.sigma2.no.
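
As a small sketch, the check and the choice of login node can be combined in one step (NS1234K is a hypothetical project number; use your own):

# print which login node to use, based on the project's primary site
case "$(readlink /projects/NS1234K)" in
  /tos-project*) echo "use login.nird.sigma2.no" ;;
  /trd-project*) echo "use login-trd.nird.sigma2.no" ;;
esac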

Metacenter Operations

Network outage

Update:

  • 2020-01-13 14:54: Problems have been sorted out now and network is functional again.
  • 2020-01-13 14:40: Problems are unfortunately back again. Uninett’s network specialists are working on solving the problem as soon as possible.
  • 2020-01-13 14:22: Network is functional again. Apologies for the inconvenience it has caused.

We are currently experiencing a network outage on Saga and some parts of NIRD. The problem is under investigation.

Please check back here for an update on this matter.

Metacenter Operations

NIRD and NIRD Toolkit scheduled maintenance

Update:

  • 2020-01-23 17:30: Services are now progressively restarted.
  • 2020-01-22 21:49: We have detected file-system-level corruption and, to avoid data corruption, we had to unmount and rescan all the file systems (about 18PB) on NIRD.
    We are currently working on bringing the services on the NIRD Toolkit back up.
  • 2020-01-22 11:11: Software and firmware are now upgraded on the NIRD Toolkit.
    Most of the fileset changes are also carried out. We are currently working on the last bits. We will keep you updated.
  • 2020-01-20 08:58: Maintenance has started. NIRD file systems are unmounted from Fram until maintenance is finished.

Dear NIRD and NIRD Toolkit User,

We will have a three-day scheduled maintenance on NIRD and the NIRD Toolkit starting on the 20th of January at 09:00.

During the maintenance we will:

  • carry out software and firmware updates,
  • change geo-locality for some of the projects,
  • replace synchronization mechanisms,
  • depending on part delivery times from the disk vendor, expand the storage and quotas.

Files stored on NIRD will be unavailable during the maintenance, and therefore so will the services. This will of course also affect the NIRD file systems available on Fram.

Please note that backups taken from the Fram and Saga HPC clusters will also be affected and will be unavailable during this period.

Please accept our apologies for the inconvenience this downtime is causing.

Metacenter Operations

Backup policy changes

Dear Fram and Saga User,

As you may remember, quotas were enforced on $HOME areas during October last year. This was carried out only for users having less than 20GiB in their Fram or Saga home folders.

Because of repeated space-related issues on the NIRD home folders, and due to the backups taken from either Fram or Saga, we had to change our backup policies and exclude from backup users using more than 20GiB in their Fram or Saga $HOME areas.

If you manage to clean up your $HOME on Fram and/or Saga and decrease your $HOME usage below 20GiB, we can then enforce quotas on it and re-enable the backup.
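
A minimal sketch of how to find what is using the space in your home folder, using standard tools rather than any site-specific command:

# total usage of your home directory
du -sh $HOME
# the largest first-level entries, sorted by size
du -sh $HOME/* | sort -rh | head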

Thank you for your understanding!

Metacenter Operations

Quota enforcement on Fram

Dear Fram User,

We have fixed the broken quota indexes on Fram /cluster file system.
Due to heavy disk usage on both Fram and NIRD, we need to enforce quotas on all relevant areas:

  • /cluster/home
  • /cluster/projects
  • /cluster/shared

To be able to control disk usage on the areas mentioned above, group ownerships are enforced nightly.

To avoid job crashes, please make sure before starting jobs that the unix users and groups you are a member of have enough free quota, be it block or inode quota.

This Wikipedia page gives a good explanation of the block and inode quota types.

To check quotas on Fram, you may use the dusage command, for example:

# list all user and group quotas
dusage -a
# for help use
dusage -h
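
To see which unix groups your account belongs to, and therefore which group quotas may apply to your jobs, you can use the standard id command (a general Linux tool, not a Fram-specific one):

# list the groups your user is a member of
id -Gn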

Thank you for your understanding!
Metacenter Operations

Planned maintenance on Fram on 28.11.2019

Update:

  • 2019-11-28 15:00: Fram is open for production and scheduled jobs have been resumed.
  • 2019-11-28 14:33: For more information about enforced file system quota accounting, please see this announcement.
  • 2019-11-28 13:26: Lustre quota indexes are fixed now. Running some additional cleanup and tests.
  • 2019-11-28 08:03: Maintenance has started.

Dear Fram User,

The slave quota indexes on the /cluster file system got broken, and as a result the quota accounting has become unpredictable. For the same reason, some of you may even have experienced job crashes.

To fix this, we will have a two-day planned downtime starting at 08:00 on the 28th of November.

Fram jobs which cannot finish by the 28th of November are queued up and will not start until the maintenance is finished.

Thank you for your consideration!

Metacenter Operations

Fram down

Update:

  • 2019-11-13 16:15: Fram is up and running again. One of the cooling units stopped, causing the other to also stop, and all compute nodes went down.

Dear Fram User,

Fram is currently down, likely due to issues with the cooling distribution unit.
We are currently investigating the issue and working on placing Fram back into production.

Apologies for the inconvenience!

Metacenter Operations