Betzy pre-production

Dear HPC User,

We are pleased to announce that Betzy opens for pre-production on Friday, 20 November.

Because the weekend is close, Betzy is being opened stepwise: first to the prior pilot projects, and then for general access on Tuesday, 24 November.

It has been a long journey, but we are happy to see good performance and stability on the system.

Please note that changes will be made to the queue system setup during the coming days, which may require cancelling running jobs.

Finally, support will be offered only from 24 November.

Thank you for your patience and we wish you happy computing!

Best regards,

Lorand Szentannai, on behalf of the preparations team

Updated information about Betzy production

Dear HPC User,

As mentioned last week, the validation benchmarks had been stable, and we were ready to run and evaluate the site acceptance test. Unfortunately, the interconnect stability issues recurred.

We and the vendor have been running extensive tests since then. The R&D department of the interconnect vendor released new firmware yesterday afternoon. It was applied yesterday evening, and stress tests started immediately. To be sure that the problem is resolved, several days of testing are needed.

Therefore, we have to postpone production by yet another week. The current production estimate is the end of week 47.

We can assure you that we are very eager to have the system 100% stabilized and in production, and that everybody involved in the project (be it from Sigma2, the Metacenter, or the vendor) is working intensively on this.

Thank you for your understanding!

Best regards,

Lorand Szentannai, on behalf of the preparations team

Information regarding Betzy production

Dear HPC User,

Our previous estimate of production on Betzy has proved to be somewhat optimistic. 

With the help of the vendor, we believe we have identified and fixed the cause of the interconnect stability problem on Betzy. The most recent validation benchmarks have been stable, and we will begin the site acceptance test (SAT) by Friday, 6 November. If the machine passes the SAT, it will be handed over to operations and opened for production.

The final preparations usually take 1-3 days. We therefore estimate that production will begin on Betzy during next week, week 46.

Best regards,

Lorand Szentannai, on behalf of the preparations team

Estimated production date for Betzy

Dear HPC user,

Our newest supercomputer, Betzy, is unfortunately delayed in entering production due to circumstances outside our control.

We have had significant delays in getting all the components in place due to logistics problems caused by the Covid pandemic. However, approximately 94% of the system capacity is now installed and configured. Work is ongoing to prepare the remaining system capacity in the upcoming weeks.

Benchmarks and pilot testing on Betzy have revealed an intermittent stability problem with the node interconnect. The vendor has been investigating the issue for the past two weeks in order to identify its source. Our new best estimate is that Betzy will go into production in week 45.

This has consequences for the decommissioning of Vilje and Stallo because we rely on Betzy to take over computational load from the other machines. Thus, the new decommissioning date for Vilje and Stallo is 1 December. We would like the machines to be fully utilized until they are decommissioned, and therefore encourage you to continue using Vilje and Stallo if you still have the opportunity.

Thank you for your understanding!

Best regards,
Lorand Szentannai, on behalf of the preparations team

Betzy access closed, preparing for production

UPDATE:

  • 08.10.2020: After extensive testing, the vendor found that stability issues are unfortunately still present. The problem has been escalated and is under investigation. We will get back to you with more information as soon as we get an update from the vendor.
  • 30.09.2020: The vendor will carry out firmware updates on Betzy today. As a consequence, we need to stop running jobs and run tests to make sure the system is stable.
    Access to the machine will be reopened as soon as the tests are complete. Please follow the progress here, on OpsLog.
  • 25.09.2020: We are temporarily reopening access over the weekend in order to allow further testing on the machine.
    Further work by the vendor is expected sometime next week. As a consequence, jobs will be terminated again and access closed while maintenance is ongoing.

Dear Betzy pilots,

We are pleased to announce that, despite the logistics challenges caused by Covid-19, most of the outstanding issues have been sorted out. This unusual situation required a more dynamic approach from everyone involved, and the uncertainties and rapidly changing circumstances put pressure on communication. Because of this, setting and announcing a production date proved to be difficult.

We can now aim to put Betzy into production at the beginning of October. Before we can conclude and proceed with the preparations, we need to re-run several comprehensive tests.

Therefore, we will have to stop all jobs and close access to Betzy starting tomorrow, 17 September 2020, at 10 AM. Access to Betzy will be re-established as soon as all the tests are completed. Please be prepared for a more extensive maintenance this time, which might require up to two and a half weeks.

The file system on Betzy is not going to be reformatted. That is, your data will not be removed intentionally. However, we cannot guarantee data integrity until backups are taken and the machine is placed into production. Therefore, we strongly advise you to take a backup of your important data as a precaution.

Apologies for the short notice and the inconvenience this causes you.

Best regards,

Lorand Szentannai, on behalf of the preparations team

Saga – poor file system performance

The parallel file system on Saga is currently under a lot of stress caused by the running jobs.

We are working on optimizing and speeding up the file system together with the vendor.
In the meantime, we kindly ask you to follow the guidelines listed on our documentation pages.

As a general rule:

  • file system performance decreases as file system usage grows
  • the number of I/O operations directly influences the responsiveness of the file system
  • disk operations are a factor of a thousand more expensive than memory operations
  • the higher the number of files, the slower the I/O

Thank you for your understanding!

Metacenter Operations

$USERWORK auto-cleanup on Saga

Dear Saga User,

The usage of the /cluster file system on Saga has now passed 60%. To keep the file system as responsive as possible, we have to periodically decrease the number of files, free up space and enforce automatic deletion of temporary files.

Starting Wednesday, 19 February, we will activate the automatic cleanup of the $USERWORK (/cluster/work) area as documented here.

The retention period is:

  • 42 days while file system usage is below 70%
  • 21 days once file system usage reaches 70%

Files older than the active retention period will be automatically deleted.
You can read more information about the storage areas on HPC clusters here and here.

Please copy all your important data from $USERWORK to your project area to avoid data loss.
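
To see which files are at risk before the cleanup runs, a minimal sketch along the following lines may help. It assumes the cleanup goes by modification time, which this announcement does not state, and list_old_files is a hypothetical helper name, not a tool installed on Saga.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list regular files under a directory that were
# last modified more than N days ago.
# Assumption: the cleanup is based on modification time; the announcement
# does not say which timestamp is actually used.
#   $1 = directory to scan, $2 = age threshold in days
list_old_files() {
    local dir="$1" days="$2"
    # find rounds file ages down to whole days, so "+21" matches files
    # that are at least 22 full days old.
    find "$dir" -type f -mtime +"$days" -print
}
```

To check against the stricter 21-day period, call

list_old_files "$USERWORK" 21

and copy anything important from the output to your project area.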

Thank you for your understanding!

Metacenter Operations

NIRD project file systems mounted on Saga

Dear Saga User,

We are pleased to announce that all the technical requirements have now been met and the NIRD project file systems are mounted on the Saga login nodes.

You may find your projects in the

/nird/projects/nird

folder.

Please note that transferring a large number of files is slow and has a big impact on I/O performance. It is always better to transfer one large file than many small files.
As an example, transferring a folder with 70k entries totalling about 872MB took 18 minutes, while the same files archived into a single 904MB file took 3 seconds.

You can read more about the tar archiving command in its manual page. Type

man tar

in your Saga terminal.
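
As a minimal sketch of the archive-first approach described above, the helper below packs a directory into a single tar file before it is copied. The name pack_dir and the paths in the usage note are hypothetical; adapt them to your own project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pack a directory into a single tar archive so that
# one large file is transferred instead of many small ones.
#   $1 = directory to pack, $2 = archive file to create
pack_dir() {
    local src="$1" archive="$2"
    # -c creates an archive, -f names the archive file, and -C switches to
    # the parent directory so the archive stores paths relative to it.
    tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}
```

A possible use, assuming a results directory under $USERWORK and a project folder named NSxxxxK under /nird/projects/nird:

pack_dir "$USERWORK/results" "$USERWORK/results.tar"
cp "$USERWORK/results.tar" /nird/projects/nird/NSxxxxK/

The archive can be unpacked at the destination with tar -xf results.tar.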

Metacenter Operations

Reorganized NIRD storage

Dear NIRD User,

During the last maintenance, we reorganized the NIRD storage.

Projects now have a so-called primary site, which is either Tromsø or Trondheim. Previously there was a single primary site, Tromsø. This change had to be introduced to prepare for coupling the NIRD storage with the Saga and the upcoming Betzy HPC clusters.

While we are working on a final, seamless access solution regardless of the primary site for your data, please use the following temporary solution:


To work as close as possible to your data, connect to the login nodes located at the primary site of your project:

  • for Tromsø the address is unchanged and is login.nird.sigma2.no
  • for Trondheim the address is login-trd.nird.sigma2.no

To find out the primary site of your project, log in on a login node and type:

readlink /projects/NSxxxxK

It will print out a path starting either with /tos-project or /trd-project.
If it starts with “tos” then use login.nird.sigma2.no.
If it starts with “trd” then use login-trd.nird.sigma2.no.
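
The rule above can be sketched as a small shell function. The name nird_login_host is hypothetical; the prefixes and host names are the ones listed in this announcement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the path printed by readlink to the login
# node at the project's primary site.
#   $1 = path as printed by: readlink /projects/NSxxxxK
nird_login_host() {
    case "$1" in
        /tos-project*) echo "login.nird.sigma2.no" ;;      # Tromsø
        /trd-project*) echo "login-trd.nird.sigma2.no" ;;  # Trondheim
        *) echo "unknown primary site: $1" >&2; return 1 ;;
    esac
}
```

On a login node it could be used as

nird_login_host "$(readlink /projects/NSxxxxK)"

with NSxxxxK replaced by your own project number.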

Metacenter Operations

Network outage

Update

  • 2020-01-13 14:54: The problems have now been sorted out and the network is functional again.
  • 2020-01-13 14:40: The problems are unfortunately back again. Uninett’s network specialists are working on solving the problem as soon as possible.
  • 2020-01-13 14:22: The network is functional again. Apologies for the inconvenience this has caused.

We are currently experiencing a network outage on Saga and some parts of NIRD. The problem is under investigation.

Please check back here for an update on this matter.

Metacenter Operations