Betzy down

[UPDATE, 2022-05-11:00]: Yesterday’s loss of power was due to a major power outage in the city of Trondheim.

[UPDATE, 2022-05-12 10:25]: Most nodes are now up and running as normal.

[UPDATE, 2022-05-12 08:50]: There was a power outage on Betzy at around 23:30 last night, which made all compute nodes go down. We are working on getting the nodes up and back into production now.

It appears that most or all of Betzy is down right now. We are investigating.

Maintenance Stops on Saga, Fram and Betzy

[Update, 2022-04-30 11:10] The Fram and Saga maintenance is now over, and jobs are running again.

[Update, 2022-04-29 08:00] The Fram and Saga maintenance has now started.

[Update, 2022-04-28 12:56] The Betzy maintenance is now over, and jobs are starting again.

[Update, 2022-04-28 08:00] The Betzy maintenance has now started.

Unfortunately, there will be maintenance stops on all NRIS clusters next week for an important security update. The maintenance stops will be:

  • Betzy: Thursday, April 28, at 08:00
  • Fram and Saga: Friday, April 29, at 08:00

We expect the stops will last a couple of hours. We have set up maintenance reservations on all nodes on the clusters, so jobs that would have run into the reservation will be left pending in the job queue until after the maintenance stop.
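As a rough illustration of the rule above, the following sketch checks whether a job with a given requested run time would end before a maintenance stop begins. It is only the date arithmetic, not the queue system's actual scheduling logic; the function name, the example times, and the choice to treat a job ending exactly at the start of the stop as fitting are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def fits_before_maintenance(now: datetime,
                            walltime: timedelta,
                            maintenance_start: datetime) -> bool:
    """Return True if a job started at `now` would end by `maintenance_start`.

    Assumption: a job ending exactly at the start of the stop still fits.
    """
    return now + walltime <= maintenance_start


# Betzy stop from the notice: Thursday, April 28, at 08:00
# (year 2022 taken from the update timestamps).
betzy_stop = datetime(2022, 4, 28, 8, 0)

# Hypothetical submission time, the evening before the stop.
now = datetime(2022, 4, 27, 20, 0)

print(fits_before_maintenance(now, timedelta(hours=6), betzy_stop))   # True
print(fits_before_maintenance(now, timedelta(hours=24), betzy_stop))  # False
```

In this example the 6-hour job could start right away, while the 24-hour job would stay pending until the reservation has ended.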

We are sorry for the inconvenience this creates. We had hoped to be able to apply the security update with jobs running, but that turned out not to be possible.

Betzy: Corrected GPU node config

The queue system configuration of the GPU nodes on Betzy had an error: the number of CPUs was set to 128 instead of 64. Most jobs were probably not affected by this, but some jobs may have got sub-optimal CPU pinning.

This has now been fixed, and the documentation has been updated. Users do not need to change their job scripts, unless they asked for more than 64 CPUs per node.
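To see whether a job falls under that exception, a quick check like the sketch below may help. It simply multiplies tasks per node by CPUs per task and compares the result with the corrected value of 64. The function and parameter names are hypothetical and are not the queue system's own option names; how a request maps to CPUs per node depends on the job script.

```python
# Corrected number of CPUs per GPU node on Betzy, as stated in the notice.
GPU_NODE_CPUS = 64


def exceeds_gpu_node_cpus(tasks_per_node: int,
                          cpus_per_task: int = 1,
                          node_cpus: int = GPU_NODE_CPUS) -> bool:
    """Return True if the per-node CPU request is larger than the node has.

    Assumption: the request is tasks_per_node * cpus_per_task CPUs per node.
    """
    return tasks_per_node * cpus_per_task > node_cpus


print(exceeds_gpu_node_cpus(4, 16))  # False: 64 CPUs, fits
print(exceeds_gpu_node_cpus(8, 16))  # True: 128 CPUs, needs adjusting
```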

Downtime on Saga and Betzy, Thursday, February 3

There will be a short maintenance stop of Saga and Betzy on Thursday, February 3, at 15:00 CET, due to work on the cooling system in the data hall. The downtime is planned to last three hours.

During the downtime, no jobs will run, but the login nodes and the /cluster file system will be up. Jobs that cannot finish before 15:00 on February 3 will be left pending in the queue until after the stop.