Change to the “optimist” jobs

The requirements for specifying optimist jobs have changed. It is now required to also specify --time. (Previously, this was neither needed nor allowed.) The documentation will be updated shortly.

(The reason for the change is that we discovered that optimist jobs often would not start properly without the --time specification. This was not discovered earlier because so few projects were using optimist jobs.)
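For those who want to check existing job scripts against the new requirement, here is a minimal sketch of a pre-submission check. It assumes that optimist jobs are requested with a line such as `#SBATCH --qos=optimist`, which is not stated in this post, so adjust `is_optimist` if your scripts select optimist jobs differently. The function names and the account name in the sample script are hypothetical.

```python
def sbatch_options(script_text):
    """Collect the option tokens from all '#SBATCH' lines of a job script."""
    tokens = []
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("#SBATCH"):
            # Keep only the part before any trailing comment.
            options = line[len("#SBATCH"):].split("#")[0]
            tokens.extend(options.split())
    return tokens


def is_optimist(tokens):
    # ASSUMPTION: optimist jobs are selected with --qos=optimist.
    return "--qos=optimist" in tokens


def has_time_limit(tokens):
    """True if a time limit is given as --time, --time=... or -t..."""
    for tok in tokens:
        if tok == "--time" or tok.startswith("--time="):
            return True
        if tok.startswith("-t") and not tok.startswith("--"):
            return True
    return False


def check_optimist_script(script_text):
    """Return a warning string if an optimist job lacks --time, else None."""
    tokens = sbatch_options(script_text)
    if is_optimist(tokens) and not has_time_limit(tokens):
        return "optimist job is missing --time"
    return None


# Hypothetical job script; 'nn0000k' is a placeholder account name.
sample = """#!/bin/bash
#SBATCH --account=nn0000k
#SBATCH --qos=optimist
#SBATCH --time=01:00:00
"""
print(check_optimist_script(sample))
```

The check only looks at `#SBATCH` lines inside the script, so options passed on the `sbatch` command line are not covered.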

UPDATED: Problem with Access to Projects

Quite a few users lost access to their project(s) on NIRD and all clusters over the weekend. This was due to a bug in the user administration software. The bug has been identified, and we are working on rolling back the changes.

We will update this page when access has been restored.

Update 12:30: Problem resolved. Project access has now been restored. If you still have problems, please contact support at sigma@uninett.no

Update: This applies to all systems, not only Fram and Saga.

Fram queue system adjustment

As part of the ongoing work of tuning and improving the queue system on Fram for best resource usage and user experience, we recently upgraded Slurm to version 18.08.

This has enabled us to replace the limit on how many jobs per user the system will try to backfill with a more flexible setting.

Jobs are started and backfilled in priority order, and the priority of a job increases with time while it is pending. With the new setting, only a fixed number of each user’s jobs within a project will increase in priority. The number is currently 10, but will probably be adjusted over time. This setting will make it easier for all users and projects to get jobs started when one user has submitted a large number of jobs in a short time, while at the same time not preventing jobs from starting when there are free resources.

Note that the setting is per user and project, so if more than one user submits jobs in the same project, each of them will get 10 jobs whose priority increases. Similarly, if one user submits jobs to several projects, the user will get 10 jobs whose priority increases in each project.
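The per-user, per-project counting can be illustrated with a small sketch. This is not how Slurm implements the setting. It only models the rule described above, and it assumes that the jobs allowed to gain priority are the earliest submitted ones for each user and project, which this post does not state. The function and variable names are made up for the illustration.

```python
from collections import defaultdict

AGING_LIMIT = 10  # current value according to this post; may be adjusted


def jobs_gaining_priority(pending_jobs, limit=AGING_LIMIT):
    """Return the ids of pending jobs whose priority increases with time.

    pending_jobs is a list of (job_id, user, project) tuples, assumed
    to be in submission order. At most `limit` jobs per (user, project)
    pair are selected.
    """
    counts = defaultdict(int)
    selected = set()
    for job_id, user, project in pending_jobs:
        key = (user, project)
        if counts[key] < limit:
            counts[key] += 1
            selected.add(job_id)
    return selected


# Hypothetical queue: alice has 12 jobs in projA and 3 in projB,
# bob has 11 jobs in projA.
queue = (
    [(f"a{i}", "alice", "projA") for i in range(12)]
    + [(f"b{i}", "alice", "projB") for i in range(3)]
    + [(f"c{i}", "bob", "projA") for i in range(11)]
)
gaining = jobs_gaining_priority(queue)
print(len(gaining))  # 10 + 3 + 10 = 23
```

In this example, alice's last two jobs in projA and bob's last job in projA do not gain priority while they wait, but they can still start if there are free resources.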

The setting is documented here.

Slurm Upgrade on Fram

Update: The upgrade is now done. All seems to have gone well, but it can be a good idea to check your jobs.

We will upgrade the queue system (Slurm) on Fram at 11:00 today, from version 17.11 to 18.08. The upgrade is expected to take 5-10 minutes. During that time, queue system commands (squeue, sbatch, etc.) will not work, but running jobs should not be affected.

Emergency Stop of Fram, NIRD and the Service Platform

Update:

  • 2018-11-07 16:36: We will have to upgrade the firmware on all the storage enclosures in NIRD and rebuild the failed volumes. We will keep you updated and reopen access to NIRD and the Service Platform as soon as the emergency maintenance is complete.
  • 2018-11-07 14:53: User home directories have been migrated to /cluster/home and Fram is starting up again. We will soon reopen access to Fram. Please note that NIRD project areas will _not_ be available until NIRD is up again.

Due to disk failures on NIRD, we have to shut down Fram, NIRD and the Service Platform immediately to avoid losing user data. This means stopping all jobs and user processes, and logging users out of the systems.

We will try to copy the home directories from NIRD to Fram so that Fram can be started without mounting NIRD. If this is successful, we will be able to start up Fram again, hopefully later today. (Note that the NIRD project areas will _not_ be available until NIRD is up again.)

We will update this post with more information when we know more.

[RESOLVED] Short file system hang on /nird/home on Fram

Update: There was a second hang at around 09:00. The cause of the hangs has been found and fixed.

There was a short hang on the /nird/home ($HOME) file system on Fram from 08:00 to 08:55 today. The file system is back to normal now. We are investigating the reason for it.

Jobs running in $SCRATCH, $USERWORK or in a project directory have most likely not been affected, but it is probably a good idea to check the status of your jobs.