This page describes how suitable applications requiring unusually long execution times can be run on CSD3.
Definition of long jobs
We define long jobs as jobs requiring wall times (i.e. real execution times) of up to 7 days. Continuous execution times of these lengths are normally disallowed by both non-paying and paying qualities of service (QoS), in order to achieve reasonable overall throughput.
In general it is advisable for any application running for extended periods to have the ability to save its progress (i.e. to checkpoint) as insurance against unexpected failures that may result in wastage of significant resources. Applications for which it is possible to checkpoint are largely immune from per-job runtime limits as they can simply resume from the most recent checkpoint in the guise of a new job. Applications for which it is not feasible to checkpoint may find the scheduling features described below to be of use.
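The checkpoint-and-resume pattern described above can be sketched in shell as follows. This is only an illustration under stated assumptions: the checkpoint file name, the step counter, and the function names (load_checkpoint, save_checkpoint, run_from_checkpoint) are hypothetical, and a real application would save its own state in whatever format it supports.

```shell
#!/bin/bash
# Hypothetical checkpoint/resume loop; names and file layout are illustrative only.
CKPT_FILE="${CKPT_FILE:-checkpoint.txt}"   # assumed checkpoint location
TOTAL_STEPS="${TOTAL_STEPS:-10}"           # assumed number of work units

# Print the last completed step, or 0 if no checkpoint exists yet.
load_checkpoint() {
    if [ -f "$CKPT_FILE" ]; then
        cat "$CKPT_FILE"
    else
        echo 0
    fi
}

# Record progress by writing a temporary file and renaming it, so that a job
# killed mid-write does not leave a truncated checkpoint behind.
save_checkpoint() {
    echo "$1" > "$CKPT_FILE.tmp" && mv "$CKPT_FILE.tmp" "$CKPT_FILE"
}

# Continue from the most recent checkpoint, saving after every step.
run_from_checkpoint() {
    local step
    step=$(load_checkpoint)
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        save_checkpoint "$step"
    done
    echo "finished at step $step"
}
```

A job script built this way would call run_from_checkpoint as its final line. If the job hits its wall time limit, resubmitting the same script picks up from the last saved step rather than starting again.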
Access to the long job QoS is not given by default; if you have a need to submit such jobs please contact support describing your case.
The QOSL QoS
Paying users with suitable applications may be granted access to the QOSL quality of service, which permits jobs running for up to 7 days. Jobs associated with this special QoS are confined to -long variants of the usual partitions (e.g. skylake-long, pascal-long). Users who wish to run long jobs are expected to accept that other users' long jobs may delay their own start times.
QOSL is implemented by three SLURM QoS definitions, one per cluster. Peta4-Skylake has cpul, which is restricted to 640 CPUs per user. Peta4-KNL has knll, which is restricted to 64 nodes per user. Wilkes2-GPU has gpul, which is restricted to 32 GPUs per user.
In order to apply for access to QOSL, please email email@example.com detailing why this mode of usage is necessary, and explaining why checkpointing is not a practical option for your application.
Submitting long jobs
Use of QOSL is tied to the -long partitions; once access has been granted, it is only necessary to specify the relevant partition, e.g.
sbatch -t 7-0:0:0 -p skylake-long -A YOUR_PROJECT ...
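The same options can also be placed inside a batch script as #SBATCH directives instead of on the command line. The following is a minimal sketch, not a site-provided template: YOUR_PROJECT remains a placeholder, the job_label variable is purely illustrative, and the application launch line is left for you to fill in.

```shell
#!/bin/bash
# Wall time limit of 7 days (days-hours:minutes:seconds).
#SBATCH -t 7-0:0:0
# The -long partition; as noted above, no separate QoS option is needed.
#SBATCH -p skylake-long
# Placeholder project name, as in the command above.
#SBATCH -A YOUR_PROJECT

# SLURM sets SLURM_JOB_ID inside a job; fall back to a label when run by hand.
job_label="${SLURM_JOB_ID:-interactive}"
echo "Starting long job ${job_label}"

# ... launch the application here ...
```

Saved under a name of your choosing, the script is submitted by passing it to sbatch. Options given on the sbatch command line take precedence over the #SBATCH directives in the script.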