Long jobs

This page describes how suitable applications requiring unusually long execution times can be run on CSD3.

Definition of long jobs

We define long jobs as jobs requiring wall times (i.e. real execution times) of up to 7 days. Continuous execution times of these lengths are normally disallowed by both non-paying and paying qualities of service (QoS), in order to achieve reasonable overall throughput.

In general it is advisable for any application running for extended periods to have the ability to save its progress (i.e. to checkpoint) as insurance against unexpected failures that may result in wastage of significant resources. Applications for which it is possible to checkpoint are largely immune from per-job runtime limits as they can simply resume from the most recent checkpoint in the guise of a new job. Applications for which it is not feasible to checkpoint may find the scheduling features described below to be of use.

Access to the long job QoS is not given by default; if you have a need to submit such jobs please contact support describing your case.

Note on checkpointing

Note that CSD3 has three DMTCP modules installed. DMTCP creates checkpoints that allow an application to be resumed if it fails to complete within the permitted time period (i.e. a maximum of 12 hours for SL3 and 36 hours for SL2). These checkpoints are snapshots of a running application, which include all of its memory structures and, optionally, files created by the application. DMTCP was created for applications that don’t have mid-run stop and resume capabilities. Using a checkpoint, DMTCP can resume the application in such a way that, from the application’s point of view, it was never stopped. Some jobs may be able to use DMTCP to accumulate extended run times without needing to request the special QoS described below.
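As a minimal sketch of how DMTCP is typically used interactively (the checkpoint interval and the executable name ./myapp are illustrative):

# Launch the application under DMTCP, taking a checkpoint every 3600 seconds
dmtcp_launch -i 3600 ./myapp

# After an interruption, resume from the latest checkpoint using the restart
# script that DMTCP writes alongside its checkpoint files
./dmtcp_restart_script.sh

The job script below wraps the same idea for the batch environment, taking care of coordinator port selection and restart detection.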

Currently we have tested DMTCP only for single-node applications, and at the time of writing it is used by invoking it explicitly within a job script. The installed DMTCP modules are dmtcp/2.3.1, dmtcp/2.6.0 and dmtcp/2.6.0-intel-17.0.4; the latter should be used whenever an Intel module is loaded. An example submission script that demonstrates the use of DMTCP is the following:

#!/bin/bash
#SBATCH -J dmtcp_example
#SBATCH -A CHANGEME-SL3
#SBATCH --output=dmtcp_example.out
#SBATCH --error=dmtcp_example.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
# DMTCP requires this stack size limit in order to work (see the note below)
ulimit -s 8192

# Restart script generated by DMTCP alongside its checkpoint files
RESTARTSCRIPT="dmtcp_restart_script.sh"
# Reduce the amount of output produced by DMTCP itself
export DMTCP_QUIET=2

runcmd=./example   # application to run under DMTCP control
tint=5             # checkpoint interval in seconds

echo "Start coordinator"
date
eval "dmtcp_coordinator --daemon --exit-after-ckpt --exit-on-last -i "$tint" --port-file cport.txt -p 0"
sleep 2
cport=$(<cport.txt)
echo "$cport"

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume executable"
    CMD="dmtcp_restart --ckptdir ckpoints -p "$cport" ckpoints/ckpt_*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start new executable"
    CMD="dmtcp_launch --ckptdir ckpoints --ckpt-open-files -p "$cport" "$runcmd
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
sleep 2
CMD="dmtcp_command --quit --port "$cport""

The above submission script executes the C++ application example (source code shown below), which, once a second, prints to the screen and appends to a text file (example.txt) the value of an integer counter that increases by one each iteration. In the submission script DMTCP is set to create a checkpoint 5 seconds after the start of the application, and to store the file example.txt, which is generated by the application, as part of the checkpoint (via --ckpt-open-files). After the submitted job has been interrupted it may be resumed by deleting the example.txt file and resubmitting the job; the deletion is required because DMTCP keeps its own copy of the file as part of the checkpoint. Note also that the command ulimit -s 8192 in the submission script is required for DMTCP to work. More information about DMTCP can be found on the official website: http://dmtcp.sourceforge.net/docs/index.html.

#include <iostream>
#include <chrono>
#include <fstream>
#include <thread>

int main(int argc, char *argv[])
{
    std::ofstream myfile;
    myfile.open ("example.txt");

    for (int i=0; i<500; i++)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));

        std::cout << "Count: " << i << std::endl;
        myfile << i << std::endl;
    }

    myfile.close();
}
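To try the example end to end, it can be compiled and the job resubmitted after an interruption roughly as follows (the file names example.cpp and dmtcp_example.sbatch are illustrative):

# Compile the example with any C++11-capable compiler
g++ -std=c++11 -o example example.cpp

# First submission: the program starts and a checkpoint is taken after 5 seconds
sbatch dmtcp_example.sbatch

# After the job has been interrupted, remove example.txt (DMTCP keeps its own
# copy as part of the checkpoint) and resubmit; the job script then finds
# dmtcp_restart_script.sh and resumes from the last checkpoint
rm example.txt
sbatch dmtcp_example.sbatch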

The QOSL QoS

Paying users with suitable applications may be granted access to the QOSL quality of service, which permits jobs to run for up to 7 days. Jobs using this special QoS are confined to the -long variants of the usual partitions (skylake-long, knl-long, pascal-long). Users wishing to run for a long time are expected to accept that others will do the same, and therefore that jobs may wait longer for a start time.
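The -long partitions can be listed in the usual way, e.g.:

sinfo -p skylake-long,knl-long,pascal-long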

QOSL is implemented by three SLURM QoS definitions, one per cluster. Peta4-Skylake has cpul, which is restricted to 640 CPUs per user. Peta4-KNL has knll, which is restricted to 64 nodes per user. Wilkes2-GPU has gpul, which is restricted to 32 GPUs per user.
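The per-user limits attached to these QoS definitions can be queried directly from SLURM, e.g. (field names may vary slightly between SLURM versions):

sacctmgr show qos cpul,knll,gpul format=Name,MaxWall,MaxTRESPU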

In order to apply for access to QOSL, please email support@hpc.cam.ac.uk detailing why this mode of usage is necessary, and explaining why checkpointing is not a practical option for your application.

Submitting long jobs

Use of QOSL is tied to the -long partitions, so once given access it is only necessary to specify the appropriate partition, e.g.

sbatch -t 7-0:0:0 -p skylake-long -A YOUR_PROJECT ...
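Equivalently, the partition and a wall time of up to 7 days can be requested in a batch script. The following is a sketch only; the job name, module selection and executable are placeholders to be adapted to your application:

#!/bin/bash
#SBATCH -J long_job_example
#SBATCH -A YOUR_PROJECT
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00
#SBATCH -p skylake-long

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4

# Placeholder for an application that must run continuously for up to 7 days
./long_running_application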