Long jobs

This page describes how suitable applications requiring unusually long execution times can be run on CSD3.

Definition of long jobs

We define long jobs as jobs requiring wall times (i.e. real execution times) of up to 7 days. Continuous execution of this length is normally disallowed by both non-paying and paying qualities of service (QoS) in order to maintain reasonable overall throughput.

In general it is advisable for any application running for an extended period to be able to save its progress (i.e. to checkpoint) as insurance against unexpected failures, which may otherwise waste significant resources. Applications that can checkpoint are largely immune to per-job runtime limits, since they can simply resume from the most recent checkpoint in the guise of a new job. Applications for which checkpointing is not feasible may find the scheduling features described below useful.
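For applications that do checkpoint, extended runs can be accumulated by chaining ordinary-length jobs together, each resuming from the previous checkpoint. The sketch below is illustrative only: the script name chained_job.sh and the application my_application (assumed to detect and load its own checkpoint files on startup) are placeholders, not CSD3-specific names.

#!/bin/bash
#SBATCH -J chained_job
#SBATCH -A CHANGEME-SL3
#SBATCH -p skylake
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00

# Queue the next link in the chain; it starts only after this job
# ends for any reason. Cancel it with scancel when the run is done.
sbatch --dependency=afterany:${SLURM_JOB_ID} chained_job.sh

# The application is assumed to resume from its own most recent
# checkpoint if one exists.
./my_application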

Access to the long job QoS is not given by default; if you have a need to submit such jobs please contact support describing your case.

Checkpointing with DMTCP

CSD3 has three DMTCP modules installed (listed below). DMTCP creates checkpoints that allow an application to be resumed if it fails to complete within the permitted time (i.e. a maximum of 12 hours for SL-3 and 36 hours for SL-2). These checkpoints are snapshots of a running application, including all of its memory structures and, optionally, the files it has created. DMTCP was created for applications that have no mid-run stop and resume capability of their own. When resuming from a checkpoint, the application behaves, from its own "point of view", as if it had never been stopped. Some jobs may therefore be able to use DMTCP to accumulate extended run times without needing to request the special QoS described below.

DMTCP can be used on CSD3 explicitly within a job script. We have tested DMTCP for single-node and multi-node applications; however, we cannot guarantee that it will work for all applications. Users should run their own tests and choose the DMTCP commands and options appropriate to their application.

The installed DMTCP modules are dmtcp/2.3.1, dmtcp/2.6.0 and dmtcp/2.6.0-intel-17.0.4. The last of these should be used whenever an Intel module is loaded. An example submission script demonstrating the use of DMTCP follows:

#!/bin/bash
#SBATCH -J dmtcp_example
#SBATCH -A CHANGEME-SL3
#SBATCH --output=dmtcp_example.out
#SBATCH --error=dmtcp_example.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=2

runcmd=./example
tint=5

echo "Start coordinator"
date
eval "dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt --exit-on-last -i "$tint" --port-file cport.txt -p 0"
sleep 2
cport=$(<cport.txt)
echo "$cport"

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    CMD="dmtcp_restart -p "$cport" -i "$tint" ckpt*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start the application"
    CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost -p "$cport" "$runcmd
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
sleep 2
dmtcp_command -h localhost -p $cport --quit

The above submission script runs the C++ application example (source code below), which, once per second, prints the value of an integer counter to the screen and appends it to a text file (example.txt); the value increases by one each second. In the script DMTCP is set to checkpoint the application every 5 seconds (the interval tint); because the coordinator is started with --exit-after-ckpt, execution stops once the first checkpoint has been written.

After the job has been interrupted it may be resumed simply by resubmitting it with the same submission script. Note also that the ulimit -s 8192 command in the script is required for DMTCP to work.
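For example, assuming the submission script above has been saved as dmtcp_example.sh (an illustrative name), resuming is just a resubmission:

sbatch dmtcp_example.sh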

#include <iostream>
#include <chrono>
#include <fstream>
#include <thread>

int main(int argc, char *argv[])
{
    std::ofstream myfile;
    myfile.open ("example.txt");

    // Count up once per second, writing progress to both stdout
    // and example.txt; a DMTCP restart resumes mid-loop.
    for (int i=0; i<500; i++)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));

        std::cout << "Count: " << i << std::endl;
        myfile << i << std::endl;
    }

    myfile.close();
}
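The example can be built with any C++11 compiler, e.g. (the choice of g++ here is illustrative):

g++ -std=c++11 -pthread example.cpp -o example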

DMTCP has a wide range of parameters which can have different effects on different types of applications. Users should experiment with their specific application to make sure that checkpoints are created successfully.

For example, applications that write files might need the option --ckpt-open-files, which stores the files generated so far as part of the checkpoint. Before restarting, the files created by the application should be deleted, because DMTCP restores its own copies from the checkpoint.
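As a sketch, the restart branch of the submission script above could remove the application's output before resuming (example.txt is the file written by the example application; substitute whatever files your own application creates):

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    # Delete the application's own output file first; DMTCP
    # restores its saved copy from the checkpoint.
    rm -f example.txt
    CMD="dmtcp_restart -p $cport -i $tint ckpt*.dmtcp"
    eval $CMD
fi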

An example submission script for multiple nodes is shown below:

#!/bin/bash
#SBATCH -J dmtcp_example
#SBATCH -A CHANGEME-SL3
#SBATCH --output=dmtcp_example.out
#SBATCH --error=dmtcp_example.err
#SBATCH --nodes=8
#SBATCH --ntasks=256
#SBATCH --time=00:01:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=2

runcmd=./example
tint=5

echo "Start coordinator"
date
eval "dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt --exit-on-last -i "$tint" --port-file cport.txt -p 0"
sleep 2
cport=$(<cport.txt)
echo "$cport"
h=$(hostname)
echo "$h"

export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$cport

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
else
    echo "Start the application"
    mpirun -env I_MPI_FABRICS tcp -ppn 32 -np ${SLURM_NTASKS} dmtcp_launch --ckpt-open-files --infiniband --no-gzip --rm $runcmd
fi

echo "Stopped program execution"
date
sleep 2
dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit

More information about DMTCP is available on the official website: http://dmtcp.sourceforge.net/docs/index.html.

The QOSL QoS

Paying users with suitable applications may be granted access to the QOSL quality of service, which permits jobs running for up to 7 days. Jobs using this special QoS are confined to the -long variants of the usual partitions (skylake-long, knl-long, pascal-long). Users wishing to run for a long time are expected to accept that others will do the same, and hence to tolerate longer waits for a start time.

QOSL is implemented by three SLURM QoS definitions, one per cluster:

Peta4-Skylake: cpul, restricted to 640 CPUs per user.
Peta4-KNL: knll, restricted to 64 nodes per user.
Wilkes2-GPU: gpul, restricted to 32 GPUs per user.
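The limits attached to each QoS can be inspected with SLURM's accounting tools, e.g. (the field list here is illustrative):

sacctmgr show qos where name=cpul format=Name,MaxWall,MaxTRESPU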

In order to apply for access to QOSL, please email support@hpc.cam.ac.uk detailing why this mode of usage is necessary, and explaining why checkpointing is not a practical option for your application.

Submitting long jobs

Use of QOSL is tied to the -long partitions, so once access has been granted it is only necessary to specify the appropriate partition - e.g.

sbatch -t 7-0:0:0 -p skylake-long -A YOUR_PROJECT ...
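
A complete long-job submission script might look like the following minimal sketch (the job name, resource requests and application my_long_application are placeholders to adapt):

#!/bin/bash
#SBATCH -J long_job
#SBATCH -A YOUR_PROJECT
#SBATCH -p skylake-long
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=7-0:0:0

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4

./my_long_application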