# Long jobs

This page describes how suitable applications requiring unusually long execution times can be run on CSD3.

## Definition of long jobs

We define long jobs as jobs requiring wall times (i.e. real execution times) of up to 7 days. Continuous execution times of these lengths are normally disallowed by both non-paying and paying qualities of service (QoS), in order to achieve reasonable overall throughput.

In general it is advisable for any application running for extended periods to have the ability to save its progress (i.e. to checkpoint) as insurance against unexpected failures that may result in wastage of significant resources. Applications for which it is possible to checkpoint are largely immune from per-job runtime limits as they can simply resume from the most recent checkpoint in the guise of a new job. Applications for which it is not feasible to checkpoint may find the scheduling features described below to be of use.
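As a sketch of what application-level checkpointing involves, the following C++ fragment saves a loop index to a file and reads it back on startup, so that a resubmitted job continues where the previous one stopped. The file name `checkpoint.dat` and the helper names are illustrative, not part of CSD3's setup:

```cpp
#include <fstream>

// Return the iteration saved by a previous run, or 0 if no checkpoint
// file exists yet (i.e. this is the first job in the chain).
int load_checkpoint(const char *path) {
    std::ifstream in(path);
    int start = 0;
    if (in >> start) {
        return start;
    }
    return 0;
}

// Overwrite the checkpoint file with the current iteration. Production
// codes usually write to a temporary file and rename it, so that a job
// killed mid-write cannot leave a corrupt checkpoint behind.
void save_checkpoint(const char *path, int i) {
    std::ofstream out(path);
    out << i << "\n";
}

// A main loop would then look like:
//   int i = load_checkpoint("checkpoint.dat");
//   for (; i < n_steps; ++i) {
//       do_work(i);
//       save_checkpoint("checkpoint.dat", i + 1);
//   }
```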

## Checkpointing with DMTCP

CSD3 provides three DMTCP modules. DMTCP creates checkpoints that allow an application to be resumed if it fails to complete within the permitted wall time (i.e. a maximum of 12 hours for SL-3 and 36 hours for SL-2). These checkpoints are snapshots of a running application, including all of its memory structures and, optionally, files created by the application. DMTCP was created for applications that lack built-in stop and resume capabilities: using a checkpoint, DMTCP can resume the application so that, from the application's "point of view", it was never stopped. Some jobs may therefore be able to use DMTCP to accumulate extended run times without needing to request the special QoS described below.

DMTCP can be used on CSD3 explicitly within a job script. We have tested DMTCP for single-node and multi-node applications; however, we cannot guarantee that it will work for all applications. Users should run their own tests and choose the DMTCP commands and options appropriate for their application.

The installed DMTCP modules are dmtcp/2.3.1, dmtcp/2.6.0 and dmtcp/2.6.0-intel-17.0.4; the last of these should be used whenever an Intel module is loaded. An example submission script demonstrating the use of DMTCP for a serial application follows:

```bash
#!/bin/bash
#SBATCH -J dmtcp_serial
#SBATCH -A CHANGE-ME
#SBATCH --output=dmtcp_serial_%A.out
#SBATCH --error=dmtcp_serial_%A.err
#SBATCH --nodes=1
#SBATCH --time=00:01:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0
ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=2

runcmd="./example_serial 5"
tint=30

echo "Start coordinator"
date
dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt --exit-on-last -i $tint --port-file cport.txt -p 0
sleep 2
cport=$(<cport.txt)
echo $cport
h=$(hostname)
echo $h

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    CMD="dmtcp_restart -p $cport -i $tint ckpt*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start the application"
    CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost -p $cport $runcmd"
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
```

The above submission script runs the C++ example application (source code below), which every second prints to the screen, and appends to a text file (example_output.txt), the value of an integer counter that increases by one each second. In the script above DMTCP is set to create a checkpoint 30 seconds (the interval tint) after the start of the application. After the job is interrupted it may be resumed by resubmitting the same submission script. Note that the ulimit -s 8192 command in the script is required for DMTCP to work.

```cpp
#include <iostream>
#include <chrono>
#include <thread>
#include <fstream>

int main(int argc, char *argv[])
{
    using namespace std::this_thread;
    using namespace std::chrono;

    std::cout << "Given command line arguments: " << std::endl;
    for (int i = 1; i < argc; ++i) {
        std::cout << argv[i] << std::endl;
    }

    std::ofstream output_file;
    output_file.open("example_output.txt");

    for (int i = 0; i < 120; i++) {
        sleep_for(seconds(1));
        std::cout << "Count: " << i << std::endl;
        output_file << i << "\n";
    }

    output_file.close();
    std::cout << "Example program end." << std::endl;
    return 0;
}
```

DMTCP has a wide range of parameters that can have different effects on different types of applications; users should experiment with their specific application to make sure that checkpoints are created successfully. For example, applications that write files might need the option --ckpt-open-files, which stores the generated files as part of the checkpoint. Such files must be deleted before restarting, because DMTCP restores its own copies from the checkpoint. An example submission script for multiple nodes is shown below:

```bash
#!/bin/bash
#SBATCH -J dmtcp_mpi
#SBATCH -A CHANGE-ME
#SBATCH --output=dmtcp_mpi_%A.out
#SBATCH --error=dmtcp_mpi_%A.err
#SBATCH --nodes=3
#SBATCH --ntasks=96
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=0

runcmd="./example_mpi"
tint=120

echo "Start coordinator"
date
dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt --exit-on-last -i $tint --port-file cport.txt -p 0
sleep 2
cport=$(<cport.txt)
echo $cport
h=$(hostname)
echo $h
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$cport

HOSTFILE=hostfile
scontrol show hostnames $SLURM_JOB_NODELIST > $HOSTFILE

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
else
    echo "Start the application"
    mpirun -env I_MPI_FABRICS tcp -ppn 32 -np 96 dmtcp_launch --no-gzip --rm $runcmd
fi

echo "Stopped program execution"
date
```


Complete examples can be found at https://github.com/RSE-Cambridge/dmtcp-tests.
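For checkpointable applications, resuming "in the guise of a new job" can be automated with Slurm job dependencies: submit the same DMTCP script several times, with each run gated on the previous one finishing for any reason (afterany). A sketch, where the script name job.sh and the chain length are illustrative (sbatch --parsable prints just the job ID):

```shell
# Build the --dependency option for a follow-up job in the chain; the
# first job has no predecessor, so it gets no dependency flag at all.
dep_flag() {
    prev=$1
    if [ -n "${prev}" ]; then
        echo "--dependency=afterany:${prev}"
    fi
}

# On CSD3 a four-job chain would then be driven as follows (not executed
# here, since it requires a Slurm login node):
#   jobid=$(sbatch --parsable job.sh)
#   jobid=$(sbatch --parsable $(dep_flag ${jobid}) job.sh)
#   jobid=$(sbatch --parsable $(dep_flag ${jobid}) job.sh)
#   jobid=$(sbatch --parsable $(dep_flag ${jobid}) job.sh)
```

Each job in the chain finds the checkpoint files left by its predecessor and resumes from them, exactly as when resubmitting the script by hand.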

## The QOSL QoS

Paying users with suitable applications may be granted access to the QOSL quality of service, which permits jobs running for up to 7 days. Jobs submitted under this special QoS are confined to the -long variants of the usual partitions (skylake-long, knl-long, pascal-long). Users wishing to run for a long time should expect to wait correspondingly longer for a start time while other long jobs do the same.

QOSL is implemented by three SLURM QoS definitions, one per cluster:

- Peta4-Skylake: cpul, restricted to 640 CPUs per user.
- Peta4-KNL: knll, restricted to 64 nodes per user.
- Wilkes2-GPU: gpul, restricted to 32 GPUs per user.

In order to apply for access to QOSL, please email support@hpc.cam.ac.uk detailing why this mode of usage is necessary, and explaining why checkpointing is not a practical option for your application.

## Submitting long jobs

Use of QOSL is tied to the -long partitions; once granted access, it is only necessary to specify the appropriate partition, e.g.:

```bash
sbatch -t 7-0:0:0 -p skylake-long -A YOUR_PROJECT ...
```