Running Jobs on CSD3

SLURM is an open-source workload management and job scheduling system. Research Computing clusters adopted SLURM in February 2014; before that they used Torque, Maui/Moab and Gold (referred to below simply as “PBS”) for the same purpose. Several commands with the same names as in PBS (e.g. showq, qstat, qsub, qdel, qrerun) are available in SLURM for backwards compatibility, but in general we recommend using the native SLURM commands described below.

If you have any questions on how to run jobs on CSD3 do not hesitate to contact the support desk.

Accounting Commands

The following commands are wrappers around the underlying SLURM commands sacct and sreport, which are much more powerful.

Note that project names in SLURM are not case sensitive.

What resources do I have available to me?

This is the first question to settle before submitting jobs to CSD3. Use the command

mybalance

to show your projects, your current usage and the remaining balances in compute unit hours.

On CSD3 we are using natural compute units for each component of the facility:

  • on Peta4-Skylake we are allocating and reporting in CPU core hours
  • on Peta4-KNL we are allocating and reporting in KNL node hours
  • on Wilkes2-GPU we are allocating and reporting in GPU hours.

We have adopted the convention that projects containing Peta4-Skylake CPU hours will end in -CPU, while those holding GPU hours for Wilkes2-GPU end in -GPU, and projects containing Peta4-KNL node hours end in -KNL.

The projects listed by mybalance are the projects you may specify in SLURM submissions either through

#SBATCH -A project

in the job submission script or equivalently on the command line with

sbatch -A project ...

Use -CPU projects for Peta4-Skylake jobs, -KNL projects for Peta4-KNL and -GPU projects for Wilkes2-GPU. See the Submitting jobs section for details on submitting to each cluster.
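The suffix convention can be checked mechanically before submitting. The following is a minimal sketch: partition_for_project is a hypothetical helper (not a CSD3 command) which maps a project name to the partition names given in the per-cluster sections of this page.

```shell
#!/bin/bash
# Map a project name to a partition using the -CPU / -KNL / -GPU suffix
# convention. Hypothetical helper for illustration, not a CSD3 tool.
partition_for_project() {
    local project
    # Upper-case the name first: project names are not case sensitive.
    project=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$project" in
        *-CPU) echo "skylake" ;;   # skylake-himem also accepts -CPU projects
        *-KNL) echo "knl" ;;
        *-GPU) echo "pascal" ;;
        *)     echo "unknown project suffix: $1" >&2; return 1 ;;
    esac
}

partition_for_project T2BENCH-SL2-CPU       # prints skylake
partition_for_project halos-sl2-spqr1-gpu   # prints pascal
```

The result could then be passed to sbatch alongside the project, e.g. sbatch -A project -p partition ...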

How many core hours does some other project or user have?

gbalance -p T2BENCH-SL2-CPU
User           Usage | Account           Usage | Account Limit Available (hours)
---------- --------- + --------------- ------- + ------------- -----------------
xyz10              0 | T2BENCH-SL2-CPU       0 |       200,000           200,000

This outputs the total usage in core hours accumulated to date for the project, the total awarded and total remaining available (i.e. to all members). It also prints the component of the total usage due to each member.

I would like a listing of all jobs I have submitted through a certain project and between certain times

gstatement -p SUPPORT-CPU -u xyz10 -s "2017-10-01-00:00:00" -e "2017-11-22-23:59:59"
JobID        User      Account    JobName    Partition  End                 ExitCode State      CompHrs
------------ --------- ---------- ---------- ---------- ------------------- -------- ---------- --------
204815       xyz10     support-c+ _interact+ skylake    2017-10-20T16:20:07 0:0      COMPLETED  0.9
261251       xyz10     support-c+ _interact+ skylake    2017-11-09T17:39:43 0:0      TIMEOUT    1.0
262050       xyz10     support-c+ _interact+ skylake    2017-11-11T14:00:03 0:0      CANCELLED+ 1.5
262051       xyz10     support-c+ _interact+ skylake-h+ 2017-11-11T14:00:03 0:0      CANCELLED+ 0.7
...

This lists the charge for each job in the CompHrs column. Since this example queries usage of a -CPU project, these are CPU core hours. Similarly, for a -GPU project they would be GPU hours, and for a -KNL project they would be node hours.

I would like to add compute hours to a particular member of my group

gdeposit -z 10000 -p halos-sl2-spqr1-gpu

The coordinator of the HALOS-SL2-GPU project might use this to add 10000 GPU hours to the HALOS-SL2-SPQR1-GPU subproject assigned to the user spqr1. Note that if a compute hour limit applies to the parent of the project in the project hierarchy (i.e. if the parent project HALOS-SL2-GPU has an overall compute hour limit, which it almost certainly does), then that global limit will still apply across all per-user projects.

Compute hours may be added to a project by a designated project coordinator user. The compute hours available to a project can also be reduced by depositing a negative number of hours via the --time= syntax; e.g. the following command undoes the above:

gdeposit --time=-10000 -p halos-sl2-spqr1-gpu

Submitting jobs

Sample submission scripts

In normal use of SLURM, one creates a batch job which is a shell script containing the set of commands to run, plus the resource requirements for the job which are coded as specially formatted shell comments at the top of the script. The batch job script is then submitted to SLURM with the sbatch command. A job script can be resubmitted with different parameters (e.g. different sets of data or variables).

Please copy and edit the sample submission scripts that can be found under

/usr/local/Cluster-Docs/SLURM

New user accounts also have symbolic links to template files in their home directories. Lines beginning #SBATCH are directives to the batch system; the rest of each directive specifies arguments to the sbatch command. SLURM stops reading directives at the first executable line (i.e. the first line that is non-blank and does not begin with #).
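To see which directives in a script are live, the rule above can be mimicked with a few lines of awk. This is only a sketch: active_directives is a hypothetical helper which applies the rule as stated here, and it does not query SLURM itself.

```shell
#!/bin/bash
# Print the #SBATCH lines that fall before the first executable line.
# Hypothetical helper for illustration; it only mimics the stated rule.
active_directives() {
    awk '
        /^#SBATCH/       { print; next }  # a live directive
        /^[[:space:]]*$/ { next }         # blank lines do not end the header
        /^#/             { next }         # comments, including disabled ##SBATCH lines
        { exit }                          # first executable line: stop reading
    ' "$1"
}

# Demonstration on a throwaway script
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH -p skylake
##SBATCH --mem=
#SBATCH --time=02:00:00
echo "running"
#SBATCH --nodes=2
EOF
active_directives "$demo"   # prints the -p and --time lines only
rm -f "$demo"
```

The --nodes line is skipped because it follows an executable line, and the ##SBATCH line is skipped because it is an ordinary comment.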

The main directives to modify are as follows:

#! Which project should be charged:
#SBATCH -A MYPROJECT-CPU
#! Which partition/cluster am I using?
#SBATCH -p skylake
#! How many nodes should be allocated? If not specified SLURM
#! assumes 1 node.
#SBATCH --nodes=2
#! How many tasks will there be in total? By default SLURM
#! will assume 1 task per node and 1 CPU per task.
#SBATCH --ntasks=64
#! How much memory in MB is required _per node_? Not setting this
#! as here will lead to a default amount per task.
#! Setting a larger amount per task increases the number of CPUs.
##SBATCH --mem=
#! How much wallclock time will be required?
#SBATCH --time=02:00:00

In particular, the name of the project is required for the job to be scheduled (use the command mybalance to check what this is for you in case of doubt). Charging is reported in units of compute hours (what these represent depends on the cluster).
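Before submitting, it can help to estimate the most a job could cost. The sketch below assumes that a Peta4-Skylake job is charged for its allocated CPUs multiplied by the elapsed wallclock time, so that the requested --time gives an upper bound; max_core_hours is a hypothetical helper, not a CSD3 command.

```shell
#!/bin/bash
# Upper bound on the charge for a job, in CPU core hours.
# ASSUMPTION: charge = allocated CPUs x wallclock hours.
# Hypothetical helper for illustration, not a CSD3 tool.
max_core_hours() {
    local cpus=$1 walltime=$2   # walltime given as hh:mm:ss
    # awk splits the time and handles the fractional result;
    # shell arithmetic is integer-only.
    awk -v c="$cpus" -v t="$walltime" 'BEGIN {
        split(t, p, ":")
        printf "%.1f\n", c * (p[1] * 3600 + p[2] * 60 + p[3]) / 3600
    }'
}

# 64 tasks with 1 CPU each for 2 hours, as in the directives above
max_core_hours 64 02:00:00   # prints 128.0
```

A job that finishes early should be charged less than this bound under the same assumption; compare with the CompHrs column of gstatement after the job has run.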

See the following sections for more details on the setting of directives for each of the three CSD3 clusters.

Peta4-Skylake

Peta4-Skylake assigns usage in units of CPU core hours. By convention projects containing CPU core hours have names ending in -CPU.

Jobs require the partitions skylake or skylake-himem, i.e.

#SBATCH -p skylake

or

#SBATCH -p skylake-himem

and will be allocated the number of CPUs required for the number of tasks requested and a corresponding amount of memory.

By default, the skylake partition provides 1 CPU and 5980MB of RAM per task, and the skylake-himem partition provides 1 CPU and 12030MB per task.

Requesting more CPUs per task, or more memory per task, may each increase the number of CPUs allocated (and hence the charge). It is more cost-efficient to submit jobs requiring more than 5980MB per task to the skylake-himem partition, since more memory per CPU is natively available there.
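The effect of a memory request on the allocation can be estimated as follows. This sketch assumes that memory is granted in whole per-CPU blocks of the default sizes quoted above, rounded up; cpus_for_task is a hypothetical helper and the scheduler's own accounting is authoritative.

```shell
#!/bin/bash
# Estimate the CPUs allocated to one task from its CPU and memory request.
# ASSUMPTION: memory is granted in per-CPU blocks of 5980MB (skylake) or
# 12030MB (skylake-himem), rounded up. Hypothetical helper, not a CSD3 tool.
cpus_for_task() {
    local partition=$1 cpus_requested=$2 mem_mb=$3
    local mem_per_cpu cpus_for_mem
    case "$partition" in
        skylake)       mem_per_cpu=5980 ;;
        skylake-himem) mem_per_cpu=12030 ;;
        *) echo "unknown partition: $partition" >&2; return 1 ;;
    esac
    # CPUs needed to cover the memory request (integer ceiling division)
    cpus_for_mem=$(( (mem_mb + mem_per_cpu - 1) / mem_per_cpu ))
    # The allocation must satisfy both the CPU and the memory request
    if [ "$cpus_for_mem" -gt "$cpus_requested" ]; then
        echo "$cpus_for_mem"
    else
        echo "$cpus_requested"
    fi
}

cpus_for_task skylake 1 10000         # prints 2
cpus_for_task skylake-himem 1 10000   # prints 1
```

Under this assumption a single-CPU task needing 10000MB occupies two CPUs on skylake but only one on skylake-himem, which is why the latter is cheaper for such jobs.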

NB Hyperthreading is disabled on the Skylake nodes so there is no distinction between CPUs and cores.

Peta4-KNL

Peta4-KNL assigns usage in units of KNL node hours. By convention projects containing KNL node hours have names ending in -KNL.

Jobs require the partition knl, i.e.

#SBATCH -p knl

and will be allocated entire KNL nodes. Each KNL node has 64 physical cores but presents 256 CPUs via hyperthreading, and has 96GB of DDR RAM plus 16GB of MCDRAM high-bandwidth memory. Nodes are configured in quadrant/cache mode by default (in cache mode, the MCDRAM works invisibly as cache).

It is possible to vary the MCDRAM mode required at job submission time: please use either the --constraint option of sbatch or the equivalent -C option to select the mode. We recommend using either

#SBATCH --constraint=cache

or

#SBATCH --constraint=flat

Flat mode makes the MCDRAM visible as a second 16GB NUMA zone. Please note that hybrid MCDRAM mode and NUMA modes other than quad(rant) are not recommended.

Wilkes2-GPU

Wilkes2-GPU assigns usage in units of GPU hours. By convention projects containing GPU hours have names ending in -GPU.

Jobs require the partition pascal, i.e.

#SBATCH -p pascal

and may request any number of GPUs per node from the range 1 to 4, which is done via the directive

#SBATCH --gres=gpu:N

where 1 <= N <= 4.

Each GPU node contains 4 NVIDIA Pascal P100 GPUs, with 96GB RAM and a single 12-core Broadwell processor.

Any job requesting more than one node must request 4 GPUs per node. Jobs smaller than one node will be prevented from requesting more than 3 CPUs per GPU. These rules are enforced by a job submission filter, which will produce an explanatory message if it rejects a job outright.
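These rules can be checked locally before submitting. The following is a sketch only: check_gpu_request is a hypothetical helper which encodes the three rules as stated on this page, and it is not the actual submission filter (whose messages will differ).

```shell
#!/bin/bash
# Check a Wilkes2-GPU request against the rules described above.
# Arguments: number of nodes, GPUs per node, total CPUs requested.
# Hypothetical helper for illustration; NOT the real submission filter.
check_gpu_request() {
    local nodes=$1 gpus_per_node=$2 cpus=$3
    if [ "$gpus_per_node" -lt 1 ] || [ "$gpus_per_node" -gt 4 ]; then
        echo "reject: --gres=gpu:N requires 1 <= N <= 4"
        return 1
    fi
    if [ "$nodes" -gt 1 ] && [ "$gpus_per_node" -ne 4 ]; then
        echo "reject: multi-node jobs must request 4 GPUs per node"
        return 1
    fi
    if [ "$nodes" -eq 1 ] && [ "$gpus_per_node" -lt 4 ] \
        && [ "$cpus" -gt $(( 3 * gpus_per_node )) ]; then
        echo "reject: at most 3 CPUs per GPU for jobs below one node"
        return 1
    fi
    echo "ok"
}

check_gpu_request 1 2 6            # prints ok
check_gpu_request 2 2 12 || true   # rejected: multi-node needs 4 GPUs per node
```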

Submitting the job to the queuing system

The command sbatch is used to submit jobs, e.g.

sbatch submission_script

The command will return a unique job identifier, which is used to query and control the job and to identify output. See the man page (man sbatch) for more options.

The following more complex example submits a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7) to the project STARS-SL2-CPU:

sbatch --array=1-7:2 -A STARS-SL2-CPU submission_script
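To preview which indices an array specification produces, the range can be expanded with seq. This is a minimal sketch: expand_array_spec is a hypothetical helper which handles only the simple first-last and first-last:step forms used on this page, not every form that sbatch accepts.

```shell
#!/bin/bash
# Expand a simple --array specification ("first-last" or "first-last:step")
# into the list of indices. Hypothetical helper for illustration.
expand_array_spec() {
    local spec=$1 range step=1 first last
    range=${spec%%:*}                             # part before any ":step"
    case "$spec" in *:*) step=${spec##*:} ;; esac
    first=${range%%-*}
    last=${range##*-}
    # Unquoted on purpose so the newlines from seq collapse to spaces
    echo $(seq "$first" "$step" "$last")
}

expand_array_spec 1-7:2   # prints 1 3 5 7
```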

Deleting jobs

To cancel a job (either running or still queuing) use scancel:

scancel <jobid>

The <jobid> is printed when the job is submitted; alternatively, use the commands squeue, qstat or showq to obtain the job ID.
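In scripts it is convenient to capture the job ID at submission time so that it can be passed to scancel later. The sketch below assumes the usual sbatch message of the form "Submitted batch job <jobid>"; jobid_from_sbatch_output is a hypothetical helper.

```shell
#!/bin/bash
# Extract the job ID from the message printed by sbatch.
# ASSUMPTION: the message has the form "Submitted batch job <jobid>".
# Hypothetical helper for illustration.
jobid_from_sbatch_output() {
    # The job ID is the fourth whitespace-separated field of that line
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration on a sample message (no job is submitted here)
echo "Submitted batch job 262050" | jobid_from_sbatch_output   # prints 262050

# Intended use on the cluster (not run here):
#   jobid=$(sbatch submission_script | jobid_from_sbatch_output)
#   scancel "$jobid"
```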

Array jobs

Array jobs allow the submission of multiple similar jobs. Array jobs should be preferred to multiple sbatch calls in a loop. An example of a submission script is shown below. In this example, 32 jobs are submitted. Each job creates a folder named Job_ followed by its SLURM_ARRAY_TASK_ID, enters this folder and runs the example executable with the command line argument 5.

#!/bin/bash
#! This line is a comment
#! Make sure you only have comments and #SBATCH directives between here and the end of the #SBATCH directives, or things will break
#! Name of the job:
#SBATCH -J test_dmtcp
#! Account name for group, use SL2 for paying queue:
#SBATCH -A MYPROJECT-CPU
#! Output filename:
#! %A means slurm job ID and %a means array index
#SBATCH --output=test_dmtcp_%A_%a.out
#! Errors filename:
#SBATCH --error=test_dmtcp_%A_%a.err

#! Number of nodes to be allocated for the job (for single core jobs always leave this at 1)
#SBATCH --nodes=1
#! Number of tasks. By default SLURM assumes 1 task per node and 1 CPU per task. (for single core jobs always leave this at 1)
#SBATCH --ntasks=1
#! How many cores will be allocated per task? (for single core jobs always leave this at 1)
#SBATCH --cpus-per-task=1
#! Estimated runtime: hh:mm:ss (job is force-stopped if exceeded):
#SBATCH --time=00:10:00
#! Estimated maximum memory needed (job is force-stopped if exceeded):
#! RAM is allocated in ~5980mb blocks, you are charged per block used,
#! and unused fractions of blocks will not be usable by others.
#SBATCH --mem=5980mb
#! Submit a job array with index values between 0 and 31
#! NOTE: This must be a range, not a single number (i.e. specifying '32' here would only run one job, with index 32)
#SBATCH --array=0-31

#! This is the partition name.
#SBATCH -p skylake

#! mail alert at start, end and abortion of execution
#! emails will default to going to your email address
#! you can specify a different email address manually if needed.
##SBATCH --mail-type=ALL

#! Don't put any #SBATCH directives below this line

#! Modify the environment seen by the application. For this example we need the default modules.
. /etc/profile.d/modules.sh                # This line enables the module command
module purge                               # Removes all modules still loaded
module load rhel7/default-peta4            # REQUIRED - loads the basic environment

#! The variable $SLURM_ARRAY_TASK_ID contains the array index for each job.
#! In this example, each job will be passed its index, so each output file will contain a different value
echo "This is job" $SLURM_ARRAY_TASK_ID

#! Command line that we want to run:
jobDir=Job_$SLURM_ARRAY_TASK_ID
mkdir -p "$jobDir"
cd "$jobDir" || exit 1

../example 5

The C++ source code of the example executable (example.cpp) is shown below. It can be compiled using g++ -std=c++11 example.cpp -o example.

#include <iostream>
#include <chrono>
#include <fstream>
#include <thread>

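// Command line arguments (such as the 5 passed by the job script) are
// accepted but not used by this example.
int main(int argc, char *argv[])
{
    // The file is created in the current working directory,
    // i.e. the per-job folder the submission script has entered.
    std::ofstream myfile;
    myfile.open ("example.txt");

    // Count once per second for 500 seconds, writing each value
    // both to standard output and to the file.
    for (int i=0; i<500; i++)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));

        std::cout << "Count: " << i << std::endl;
        myfile << i << std::endl;
    }

    myfile.close();
    return 0;
}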

The user is encouraged to experiment with this example before attempting more complex job array submissions.
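A common next step is to give each array task its own input. One possible approach, sketched below, keeps one parameter per line in a text file and selects the line matching the task index; param_for_task and the parameter file are hypothetical and not part of the sample scripts.

```shell
#!/bin/bash
# Select the parameter for one array task from a file with one value per line.
# Hypothetical helper for illustration.
param_for_task() {
    local file=$1 task_id=$2
    # Array indices in the example above start at 0; file lines start at 1
    sed -n "$(( task_id + 1 ))p" "$file"
}

# Demonstration with a throwaway parameter file
params=$(mktemp)
printf '%s\n' 5 10 20 > "$params"
# Outside SLURM the variable is unset, so fall back to index 1 for the demo
task_id=${SLURM_ARRAY_TASK_ID:-1}
param_for_task "$params" "$task_id"
rm -f "$params"
```

In a job script the selected value could replace the fixed argument, e.g. ../example "$(param_for_task ../params.txt "$SLURM_ARRAY_TASK_ID")", where params.txt is an assumed file in the submission directory.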