Running Jobs on CSD3

SLURM is an open source workload management and job scheduling system. The Research Computing clusters adopted SLURM in February 2014, having previously used Torque, Maui/Moab and Gold (referred to in the following simply as “PBS”) for the same purpose. Please note that SLURM provides several commands with the same names as their PBS counterparts (e.g. showq, qstat, qsub, qdel, qrerun) for backwards compatibility, but in general we recommend using the native SLURM commands, as described below.

If you have any questions on how to run jobs on CSD3 do not hesitate to contact the support desk.

Accounting Commands

The following commands are wrappers around the underlying SLURM commands sacct and sreport which are much more powerful.

Note that project names in SLURM are not case sensitive.
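
The underlying commands can also be queried directly when more detail is needed. As a rough sketch (the project name and dates below are placeholders, and the report options may need adjusting), sreport can summarise usage per user for a project over a date range:

sreport -t Hours cluster AccountUtilizationByUser Accounts=MYPROJECT-CPU Start=2023-09-01 End=2023-10-01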

What resources do I have available to me?

This is the first question to settle before submitting jobs to CSD3. Use the command

mybalance

to show your projects, your current usages and the remaining balances in compute unit hours.

On CSD3 we are using natural compute units for each component of the facility:

  • on CPU nodes we are allocating and reporting in CPU core hours
  • on GPU nodes we are allocating and reporting in GPU hours.

We have adopted the convention that projects containing CPU hours for use on CPU nodes will end in -CPU, while those holding GPU hours for use on GPU nodes end in -GPU.

The projects listed by mybalance are the projects you may specify in SLURM submissions either through

#SBATCH -A project

in the job submission script or equivalently on the command line with

sbatch -A project ...

Note that -CPU projects should be used for CPU jobs and -GPU projects for GPU jobs. See the Submitting jobs section for details on submitting to each cluster.

How many core hours does some other project or user have?

gbalance -p T2BENCH-SL2-CPU
User           Usage | Account Usage         | Account Limit Available (hours)
---------- --------- + -------------- ------ + ------------- ---------
xyz10              0 | T2BENCH-SL2-CPU     0 | 200,000          200,000

This outputs the total usage in core hours accumulated to date for the project, the total awarded and total remaining available (i.e. to all members). It also prints the component of the total usage due to each member.

I would like a listing of all jobs I have submitted through a certain project and between certain times

gstatement -p SUPPORT-CPU -u xyz10 -s "2023-09-06-00:00:00" -e "2023-09-06-23:59:59"
JobID User Account JobName Partition End ExitCode State CompHrs
------------ --------- ---------- ---------- ---------- ------------------- -------- ---------- --------
26658759         xyz10 support-c+        gmx    icelake 2023-09-06T09:09:24      2:0     FAILED      0.0
26658762         xyz10 support-c+    gmx_mpi    icelake 2023-09-06T09:10:27      0:0 CANCELLED+      0.0
26658989         xyz10 support-c+ sys/dashb+     cclake 2023-09-06T17:48:58      0:0    TIMEOUT     16.0
26659000         xyz10 support-c+ _interact+     cclake 2023-09-06T09:55:36      0:0    TIMEOUT      0.2
...

This lists the charge for each job in the CompHrs column. Since this example queries usage of a -CPU project, these are CPU core hours. Similarly, for a -GPU project they would be GPU hours.
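
For reference, a roughly equivalent query using the native sacct command might look like the following (a sketch only: gstatement derives the CompHrs column itself, so sacct reports elapsed time and allocated CPUs rather than the charge directly):

sacct -u xyz10 -A support-cpu -S 2023-09-06T00:00:00 -E 2023-09-06T23:59:59 \
      --format=JobID,User,Account,JobName,Partition,End,ExitCode,State,Elapsed,AllocCPUS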

I would like to add core hours to a particular member of my group

gdeposit -z 10000 -p halos-sl2-spqr1-gpu

The coordinator of the HALOS-SL2-GPU project might use this command to add 10000 GPU hours to the HALOS-SL2-SPQR1-GPU subproject assigned to the user spqr1. Note that if a compute hour limit applies to the parent of the project in the project hierarchy - i.e. if the parent project HALOS-SL2-GPU has an overall compute hour limit (which it almost certainly does) - then the global limit will still apply across all per-user projects.

Compute hours may be added to a project by a designated project coordinator user. Reducing the compute hours available to a project is also possible by adding a negative number of hours via the --time= syntax, e.g. the following command undoes the above:

gdeposit --time=-10000 -p halos-sl2-spqr1-gpu

Submitting jobs

Sample submission scripts

In normal use of SLURM, one creates a batch job: a shell script containing the commands to run, together with the job's resource requirements encoded as specially formatted shell comments at the top of the script. The batch job script is then submitted to SLURM with the sbatch command. A job script can be resubmitted with different parameters (e.g. different sets of data or variables).

Please copy and edit the sample submission scripts that can be found under

/usr/local/Cluster-Docs/SLURM

New user accounts also have symbolic links to template files in their home directories. Lines beginning #SBATCH are directives to the batch system. The rest of each directive specifies arguments to the sbatch command. SLURM stops reading directives at the first executable line (i.e. the first line that is non-blank and does not begin with #).
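
As a minimal illustration of this rule (MYPROJECT-CPU is a placeholder):

#!/bin/bash
#! The two lines below are read as directives:
#SBATCH -A MYPROJECT-CPU
#SBATCH --time=01:00:00

#! The first executable line ends directive parsing:
echo "Starting job"

#! This directive comes too late and is silently ignored:
#SBATCH -p cclake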

The main directives to modify are as follows:

#! Which project should be charged:
#SBATCH -A MYPROJECT-CPU
#! Which partition/cluster am I using?
#SBATCH -p cclake
#! How many nodes should be allocated? If not specified SLURM
#! assumes 1 node.
#SBATCH --nodes=1
#! How many tasks will there be in total? By default SLURM
#! will assume 1 task per node and 1 CPU per task.
#SBATCH --ntasks=56
#! How much memory in MB is required _per node_? Leaving this unset
#! (as here) gives a default amount per task. Requesting more memory
#! per task than the default increases the number of CPUs allocated.
##SBATCH --mem=
#! How much wallclock time will be required?
#SBATCH --time=02:00:00

In particular, the name of the project is required for the job to be scheduled (use the command mybalance to check what this is for you in case of doubt). Charging is reported in units of compute hours (what these represent depends on the cluster).

See the following sections for more details on the setting of directives for each of the CSD3 clusters.

Cascade Lake/Ice Lake/Sapphire Rapids

The cclake-* partitions assign usage in units of CPU core hours. By convention projects containing CPU core hours have names ending in -CPU.

Jobs must request either the cclake or the cclake-himem partition, i.e.

#SBATCH -p cclake

or

#SBATCH -p cclake-himem

and will be allocated the number of CPUs required for the number of tasks requested and a corresponding amount of memory.

By default, the cclake partition provides 1 CPU and 3420MB of RAM per task, and the cclake-himem partition provides 1 CPU and 6840MB per task.

Requesting more CPUs per task, or more memory per task, may both increase the number of CPUs allocated (and hence the charge). It is more cost efficient to submit jobs requiring more than 3420MB per task to the cclake-himem partition since more memory per CPU is natively available there.
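
As a hypothetical illustration of how a memory request can drive the CPU allocation (and hence the charge), assuming the 3420MB-per-CPU block size described above:

#! 4 tasks, but 27360MB of memory requested per node:
#SBATCH -p cclake
#SBATCH --ntasks=4
#SBATCH --mem=27360

#! 27360MB / 3420MB per CPU = 8 blocks, so the job is allocated (and
#! charged for) 8 CPUs even though only 4 tasks run. On cclake-himem
#! (6840MB per CPU) the same memory request corresponds to only 4 CPUs,
#! which is why memory-hungry jobs are cheaper there.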

NB Hyperthreading is disabled so there is no distinction between CPUs and cores.

In addition to cclake there are now also icelake and sapphire partitions, which work in a similar way. The precise number of CPUs per node and the amount of memory available per CPU differ across these node types since the hardware differs. Please see the directory

/usr/local/Cluster-Docs/SLURM/

for template submission scripts appropriate to each of these node types (including for GPU).

Wilkes3-GPU

Wilkes3-GPU assigns usage in units of GPU hours. By convention projects containing GPU hours have names ending in -GPU.

Jobs require the partition ampere, i.e.

#SBATCH -p ampere

and may request any number of GPUs per node from the range 1 to 4, which is done via the directive

#SBATCH --gres=gpu:N

where 1 <= N <= 4.

Each GPU node contains 4 NVIDIA Ampere A100 GPUs, with 1000GB host RAM and two AMD EPYC 64-core processors.

Any jobs requesting more than one node must request 4 GPUs per node. Jobs less than one node in size will be prevented from requesting more than 3 CPUs per GPU. The enforcement is performed by a job submission filter which will produce an explanatory message if it rejects a job outright.
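
Putting this together, a minimal single-node, two-GPU submission might look like the sketch below (MYPROJECT-GPU and my_gpu_application are placeholders; see the template scripts under /usr/local/Cluster-Docs/SLURM for complete examples, including the required environment modules):

#!/bin/bash
#SBATCH -A MYPROJECT-GPU
#SBATCH -p ampere
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00

#! Load the appropriate modules (see the template scripts), then run:
./my_gpu_application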

Submitting the job to the queuing system

The command sbatch is used to submit jobs, e.g.

sbatch submission_script

The command will return a unique job identifier, which is used to query and control the job and to identify output. See the man page (man sbatch) for more options.

The following more complex example submits a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7) to the project STARS-SL2-CPU:

sbatch --array=1-7:2 -A STARS-SL2-CPU submission_script

Deleting jobs

To cancel a job (either running or still queuing) use scancel:

scancel <jobid>

The <jobid> is printed when the job is submitted; alternatively, use the squeue, qstat or showq commands to obtain the job ID.
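
For example, to list your own queued and running jobs and then cancel one of them (the job ID below is a placeholder):

squeue -u $USER
scancel 12345678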

Array jobs

Array jobs allow the submission of multiple similar jobs. Array jobs should be preferred to multiple sbatch calls in a loop. An example of a submission script is shown below. In this example, 32 jobs are submitted. Each job creates a directory named after its SLURM_ARRAY_TASK_ID, changes into it, and runs the example executable with command line argument 5.

#!/bin/bash
#! This line is a comment
#! Make sure you only have comments and #SBATCH directives between here and the end of the #SBATCH directives, or things will break
#! Name of the job:
#SBATCH -J test_job
#! Account name for group, use SL2 for paying queue:
#SBATCH -A MYPROJECT-CPU
#! Output filename:
#! %A means slurm job ID and %a means array index
#SBATCH --output=test_job_%A_%a.out
#! Errors filename:
#SBATCH --error=test_job_%A_%a.err

#! Number of nodes to be allocated for the job (for single core jobs always leave this at 1)
#SBATCH --nodes=1
#! Number of tasks. By default SLURM assumes 1 task per node and 1 CPU per task. (for single core jobs always leave this at 1)
#SBATCH --ntasks=1
#! How many cores will be allocated per task? (for single core jobs always leave this at 1)
#SBATCH --cpus-per-task=1
#! Estimated runtime: hh:mm:ss (job is force-stopped after if exceeded):
#SBATCH --time=00:10:00
#! Estimated maximum memory needed (job is force-stopped if exceeded):
#! RAM is allocated in 3420mb blocks; you are charged per block used,
#! and unused fractions of blocks will not be usable by others.
#SBATCH --mem=3420mb
#! Submit a job array with index values between 0 and 31
#! NOTE: This must be a range, not a single number (i.e. specifying '32' here would only run one job, with index 32)
#SBATCH --array=0-31

#! This is the partition name.
#SBATCH -p cclake

#! Mail alerts at start, end and abort of execution.
#! Emails go to your registered address by default;
#! a different address can be specified with --mail-user if needed.
##SBATCH --mail-type=ALL

#! Don't put any #SBATCH directives below this line

#! Modify the environment seen by the application. For this example we need the default modules.
. /etc/profile.d/modules.sh                # This line enables the module command
module purge                               # Removes all modules still loaded
module load rhel7/default-peta4            # REQUIRED - loads the basic environment

#! The variable $SLURM_ARRAY_TASK_ID contains the array index for each job.
#! In this example, each job will be passed its index, so each output file will contain a different value
echo "This is job" $SLURM_ARRAY_TASK_ID

#! Command line that we want to run:
jobDir=Job_$SLURM_ARRAY_TASK_ID
mkdir $jobDir
cd $jobDir

../example 5

The C++ source code of the example executable is shown below; it can be compiled with g++ -std=c++11 example.cpp -o example, where example.cpp is the file containing the source code.

#include <iostream>
#include <chrono>
#include <fstream>
#include <thread>

// Write an increasing counter to example.txt in the current working
// directory once per second for 500 seconds, echoing it to stdout as well.
// The command line argument supplied by the submission script is accepted
// but not used.
int main(int argc, char *argv[])
{
    std::ofstream myfile;
    myfile.open("example.txt");

    for (int i=0; i<500; i++)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));

        std::cout << "Count: " << i << std::endl;
        myfile << i << std::endl;
    }

    myfile.close();
}

The user is encouraged to experiment with this example before attempting more complex job array submissions.
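
For instance, assuming the script above has been saved as array_job.sh (the name is arbitrary) and example.cpp has been compiled as described, a quick way to try it out is:

sbatch array_job.sh        # submit the 32-element job array
squeue -u $USER            # watch the array tasks start and finish
cat Job_3/example.txt      # inspect the output written by array index 3
cat test_job_*_3.out       # and the corresponding standard output file

Each array task runs for roughly 500 seconds, comfortably within the 10 minute time limit requested in the script.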