Slurm tutorial

To use the compute nodes in our Slurm clusters (HPC Cluster LEO5 and teaching cluster LCC2), you submit jobs to the batch job scheduling system Slurm (unlike LEO3e and LEO4, which still use SGE).

As a user you may want to become familiar with the following commands: sbatch, srun, scancel, squeue, sacct and scontrol, which are briefly described in the following sections.

For more information see the respective man pages of these commands. Detailed information may be found in the Slurm documentation pages.

If you want to re-use existing work, we have installed a tool that converts SGE job scripts to their Slurm equivalents.
You call it like this:

/usr/site/hpc/bin/sge2slurm.py sge-jobscript.sge > slurm-jobscript.slurm

Important differences between Slurm and SGE

Essentially SGE and Slurm are very similar as both are batch schedulers. But apart from the different commands and their options (described below) there are these important design differences between our SGE-setup and our Slurm configuration:

  • Memory: the memory-related options of Slurm apply to the amount of "real" memory your programs use (the resident set size, RSS). In SGE you request virtual memory (-l h_vmem). As a result, when moving to Slurm you might be able to reduce your memory reservations considerably, since some programs allocate much more virtual memory than they actually occupy as RSS.
  • Resource-over-consumption: we configured Slurm in a way that when your program uses more resources (CPUs and memory) than you requested in your job-options it will behave in a different way from SGE:
    • If a program exceeds its memory limit, it will not be terminated but will continue to run. However, parts of its memory will be paged out to (and back in from) disk, which slows things down extremely, so you should definitely avoid this.
    • Slurm uses so-called control groups to confine your programs to the requested number of cores (per node). That is, if one of your programs spawns more processes or threads than the number of cores you requested, it will not impair the performance of other jobs on the same node, but your job may run more slowly than expected.
  • The default run time on LEO5 is 3 days and may be extended to 10 days if needed. Please note that excessive run times negatively affect the overall responsiveness of the cluster.

Submitting jobs (sbatch)

The command sbatch is used to submit jobs to the batch-system using the following syntax:

sbatch [options] [job_script.slurm [ job_script_arguments ...]]

where job_script.slurm represents the (relative or absolute) path to a simple shell script containing the commands to be run on the cluster nodes. We recommend using the suffix .slurm to distinguish job scripts from scripts intended for other uses. If no file is specified, sbatch will read a script from standard input.

The first line of this script needs to start with #! followed by the path to an interpreter, for instance #!/bin/sh or #!/bin/bash (or any other available shell of your choice). Note, however, that we currently only support Bash (/bin/bash). Your script may use common Bash functionality such as I/O redirection using the < and > characters, loops, case constructs etc., but please keep it simple. If your setup uses a different shell or needs a complex script, simply call your script from within the batch job script.

Slurm will start your job on any nodes that have the necessary resources available or put your job in a waiting queue until requested resources become available. Slurm uses recent resource consumption in deciding which waiting jobs to start first (fair share scheduling). After submitting your job, you may continue to work or log out - job scheduling is completely independent of interactive work. The only way to stop a job after it has been submitted is to use the scancel command described below.

If you submit more than one job at the same time, you need to make sure that individual jobs (which may be executed simultaneously) do not interfere with each other by e.g. writing to the same files.

The options tell the sbatch command how to behave: job name, use of main memory and run time, parallelization method, etc.

There are two ways of supplying these options to the sbatch command:

Method 1:

You may add the options directly to the sbatch command line, like:

sbatch --job-name=job_name --ntasks=number_of_tasks --cpus-per-task=number_of_cpus_per_task --mem-per-cpu=memory_per_cpu job_script.slurm [ argument ... ]

Method 2 (recommended):

Add the sbatch options to the beginning of your job_script, one option per line.

Note that the lines prefixed with #SBATCH are parsed by the sbatch command, but are treated as comments by the shell.

Taking the above example, the contents of job_script.slurm would look like:

#!/bin/bash

#SBATCH --job-name=job_name
#SBATCH --ntasks=number_of_tasks
#SBATCH --cpus-per-task=number_of_cpus_per_task
#SBATCH --mem-per-cpu=memory_per_cpu

./your_commands

If you give conflicting options both in the job file and on the sbatch command line, the command line options take precedence. So you can use options in the job script to supply defaults that may be overridden on the sbatch command line.

Overview of commonly used options to sbatch

Job Name, Input, Output, Working Directory
--job-name=name
Name of the job.
Default: File name of the job script.
The job name is used in the default output of squeue (see below) and may be used in the filename patterns of the input, output and error files.
--output=filename_pattern
Standard output of the job script will be connected to the file specified by filename_pattern.
By default both standard output and standard error are directed to the same file. For normal jobs the default file name is slurm-%j.out, where "%j" is replaced by the job ID. For job arrays, the default file name is slurm-%A_%a.out, "%A" is replaced by the job ID and "%a" by the array index.
The working directory of the job script is the current working directory (where sbatch was called) unless the --chdir argument is given.
Filename patterns may use the following place-holders (for a full list see the documentation of sbatch):
  • %x   Job name.
  • %j   Job-ID.
  • %t   Task identifier (aka rank). This will create a separate file per task.
  • %N   Short hostname. This will create a separate file per node.
Example: -o %x_%j_%N.out
Please note: two jobs using the same output file name will clobber each other's output. Use the default or make sure that your filename_pattern includes %j.
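To see how a filename pattern turns into concrete file names, the placeholder substitution can be mimicked in a few lines of Bash. This is only an illustrative sketch: expand_pattern is a hypothetical helper, not part of Slurm, and the job name, job IDs and hostname are made-up values.

```shell
#!/bin/bash
# expand_pattern PATTERN JOB_NAME JOB_ID HOSTNAME
# Hypothetical helper that mimics how %x, %j and %N are filled in.
expand_pattern() {
    local out=$1
    out=${out//\%x/$2}   # %x -> job name
    out=${out//\%j/$3}   # %j -> job ID
    out=${out//\%N/$4}   # %N -> short hostname
    printf '%s\n' "$out"
}

# Two jobs with a pattern that contains %j get distinct files ...
expand_pattern '%x_%j_%N.out' myjob 4711 n001
expand_pattern '%x_%j_%N.out' myjob 4712 n001

# ... while a pattern without %j maps both jobs to the same file.
expand_pattern 'result_%x.out' myjob 4711 n001
expand_pattern 'result_%x.out' myjob 4712 n001
```

The last two calls print the same name, which is exactly the situation in which two jobs would overwrite each other's output.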
--error=filename_pattern
Standard error of the job script will be connected to the file specified by filename_pattern as described above.
--input=filename_pattern
Standard input of the job script will be connected to the file specified by filename_pattern. By default, "/dev/null" is connected to the script's standard input.
--chdir=directory
Execute the job in the specified working directory. Input/output file names are relative to this directory.
Default: current working directory of sbatch-command.
Notifications
--mail-user=email_address
Notifications will be sent to this email address.
Default is to send mails to the local user submitting the job.
--mail-type=[TYPE|ALL|NONE]
Send notifications for the specified type of events (default: NONE).
Possible values for TYPE are BEGIN, END, FAIL, REQUEUE, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50, ARRAY_TASKS. Multiple types may be specified as a comma-separated list.
ALL is equivalent to BEGIN,END,FAIL,REQUEUE,STAGE_OUT.
TIME_LIMIT sends a notification when the time limit of the running job is reached. TIME_LIMIT_XX sends one when XX percent of the limit is reached.
When ARRAY_TASKS is specified, BEGIN, END and FAIL apply to each task in the job array; we strongly advise against using this. Without it, these messages are sent for the array as a whole.
Time Limits
--time=time
Set a limit on the run time (wallclock time from start to termination) of the job. The default depends on the partition used.
When the time limit is reached, each task in each job step receives a TERM signal followed (after 30 seconds) by a KILL signal. So "trapping" the TERM signal and gracefully shutting down the script is possible.
Times may be specified as:
minutes,
minutes:seconds,
hours:minutes:seconds,
days-hours,
days-hours:minutes, or
days-hours:minutes:seconds.
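To illustrate how these formats are read, the sketch below converts a time specification to seconds. It assumes the six forms listed in the sbatch man page (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds). slurm_time_to_seconds is a hypothetical helper, not a Slurm command. Note in particular that a value with a single colon, such as 3:00, means minutes:seconds.

```shell
#!/bin/bash
# slurm_time_to_seconds SPEC
# Hypothetical helper: convert a --time specification to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    if [[ $spec == *-* ]]; then
        # With a day prefix the remaining fields are hours[:minutes[:seconds]].
        days=${spec%%-*}
        spec=${spec#*-}
        IFS=: read -r h m s <<< "$spec"
    else
        case $spec in
            *:*:*) IFS=: read -r h m s <<< "$spec" ;;  # hours:minutes:seconds
            *:*)   IFS=: read -r m s <<< "$spec" ;;    # minutes:seconds
            *)     m=$spec ;;                          # minutes
        esac
    fi
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 90        # 90 minutes         -> 5400
slurm_time_to_seconds 3:00      # 3 minutes, not 3 h -> 180
slurm_time_to_seconds 3:00:00   # 3 hours            -> 10800
slurm_time_to_seconds 1-12      # 1 day and 12 hours -> 129600
```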
If you know the run time of your job beforehand, it is a good idea to specify it with this option, as this helps the scheduler plan its resources and may result in an earlier start of your job.
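The TERM/KILL sequence described above for --time can be used to shut a job down cleanly. The following job script is a minimal sketch, not a tested recipe for our clusters: on_term is a hypothetical handler name, sleep stands in for your real program, and it is an assumption that your program reacts sensibly to a forwarded TERM signal. The workload is started in the background and waited for because Bash runs a trap handler only after the current foreground command has returned.

```shell
#!/bin/bash
#SBATCH --job-name=trap_demo
#SBATCH --ntasks=1
#SBATCH --time=10

# Hypothetical handler: runs when the batch script receives TERM.
on_term() {
    echo "TERM received, shutting down"
    # Forward the signal to the workload (if any) and wait for it to finish.
    if [ -n "$pid" ]; then
        kill -TERM "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true
    fi
    # Copy back results or remove scratch files here; this has to finish
    # before the KILL signal arrives.
}
trap on_term TERM

# Placeholder workload; replace with ./your_program
sleep 1 &
pid=$!
wait "$pid"
```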
Memory Allocation
--mem=size[K|M|G|T]
Specify the memory required per node.

--mem-per-cpu=size[K|M|G|T]
Specify the memory required per CPU.

--mem-per-gpu=size[K|M|G|T]
Specify the memory required per GPU.

Nodes, Tasks, and CPUs
--ntasks=ntasks
Request resources for a total number of ntasks tasks.
Without further options (see below) the tasks are placed on free resources on any node (nodes are "filled up"). For MPI jobs, tasks map to MPI ranks.
Slurm will set the environment variable $SLURM_NTASKS to the number ntasks that you requested.
--nodes=n[-m]
--ntasks-per-node=ntasks
Request at least n and up to m nodes with ntasks each. If only one number is given (and not a range) it is interpreted as exactly this number of nodes.

Please note: Unless you have a good reason to explicitly control placement of tasks, do not use these options, but for best results let the system decide.
--cpus-per-task=ncpus
Tell Slurm that each task will require ncpus CPUs. Default is one CPU per task. This is the level at which multithreading (e.g. POSIX threads or OpenMP threads) is specified.
Slurm will set the environment variable $SLURM_CPUS_PER_TASK to the number ncpus that you requested.
MPI + OpenMP hybrid jobs are natively supported by simultaneously setting ntasks and ncpus to values greater than 1.
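A hybrid job script could be sketched as follows. The values 4 and 8 are arbitrary, your_hybrid_executable is a placeholder, and the launch line is commented out so that the script can be dry-run outside of a job. Depending on the MPI library, additional mapping or binding options may be needed so that the threads of a rank are spread over its CPUs; check the documentation of your MPI installation.

```shell
#!/bin/bash
#SBATCH --job-name=hybrid_demo
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G

# Outside of a job the Slurm variables are unset; fall back to the values
# requested above so that the script can be tried by hand.
ntasks=${SLURM_NTASKS:-4}
cpus_per_task=${SLURM_CPUS_PER_TASK:-8}

# One OpenMP thread per CPU of each MPI rank.
export OMP_NUM_THREADS=$cpus_per_task

# The job occupies ntasks * cpus-per-task CPUs in total.
total_cpus=$(( ntasks * cpus_per_task ))
echo "MPI ranks: $ntasks, threads per rank: $OMP_NUM_THREADS, CPUs in total: $total_cpus"

# mpirun -n "$ntasks" ./your_hybrid_executable
```

Because memory is requested per CPU here, the total memory of the job scales with the same product (32 CPUs at 1G each in this sketch).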
GPUs
--gpus=[type:]number
Request number GPUs, optionally of type type. On LEO5, type is one of a30 or a40. GPU nodes have two GPUs installed on each node.
Job Arrays
--array=m-n[:step][%maxrunning]
Trivial parallelisation using a job array. This will start n-m+1 independent instances of your job (so-called "array tasks") with task IDs ranging from m to n inclusive; if a step is given, only every step-th ID is used and the number of tasks is reduced accordingly. At run time, each task has the following environment variables set:
Variable                 Meaning
SLURM_ARRAY_TASK_COUNT   total number of tasks of your array
SLURM_ARRAY_TASK_ID      ID of the current task
SLURM_ARRAY_TASK_MAX     last ID
SLURM_ARRAY_TASK_MIN     first ID
SLURM_ARRAY_TASK_STEP    step (increment value) of the IDs of the array

Appending %maxrunning to the array specification allows you to specify a maximum number of simultaneously running tasks. E.g.
-a 0-9%4 will run ten tasks in total but only a maximum of four simultaneously.

Instead of a range of IDs you can also give a comma separated list of values.

The minimum task-ID is 0, the maximum is 75000.

If the number of your job instances is substantially higher than about 10 please do not use ARRAY_TASKS in --mail-type (see above).
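A typical use of a job array is to let each task work on its own input file. The following job script is a minimal sketch: the naming scheme input_N.dat, the helper input_for_task and the program name are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=0-9%4
#SBATCH --ntasks=1
#SBATCH --output=%x_%A_%a.out

# Hypothetical naming scheme: task N works on input_N.dat.
input_for_task() {
    echo "input_$1.dat"
}

# Outside of a job array the variable is unset; fall back to 0 so that the
# script can be tried by hand before submitting it.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input_file=$(input_for_task "$task_id")

echo "array task $task_id of ${SLURM_ARRAY_TASK_COUNT:-1} processes $input_file"
# ./your_program "$input_file"
```

The output pattern contains %A and %a, so every array task writes to its own file.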

Job script validation and start estimate
--test-only
The job script is validated but not submitted. Additionally an estimate is shown of when a job would be scheduled to run with the current settings given in the job script and on the command line.
A note on maximum stack size

Our systems are configured in a way that the maximum allowed size of the stack of your programs is unlimited (unlike the default in most Linux systems where it is limited to 8 MB). Most programs will not need this but some will benefit from it.
There are edge cases in which programs (possibly Fortran programs) will not work with an unlimited stack size. In that case, please limit the stack size in your job script before calling that program. With e.g.

ulimit -s 80000

you will set the limit to about 80 MB (80000 kB). This works because as a user you are allowed to lower the limit anytime.

Job submission examples

Parallel MPI job

The contents of your job script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash

# Name of your job.
#SBATCH --job-name=name

# Send status information to this email address. 
#SBATCH --mail-user=Karl.Mustermann@xxx.com

# Send an e-mail when the job has finished or failed. 
#SBATCH --mail-type=END,FAIL

# Start an MPI job with 80 single-threaded tasks
#SBATCH --ntasks=80
#SBATCH --cpus-per-task=1

# Alternatively, allocate resources for 80 MPI processes/tasks by
# placing exactly 10 tasks on each of 8 separate nodes like this:
## #SBATCH --ntasks-per-node=10
## #SBATCH --nodes=8
## #SBATCH --cpus-per-task=1
# Do this only when you have good reason to explicitly control
# task placement.

# Specify the amount of memory given to each MPI process
# in the job.
#SBATCH --mem-per-cpu=1G

module purge
module load openmpi/xx.yy.zz

mpirun -n $SLURM_NTASKS ./your_mpi_executable [extra arguments]

Parallel OpenMP jobs

The contents of your job script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash

# Name of your job.
#SBATCH --job-name=name

# Send status information to this email address. 
#SBATCH --mail-user=Karla.Musterfrau@xxx.com

# Send an e-mail when the job has finished or failed. 
#SBATCH --mail-type=END,FAIL

# Allocate one task on one node and six cpus for this task
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6

# Allocate 12 Gigabytes for the whole node/task
#SBATCH --mem=12G

# tell OpenMP how many threads to start
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_openmp_executable

Important: We have configured Slurm to use control groups in order to limit access of your job to memory and cpus.
If your job uses shared memory parallelization other than OpenMP, you should check that the number of CPU-intensive threads is consistent with the number of CPUs assigned to the job.
If you start more threads than you requested with the --cpus-per-task option, these will be forced to run on the requested number of cores, so they will compete with each other, possibly degrading the overall efficiency of your job.

Interactive jobs (srun --pty)

The submission of interactive jobs is useful in situations where a job requires some sort of direct intervention.

This is usually the case for X-Windows applications or in situations in which further processing depends on your interpretation of immediate results. A typical example for both of these cases is a graphical debugging session.

Note: Interactive sessions are also particularly helpful for getting acquainted with the system or when building and testing new programs.

Interactive sessions might be sequential or parallel:

Sequential
(one CPU on one node)
srun --pty bash
Parallel (shared memory)
(n CPUs on one node)
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=n --pty bash
Parallel (distributed memory)
(n CPUs on each of m nodes)
Either
srun --ntasks-per-node=n --nodes=m --cpus-per-task=1 --pty bash
or
srun --ntasks=x --pty bash
x being n * m.

In a multi-node parallel (aka distributed memory) interactive session you can use srun (or, after loading an MPI module, mpiexec) to run programs on all nodes.

For using an X-Windows application, supply --x11 as a further option, e.g. like this:

srun --pty --x11 xterm -ls

Prepare your session as needed, e.g. by loading all necessary modules within the provided xterm and then start your program on the executing node.

Note: Make sure to end your interactive session (logging out or closing the xterm window) as soon as it is no longer needed!

Monitoring jobs (squeue, srun and sacct)

To get information about running or waiting jobs use

squeue [options]
or
sq [options]

The command squeue displays a list of all running or waiting jobs of all users.

The locally implemented command sq displays more fields than squeue does by default.

The (in our opinion) most interesting additional field is START_TIME which for pending jobs shows the date and time when Slurm plans to run this job. It is always possible that a job will start earlier but not (much) later.

Slurm calculates this field only once per minute so it might not contain a meaningful value right after submitting a job.

squeue and sq display the jobs of all users which might not always be what you want. So we created another shortcut for you:
squ
which is a shorter way of typing sq -u $USER and thus lists only the jobs belonging to you.

You can further inspect a running job by "connecting" to it with this command:

srun --jobid=jobid --pty bash

This will open an interactive shell as a job step under an already allocated job. I.e. you will be able to see how your job is "behaving". For distributed memory jobs you will get a shell at the first node used by your job.

To get information about past jobs use

sacct -X [options]
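As a starting point, the sketch below wraps two sacct calls in shell functions. The function names job_summary and job_memory are hypothetical; the options -j and --format and the field names are taken from the sacct man page. The second call omits -X because MaxRSS is recorded per job step, not for the allocation as a whole.

```shell
#!/bin/bash
# One line per job: state, run time and exit code.
job_summary() {
    sacct -X -j "$1" --format=JobID,JobName,State,Elapsed,ExitCode
}

# One line per job step: requested memory versus peak resident memory.
job_memory() {
    sacct -j "$1" --format=JobID,State,ReqMem,MaxRSS,Elapsed
}
```

After defining the functions, job_summary 4711 shows the outcome of the job with ID 4711 (a made-up ID), and job_memory 4711 lets you compare MaxRSS with ReqMem to see whether a memory request could be reduced.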


Altering jobs (scontrol update)

You can change the configuration of pending jobs with

scontrol update job jobid SETTING=VALUE [...]

To find out which settings are available, we recommend first running
scontrol show job jobid

If, for example, you then want to change the run-time limit of your job to three hours, you would use
scontrol update job jobid TimeLimit=3:00:00

Some adaptations might require you to change more than one setting. If, e.g., your job is flexible with respect to the number of tasks and nodes used and you want to change those after having submitted the job, you would have to run
scontrol update job jobid NumTasks=xx NumCPUs=xx NumNodes=y

Deleting jobs (scancel)

To delete pending or running jobs you have to look up their numerical job identifiers aka job-ids. You can e.g. use squeue or squ and take the value(s) from the JOBID column.
Then execute

scancel job-id [...]

and the corresponding job will be removed from the wait queue or stopped if it's running.
You can specify more than one job identifier after scancel.
If you want to cancel all your pending and running jobs without being asked for confirmation, you may use
squ -h -o %i | xargs scancel
These options tell squ (and thus squeue) to output only the JOBID column (-o %i) and to omit the column header (-h).
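scancel can also select jobs by owner or state itself, which avoids the pipeline. A minimal sketch, in which cancel_my_jobs is a hypothetical wrapper name and --user and --state are options documented in the scancel man page:

```shell
#!/bin/bash
# Cancel all jobs of the calling user; extra scancel options are passed through.
cancel_my_jobs() {
    scancel --user="$USER" "$@"
}
```

Called without arguments, cancel_my_jobs removes all your pending and running jobs; cancel_my_jobs --state=PENDING would cancel only your waiting jobs and leave running ones untouched.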
