Slurm tutorial

To use the compute nodes in our HPC clusters, you submit jobs to the batch job scheduling system Slurm. As a user you may want to become familiar with the following commands: sbatch, srun, scancel, squeue, sacct and scontrol, which are briefly described in the following sections.

For more information see the respective man pages of these commands. Detailed information may be found in the Slurm documentation pages.

Submitting jobs (sbatch)

The command sbatch is used to submit jobs to the batch-system using the following syntax:

sbatch [options] [job_script [ job_script_arguments ...]]

where job_script represents the path to (or if it's in the current working directory the name of) a simple shell script containing the commands to be run on the cluster nodes. If no file is specified, sbatch will read a script from standard input.

The first line of this script must start with #! followed by the path to an interpreter, for instance #!/bin/sh or #!/bin/bash (or any other available shell of your choice), but note that we currently only support Bash (/bin/bash). Your script may use common Bash functionality such as I/O redirection using the < and > characters, loops, case constructs etc., but please keep it simple. If your setup requires a different shell or a complex script, simply call that script from within the batch job script.
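Such a script can be as small as this sketch (the job name and the echoed message are placeholders of our own choosing):

```shell
#!/bin/bash
#SBATCH -J hello                # job name; read by sbatch, a comment to bash

# Everything below the #SBATCH lines is ordinary Bash and runs on the
# allocated compute node.
echo "Job started on $(hostname)"
```

With the defaults described below, its output would end up in a file named slurm-<jobid>.out in the directory where sbatch was called.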

Slurm will start your job on any nodes that have the necessary resources available or put your job in a waiting queue until requested resources become available. Slurm uses recent consumption of resources in deciding which waiting jobs to start first ("fair share" scheduling). After submitting your job, you may continue to work or log out - job scheduling is completely independent of interactive work. The only way to stop a job after it has been submitted is to use the scancel command described below.

If you submit more than one job at the same time, you need to make sure that individual jobs (which may be executed simultaneously) do not interfere with each other, e.g. by writing to the same files.

The options tell the sbatch command how to behave: job name, use of main memory and run time, parallelization method, etc.

There are two ways of supplying these options to the sbatch command:

Method 1:

You may add the options directly to the sbatch command line, like:

sbatch -J job_name --ntasks=number_of_tasks --ntasks-per-node=number_of_tasks_per_node --cpus-per-task=number_of_cpus_per_task --mem-per-cpu=memory_per_cpu job_script [ argument ... ]
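With concrete values this might look like the following (job name, numbers and script name are made-up examples):

```shell
# Request 8 tasks, 4 per node, 1 CPU per task and 2 GB per CPU (hypothetical values):
sbatch -J myjob --ntasks=8 --ntasks-per-node=4 --cpus-per-task=1 --mem-per-cpu=2G job_script.sh
```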

Method 2 (recommended):

Add the sbatch options to the beginning of your job_script, one option per line.

Note that the lines prefixed with #SBATCH are parsed by the sbatch command, but are treated as comments by the shell.

Taking the above example, the contents of job_script would look like:


#!/bin/bash
# --job-name=job_name is possible as well:
#SBATCH -J job_name
#SBATCH --ntasks=number_of_tasks
#SBATCH --ntasks-per-node=number_of_tasks_per_node
#SBATCH --cpus-per-task=number_of_cpus_per_task
#SBATCH --mem-per-cpu=memory_per_cpu


If you give conflicting options both in the job script and on the sbatch command line, the command line options take precedence. So you can use options in the job script to supply defaults that may be overridden on the sbatch command line.

Overview of commonly used options to sbatch

Job Name, Input, Output, Working Directory
-J name or --job-name=name Name of the job.
Default: File name of the job script.
The job name is used in the default output of squeue (see below) and may be used in the filename pattern of input-, output- and error-file.
By default both standard output and standard error are directed to the same file. For normal jobs the default file name is slurm-%j.out, where "%j" is replaced by the job ID. For job arrays, the default file name is slurm-%A_%a.out, "%A" is replaced by the job ID and "%a" by the array index.
The working directory of the job script is the directory where sbatch was called unless the --chdir argument is given.
-o filename pattern or --output=filename pattern
Standard output of the job script will be connected to the file specified by filename pattern.
Filename patterns may use the following place-holders (for a full list see the documentation of sbatch):
  • %x Job name.
  • %j Job-ID.
  • %t Task identifier (aka rank). This will create a separate file per task.
  • %N Short hostname. This will create a separate file per node.
Example: -o %x_%j_%N.out
-e filename pattern or --error=filename pattern Standard error of the job script will be connected to the file specified by filename pattern.
-i filename pattern or --input=filename pattern Standard input of the job script will be connected to the file specified by filename pattern. By default, "/dev/null" is connected to the script's standard input.
-D directory or --chdir=directory Execute the job in the specified working directory. Input/output file names are relative to this directory.
--mail-user=email address Notifications will be sent to this email address.
Default is to send mails to the local user submitting the job.
--mail-type=[TYPE|ALL|NONE] Send notifications for the specified type of events (default: NONE).
Possible values for TYPE are BEGIN, END, FAIL, REQUEUE, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50 and ARRAY_TASKS. Multiple types may be specified as a comma-separated list.
TIME_LIMIT sends a notification when the time limit of the running job is reached. TIME_LIMIT_XX sends one when XX percent of the limit is reached.
When ARRAY_TASKS is specified, BEGIN, END and FAIL apply to each task in the job array (we strongly advise against using this)! Without it, these messages are sent for the array as a whole.
-t time or --time=time Set a limit on the run time (wallclock time from start to termination) of the job. The default (=maximum) depends on the used partition.
When the time limit is reached, each task in each job step receives a TERM signal followed (after 30 seconds) by a KILL signal. So "trapping" the TERM signal and gracefully shutting down the script is possible.
Times may be specified as "minutes", "days-hours" or "[[days-]hours:]minutes[:seconds]".
If you know the runtime of your job beforehand, it's a good idea to specify it with this option, as this may result in an earlier start of your job.
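The TERM-then-KILL sequence means a job can shut down gracefully when the limit is reached; a sketch (the executable name and the checkpointing step are hypothetical):

```shell
#!/bin/bash
#SBATCH -J trap_demo
#SBATCH --time=01:00:00        # hypothetical one-hour limit

# Slurm sends SIGTERM at the time limit and SIGKILL 30 seconds later,
# so the handler must finish quickly.
cleanup() {
    echo "time limit reached, saving state"
    # e.g. copy checkpoint files to permanent storage here
    exit 0
}
trap cleanup TERM

./your_long_running_program    # placeholder for the real workload
```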
--mem=size[K|M|G|T] Specify the memory required per node.

--mem, --mem-per-cpu and --mem-per-gpu are mutually exclusive.
Without a unit values are in megabytes.

--mem-per-cpu=size[K|M|G|T] Specify the memory required per cpu.
--mem-per-gpu=size[K|M|G|T] Specify the memory required per GPU.
-n number or --ntasks=number Request resources for number tasks.
Without further options (see below) the tasks are placed on free resources on any node (nodes are "filled up").
-N n[-m] or --nodes=n[-m]
Request at least n and up to m nodes with ntasks each. If only one number is given (and not a range) it is interpreted as exactly this number of nodes.
-c n or --cpus-per-task=n Tell Slurm that each task will require n CPUs. Default is one CPU per task.
Job Arrays
-a 0-n[:step][%maxrunning] or --array=0-n[:step][%maxrunning] Trivial parallelisation using a job array. This will start independent instances of your job (so called "array tasks"). Each task runs with the environment variables SLURM_ARRAY_TASK_COUNT, SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_MAX, SLURM_ARRAY_TASK_MIN and SLURM_ARRAY_TASK_STEP set, whose values are the total number of tasks your array consists of, the ID of the current task, the last ID, the first ID and the step (increment value) of the IDs respectively.

Appending %maxrunning to the array specification allows you to specify a maximum number of simultaneously running tasks. E.g.
-a 0-9%4 will run ten tasks in total but only a maximum of four simultaneously.

Instead of a range of IDs you can also give a comma separated list of values.

The minimum task-ID is 0, the maximum is 75000.

If the number of tasks in your array is substantially higher than about ten, please do not use ARRAY_TASKS in --mail-type (see above).
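Put together, an array job script might look like this sketch (the input-file naming scheme and the program name are assumptions for illustration):

```shell
#!/bin/bash
#SBATCH -J array_demo
#SBATCH -a 0-9%4               # ten tasks, at most four running simultaneously

# Each array task sees its own index and picks its own input file.
echo "array task ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_COUNT}"
./process_one_file input_"${SLURM_ARRAY_TASK_ID}".dat   # hypothetical program and files
```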

Job submission examples

Parallel MPI jobs

The contents of your job script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)


#!/bin/bash

# Name of your job.
#SBATCH -J name

# Send status information to this email address.
#SBATCH --mail-user=your@email.address

# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL

# In this example we are allocating resources for 100 MPI processes/tasks,
# placing exactly 10 tasks on each of 10 separate nodes like this:
#SBATCH --ntasks-per-node=10
#SBATCH --nodes=10
#SBATCH --cpus-per-task=1

# This is the same as above but leaving the decision of how many
# tasks run on which nodes to Slurm (aka "fillup"):
## #SBATCH --ntasks=100
## #SBATCH --cpus-per-task=1

# Specify the amount of memory given to each MPI process
# in the job.
#SBATCH --mem-per-cpu=1G

module purge
module load openmpi/xx.yy.zz

mpirun -np $SLURM_NTASKS ./your_mpi_executable [extra arguments]

Parallel OpenMP jobs

The contents of your job script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)


#!/bin/bash

# Name of your job.
#SBATCH -J name

# Send status information to this email address.
#SBATCH --mail-user=your@email.address

# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL

# Allocate one task on one node and six cpus for this task
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6

# Allocate 12 Gigabytes for the whole node/task
#SBATCH --mem=12G

# Tell OpenMP how many threads to start.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./your_openmp_executable [extra arguments]

Important: We have configured Slurm to use control groups (cgroups) to limit your job's access to memory and CPUs.
If your job uses shared-memory parallelization other than OpenMP, you should check that the number of CPU-intensive threads is consistent with the number of slots assigned to the job.
If you start more threads than you requested via the --cpus-per-task directive, they will all be confined to the requested number of cores and will interfere with each other, possibly degrading the overall efficiency of your job.
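One robust pattern is to derive the thread count from the allocation instead of hard-coding it; the fallback to 1 for runs outside a job is our own convention:

```shell
#!/bin/bash
# SLURM_CPUS_PER_TASK is set by Slurm inside a job; default to 1 elsewhere.
NTHREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "starting ${NTHREADS} worker threads"
# A threaded program would then be started with this value, e.g.
# ./your_threaded_program --threads "${NTHREADS}"   (hypothetical program)
```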

Interactive jobs (srun --pty)

The submission of interactive jobs is useful in situations where a job requires some sort of direct intervention.

This is usually the case for X-Windows applications or in situations in which further processing depends on your interpretation of immediate results. A typical example for both of these cases is a graphical debugging session.

Note: Interactive sessions are also particularly helpful for getting acquainted with the system or when building and testing new programs.

Interactive sessions might be sequential or parallel ones:

Sequential
(one CPU on one node)
srun --pty bash

Parallel (shared memory)
(n CPUs on one node)
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=n --pty bash

Parallel (distributed memory)
(n CPUs on each of m nodes)
srun --ntasks-per-node=n --nodes=m --cpus-per-task=1 --pty bash
or simply
srun --ntasks=x --pty bash
with x being n * m.

In a multi-node parallel (aka distributed memory) interactive session you can use srun (or after loading an MPI module mpirun) to run programs on all nodes.

For using an X-Windows application you have to supply --x11 as a further option, e.g. like this:

srun --pty --x11 xterm -ls

Prepare your session as needed, e.g. by loading all necessary modules within the provided xterm, and then start your program on the executing node.

Note: Make sure to end your interactive session (logging out or closing the xterm window) as soon as it is no longer needed!

Monitoring jobs (squeue and sacct)

To get information about running or waiting jobs use

squeue [options]
sq [options]

The command squeue displays a list of all running or waiting jobs of all users.

We have additionally configured your interactive shell so that the command sq is available, which displays more fields than squeue does by default.
The (in our opinion) most interesting additional field is START_TIME which for pending jobs shows the date and time when Slurm plans to run this job. There is always the possibility that a job will start earlier but not (much) later.
Slurm calculates this field only once per minute so it might not contain a meaningful value right after submission of a job.

squeue and sq display the jobs of all users, which might not always be what you want. So we created another shortcut for you which is a shorter way of typing sq -u $USER and thus lists only the jobs belonging to you.

To get information about past jobs use

sacct -X [options]
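A typical invocation restricts the time range and picks a few columns (the date is arbitrary; see the sacct man page for the full list of --format fields):

```shell
# One line per job; -X suppresses the individual job steps.
sacct -X --starttime=2024-01-01 --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```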


Altering jobs (scontrol update)

You can change the configuration of pending jobs with

scontrol update job jobid SETTING=VALUE [...]

To find out which settings are available we recommend first running
scontrol show job jobid.

If you then want to change, for example, the run-time limit of your job to three hours, you would use
scontrol update job jobid TimeLimit=3:00:00

Deleting jobs (scancel)

To delete pending or running jobs you have to look up their numerical job identifiers aka job-ids. You can e.g. use squeue or sq and take the value(s) from the JOBID column.
Then execute

scancel job-id [...]

and the corresponding job will be removed from the wait queue or stopped if it's running.
You can specify more than one job identifier after scancel.
If you want to cancel all your pending and running jobs without being asked for confirmation you may use
sq -h -o %i | xargs scancel.
Those options tell squeue to only output the JOBID column (-o %i) and to omit the column header (-h).
