Slurm tutorial

To use the compute nodes in our upcoming HPC cluster LEO5 and in our teaching-only cluster LCC2, you submit jobs to the batch job scheduling system Slurm (unlike all other clusters, which still use SGE). As a user, you will want to become familiar with the following commands: sbatch, srun, scancel, squeue, sacct and scontrol. They are briefly described in the following sections.

For more information see the respective man pages of these commands. Detailed information may be found in the Slurm documentation pages.

For those of you who want to re-use existing work, we have installed a tool that converts SGE job scripts to their Slurm equivalents.
You call it like this:

/usr/site/hpc/bin/sge2slurm.py sge-jobscript > slurm-jobscript

Important differences between Slurm and SGE

Essentially, SGE and Slurm are very similar, as both are batch schedulers. But apart from the different commands and their options, there are these important differences between our SGE setup and our Slurm configuration:
  • Memory: the memory-related options of Slurm apply to the "real" memory your programs use (aka RSS, resident set size). In SGE you asked for virtual memory. Since for some programs the virtual memory needs are much higher than their actual RSS, you might be able to reduce your memory reservations considerably (a way to measure actual RSS is shown below).
  • Resource over-consumption: we configured Slurm so that when your program uses more resources (CPUs and memory) than you requested in your job options, it behaves differently than SGE:
    • If a program exceeds its memory limit, it will continue to run, but part of its memory will be swapped out to disk. This slows things down extremely, so you should definitely avoid it.
    • If a program spawns more processes or threads than your requested number of cores, it will be confined to the requested cores. This means it will not "steal" cores from other jobs on the same node.
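
To get an idea of how much RSS a program actually needs, you can run it once, e.g. in a short test job, under GNU time, which reports the peak RSS (the program name is a placeholder):

/usr/bin/time -v ./your_program
# near the end of the report, look for:
# Maximum resident set size (kbytes): ...

Divide this value by the number of requested CPUs to obtain a reasonable value for --mem-per-cpu.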

Submitting jobs (sbatch)

The command sbatch is used to submit jobs to the batch-system using the following syntax:

sbatch [options] [job_script [ job_script_arguments ...]]

where job_script represents the path to (or if it's in the current working directory the name of) a simple shell script containing the commands to be run on the cluster nodes. If no file is specified, sbatch will read a script from standard input.

The first line of this script needs to start with #! followed by the path to an interpreter, for instance #!/bin/sh or #!/bin/bash (or any other available shell of your choice), but note that we currently only support Bash (/bin/bash). Your script may use common Bash functionality such as I/O redirection with the < and > characters, loops, case constructs etc., but please keep it simple. If your setup requires a different shell or a complex script, simply call that script from within the batch job script.
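
As a minimal sketch, a complete job script could be as simple as this (your_program is a placeholder):

#!/bin/bash

# any shell commands; they are executed on the first allocated node
echo "Job started on $(hostname)"
./your_program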

Slurm will start your job on any nodes that have the necessary resources available or put your job in a waiting queue until requested resources become available. Slurm uses recent consumption of resources in deciding which waiting jobs to start first ("fair share" scheduling). After submitting your job, you may continue to work or log out - job scheduling is completely independent of interactive work. The only way to stop a job after it has been submitted is to use the scancel command described below.

If you submit more than one job at the same time, you need to make sure that individual jobs (which may be executed simultaneously) do not interfere with each other, e.g. by writing to the same files.

The options tell the sbatch command how to behave: job name, use of main memory and run time, parallelization method, etc.

There are two ways of supplying these options to the sbatch command:

Method 1:

You may add the options directly to the sbatch command line, like:

sbatch -J job_name --ntasks=number_of_tasks --ntasks-per-node=number_of_tasks_per_node --cpus-per-task=number_of_cpus_per_task --mem-per-cpu=memory_per_cpu job_script [ argument ... ]
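
A concrete invocation with illustrative values (four tasks on one node, one CPU and 2 GB of memory per task) might look like this:

sbatch -J my_analysis --ntasks=4 --ntasks-per-node=4 --cpus-per-task=1 --mem-per-cpu=2G my_job.sh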

Method 2 (recommended):

Add the sbatch options to the beginning of your job_script, one option per line.

Note that the lines prefixed with #SBATCH are parsed by the sbatch command, but are treated as comments by the shell.

Taking the above example, the contents of job_script would look like:

#!/bin/bash

#SBATCH -J job_name             # --job-name=job_name is possible as well
#SBATCH --ntasks=number_of_tasks
#SBATCH --ntasks-per-node=number_of_tasks_per_node
#SBATCH --cpus-per-task=number_of_cpus_per_task
#SBATCH --mem-per-cpu=memory_per_cpu

./your_commands

If you give conflicting options both in the job file and on the sbatch command line, the command-line options take precedence. So you can use options in the job script to supply defaults that may be overridden on the sbatch command line.
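
For example, if job_script contains the line #SBATCH --time=1:00:00, the following call overrides this default and submits the job with a three-hour limit instead:

sbatch --time=3:00:00 job_script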

Overview of commonly used options to sbatch

Job Name, Input, Output, Working Directory
-J name or --job-name=name Name of the job.
Default: file name of the job script.
The job name is used in the default output of squeue (see below) and may be used in the filename pattern of the input, output and error files.
By default both standard output and standard error are directed to the same file. For normal jobs the default file name is slurm-%j.out, where "%j" is replaced by the job ID. For job arrays, the default file name is slurm-%A_%a.out, where "%A" is replaced by the job ID and "%a" by the array index.
The working directory of the job script is the directory where sbatch was called, unless the --chdir option is given.
-o filename pattern or --output=filename pattern
Standard output of the job script will be connected to the file specified by filename pattern.
Filename patterns may use the following place-holders (for a full list see the documentation of sbatch):
  • %x Job name.
  • %j Job-ID.
  • %t Task identifier (aka rank). This will create a separate file per task.
  • %N Short hostname. This will create a separate file per node.
Example: -o %x_%j_%N.out
-e filename pattern or --error=filename pattern Standard error of the job script will be connected to the file specified by filename pattern.
-i filename pattern or --input=filename pattern Standard input of the job script will be connected to the file specified by filename pattern. By default, "/dev/null" is connected to the script's standard input.
-D directory or --chdir=directory Execute the job in the specified working directory. Input/output file names are relative to this directory.
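
Combining these options, a job script header that names the job, writes separate output and error files per job, and sets the working directory could look like this (the directory is a placeholder):

#SBATCH -J my_analysis
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --chdir=/scratch/your_username/my_analysis
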
Notification
--mail-user=email address Notifications will be sent to this email address.
Default is to send mails to the local user submitting the job.
--mail-type=[TYPE|ALL|NONE] Send notifications for the specified type of events (default: NONE).
Possible values for TYPE are BEGIN, END, FAIL, REQUEUE, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50 and ARRAY_TASKS. Multiple types may be specified as a comma-separated list.
ALL is equivalent to BEGIN,END,FAIL,REQUEUE,STAGE_OUT.
TIME_LIMIT sends a notification when the time limit of the running job is reached. TIME_LIMIT_XX sends one when XX percent of the limit is reached.
When ARRAY_TASKS is specified, BEGIN, END and FAIL apply to each task in the job array (we strongly advise against using this)! Without it, these messages are sent for the array as a whole.
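
For example, to be notified when the job ends or fails, and additionally when 80 percent of the time limit is reached (the address is a placeholder):

#SBATCH --mail-user=your.name@example.com
#SBATCH --mail-type=END,FAIL,TIME_LIMIT_80
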
Resources
-t time or --time=time Set a limit on the run time (wall-clock time from start to termination) of the job. The default (= maximum) depends on the partition used.
When the time limit is reached, each task in each job step receives a TERM signal followed (after 30 seconds) by a KILL signal. So "trapping" the TERM signal and gracefully shutting down the script is possible, as shown in the sketch below.
Times may be specified as "minutes", "days-hours" or "[[days-]hours:]minutes[:seconds]".
If you know the runtime of your job beforehand, it is a good idea to specify it with this option, as this may result in an earlier start of your job: the scheduler can use short jobs to fill gaps in the schedule ("backfilling").
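
The sketch below shows how a job script can trap the TERM signal to shut down gracefully; the cleanup logic is a placeholder. Note that Bash runs a trap handler only after the current foreground command has finished, which is why the program is started in the background and waited for:

#!/bin/bash
#SBATCH --time=1:00:00

cleanup() {
    echo "Caught SIGTERM, cleaning up before the KILL signal arrives"
    # placeholder: write a checkpoint, flush output files, ...
    exit 0
}
trap cleanup TERM

# started in the background so that the signal interrupts "wait"
./your_long_running_program &
wait
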
--mem=size[K|M|G|T] Specify the memory required per node.

--mem-per-cpu=size[K|M|G|T] Specify the memory required per CPU.

--mem-per-gpu=size[K|M|G|T] Specify the memory required per GPU.

-n number or --ntasks=number Request resources for number tasks.
Without further options (see below) the tasks are placed on free resources on any node (nodes are "filled up").
-N n[-m] or --nodes=n[-m]
and
--ntasks-per-node=ntasks
Request at least n and up to m nodes with ntasks each. If only one number is given (and not a range) it is interpreted as exactly this number of nodes.
-c n or --cpus-per-task=n Tell Slurm that each task will require n CPUs. Default is one CPU per task.
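
As an illustration, the following header requests 2 nodes with 4 tasks each and 2 CPUs per task, i.e. 8 tasks and 16 CPUs in total:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
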
Job Arrays
-a 0-n[:step][%maxrunning] or --array=0-n[:step][%maxrunning] Trivial parallelisation using a job array. This will start independent instances of your job (so-called "array tasks"). When the tasks are run, each one has the environment variables
SLURM_ARRAY_TASK_COUNT
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
SLURM_ARRAY_TASK_STEP
set, whose values are, respectively, the total number of tasks your array consists of, the ID of the current task, the highest ID, the lowest ID, and the step (increment) between the IDs of the array.

Appending %maxrunning to the array specification allows you to specify a maximum number of simultaneously running tasks. E.g.
-a 0-9%4 will run ten tasks in total but only a maximum of four simultaneously.

Instead of a range of IDs you can also give a comma separated list of values.

The minimum task-ID is 0, the maximum is 75000.

If your array consists of substantially more than about 10 tasks, please do not use ARRAY_TASKS in --mail-type (see above).
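
A minimal job array sketch, assuming input files named input_0.txt through input_9.txt exist in the working directory (file and program names are placeholders):

#!/bin/bash
#SBATCH -J array_example
#SBATCH --array=0-9%4
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G

# each of the ten array tasks (at most four running at a time)
# processes the input file matching its task ID
./your_program input_${SLURM_ARRAY_TASK_ID}.txt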

Job script validation and start estimate
--test-only The job script is validated but not submitted. Additionally, an estimate is shown of when the job would be scheduled to run, given the current settings in the job script and on the command line.
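
For example:

sbatch --test-only job_script

Keep in mind that the reported start time is only an estimate based on the current state of the queue.
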
A note on maximum stack size

Our systems are configured so that the maximum allowed stack size of your programs is unlimited (unlike the default on most Linux systems, where it is limited to 8 MB). Most programs will not need this, but some will benefit from it.
There are edge cases where programs (e.g. some Fortran codes) will not work with an unlimited stack size. In that case, please limit the stack size in your job script before calling the program. For example, with

ulimit -s 80000

you set the limit to about 80 MB (80000 kB). This works because, as a user, you are always allowed to lower the limit.

Job submission examples

Parallel MPI jobs

The contents of your job script may look like this:
(If you just copy & paste this example, please be aware of line breaks and special characters.)

#!/bin/bash

# Name of your job.
#SBATCH -J name

# Send status information to this email address. 
#SBATCH --mail-user=Karl.Mustermann@xxx.com

# Send an e-mail when the job has finished or failed. 
#SBATCH --mail-type=END,FAIL

# In this example we are allocating resources for 100 MPI processes/tasks,
# placing exactly 10 tasks on each of 10 separate nodes like this:
#SBATCH --ntasks-per-node=10
#SBATCH --nodes=10
#SBATCH --cpus-per-task=1

# This is the same as above, but leaves the decision about how many
# tasks run on how many nodes to Slurm (aka "fillup"):
## #SBATCH --ntasks=100
## #SBATCH --cpus-per-task=1

# Specify the amount of memory given to each MPI process
# in the job.
#SBATCH --mem-per-cpu=1G

module purge
module load openmpi/xx.yy.zz

mpirun -np $SLURM_NTASKS ./your_mpi_executable [extra arguments]

Parallel OpenMP jobs

The contents of your job script may look like this:
(If you just copy & paste this example, please be aware of line breaks and special characters.)

#!/bin/bash

# Name of your job.
#SBATCH -J name

# Send status information to this email address. 
#SBATCH --mail-user=Karla.Musterfrau@xxx.com

# Send an e-mail when the job has finished or failed. 
#SBATCH --mail-type=END,FAIL

# Allocate one task on one node and six CPUs for this task
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6

# Allocate 12 Gigabytes for the whole node/task
#SBATCH --mem=12G

# tell OpenMP how many threads to start
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_openmp_executable

Important: We have configured Slurm to use control groups (cgroups) in order to limit your job's access to memory and CPUs.
If your job uses shared-memory parallelization other than OpenMP, you should check that the number of CPU-intensive threads is consistent with the number of cores assigned to the job.
If you start more threads than you requested with the --cpus-per-task option, they will be forced to run on the requested number of cores and will interfere with each other, possibly degrading the overall efficiency of your job.

Interactive jobs (srun --pty)

The submission of interactive jobs is useful in situations where a job requires some sort of direct intervention.

This is usually the case for X-Windows applications or in situations in which further processing depends on your interpretation of immediate results. A typical example for both of these cases is a graphical debugging session.

Note: Interactive sessions are also particularly helpful for getting acquainted with the system or when building and testing new programs.

Interactive sessions may be sequential or parallel:

Sequential (one CPU on one node):

srun --pty bash

Parallel, shared memory (n CPUs on one node):

srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=n --pty bash

Parallel, distributed memory (n CPUs on each of m nodes), either:

srun --ntasks-per-node=n --nodes=m --cpus-per-task=1 --pty bash

or

srun --ntasks=x --pty bash

with x being n * m.

In a multi-node parallel (aka distributed memory) interactive session you can use srun (or after loading an MPI module mpirun) to run programs on all nodes.
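
For example, after starting a distributed-memory session as shown above, you could verify the allocation from within the interactive shell like this (the module version is a placeholder):

module load openmpi/xx.yy.zz
mpirun hostname     # prints one line per task, showing the allocated nodes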

For using an X-Windows application you have to supply --x11 as a further option, e.g. like this:

srun --pty --x11 xterm -ls

Prepare your session as needed, e.g. by loading all necessary modules within the provided xterm, and then start your program on the executing node.

Note: Make sure to end your interactive session (logging out or closing the xterm window) as soon as it is no longer needed!

Monitoring jobs (squeue and sacct)

To get information about running or waiting jobs use

squeue [options]
or
sq [options]

The command squeue displays a list of all running or waiting jobs of all users.

We have configured your interactive shell so that, in addition, the command sq is available, which displays more fields than squeue does by default.
The (in our opinion) most interesting additional field is START_TIME which for pending jobs shows the date and time when Slurm plans to run this job. There is always the possibility that a job will start earlier but not (much) later.
Slurm calculates this field only once per minute so it might not contain a meaningful value right after submitting a job.

squeue and sq display the jobs of all users, which might not always be what you want. So we created another shortcut for you:
squ
which is a shorter way of typing sq -u $USER and thus lists only the jobs belonging to you.

To get information about past jobs use

sacct -X [options]
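
For example, to list your jobs since a given date together with a few useful fields (the date is a placeholder):

sacct -X --starttime=2024-01-01 --format=JobID,JobName,Partition,State,Elapsed,ExitCode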


Altering jobs (scontrol update)

You can change the configuration of pending jobs with

scontrol update JobId=jobid SETTING=VALUE [...]

To find out which settings are available, we recommend first running
scontrol show job jobid.

If, for example, you then want to change the run-time limit of your job to, say, three hours, you would use
scontrol update JobId=jobid TimeLimit=3:00:00

Some adaptations might require you to change more than one setting. If, e.g., your job is flexible with respect to the number of tasks and nodes used and you want to change those after having submitted the job, you would have to run
scontrol update JobId=jobid NumTasks=xx NumCPUs=xx NumNodes=y

Deleting jobs (scancel)

To delete pending or running jobs, you first have to look up their numerical job identifiers (job IDs). You can e.g. use squeue or squ and take the value(s) from the JOBID column.
Then execute

scancel job-id [...]

and the corresponding job will be removed from the wait queue or stopped if it's running.
You can specify more than one job identifier after scancel.
If you want to cancel all your pending and running jobs without being asked for confirmation, you may use
squ -h -o %i | xargs scancel
These options tell squeue to output only the JOBID column (-o %i) and to omit the column header (-h).
