Slurm tutorial
To use the compute nodes in our Slurm clusters (HPC Cluster LEO5 and teaching cluster LCC2), you submit jobs to the batch job scheduling system Slurm (unlike LEO3e and LEO4, which still use SGE).
As a user you may want to become familiar with the following commands: sbatch, srun, scancel, squeue, sacct and scontrol, which are briefly described in the following sections.
For more information see the respective man pages of these commands. Detailed information may be found in the Slurm documentation pages.
If you want to reuse existing work, we have installed a tool that converts SGE job scripts to their Slurm equivalent.
Call it like this:
/usr/site/hpc/bin/sge2slurm.py sge-jobscript.sge > slurm-jobscript.slurm
Important differences between Slurm and SGE
Essentially, SGE and Slurm are very similar, as both are batch schedulers. Apart from the different commands and their options (described below), there are these important design differences between our SGE setup and our Slurm configuration:
- Memory: the memory-related options of Slurm apply to the amount of "real" memory (resident set size, RSS) that programs use. In SGE you request virtual memory (-l h_vmem). When moving to Slurm you may therefore be able to reduce your memory reservations considerably, since some programs allocate much more virtual memory than they actually use as RSS.
- Resource over-consumption: we configured Slurm so that, when your program uses more resources (CPUs and memory) than you requested in your job options, it behaves differently from SGE:
  - If a program exceeds its memory limit, it is not terminated but continues to run. However, parts of its memory are paged out to (and back in from) disk, which slows things down extremely, so you should definitely avoid this.
  - Slurm uses so-called control groups to confine your programs to the requested number of cores (per node). If one of your programs spawns more processes or threads than the number of cores you requested, it will not impair the performance of other jobs on the same node, but your job may run more slowly than expected.
- The default run time on LEO5 is 3 days and may be extended to 10 days if needed. Please note that excessive run times negatively affect the overall responsiveness of the cluster.
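To get a feeling for how far virtual and resident memory can differ, you can compare both values for a running process. The following is a minimal sketch, assuming a Linux system with a /proc filesystem; the function name mem_usage is only illustrative and not part of Slurm or of our installation.

```shell
#!/bin/bash
# Print the virtual size (VmSize) and resident set size (VmRSS) of a process in kB.
# Assumption: Linux /proc filesystem. mem_usage is a hypothetical helper name.
mem_usage() {
    pid=${1:-$$}
    awk '/^VmSize:/ { vsz = $2 } /^VmRSS:/ { rss = $2 } END { print vsz, rss }' \
        "/proc/$pid/status"
}

# Inspect the current shell itself.
out=$(mem_usage "$$")
vsz_kb=${out% *}
rss_kb=${out#* }
echo "virtual: ${vsz_kb} kB, resident: ${rss_kb} kB"
```

The resident value is the one that corresponds to Slurm's memory options, so it is the better starting point when choosing a memory request.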
Submitting jobs (sbatch)
The command sbatch is used to submit jobs to the batch-system using the following syntax:
sbatch [options] [job_script.slurm [ job_script_arguments ...]]
where job_script.slurm represents the (relative or absolute) path to a simple shell script containing the commands to be run on the cluster nodes. We recommend using the suffix .slurm to distinguish job scripts from scripts intended for other uses. If no file is specified, sbatch will read a script from standard input.
The first line of this script must start with #! followed by the path to an interpreter, for instance #!/bin/sh or #!/bin/bash (or any other available shell of your choice), but note that we currently only support Bash (/bin/bash). Your script may use common Bash functionality such as I/O redirection using the < and > characters, loops, case constructs etc., but please keep it simple. If your setup uses a different shell or needs a complex script, simply call your script from within the batch job script.
Slurm will start your job on any nodes that have the necessary resources available or put your job in a waiting queue until requested resources become available. Slurm uses recent resource consumption in deciding which waiting jobs to start first (fair share scheduling). After submitting your job, you may continue to work or log out - job scheduling is completely independent of interactive work. The only way to stop a job after it has been submitted is to use the scancel command described below.
If you submit more than one job at the same time, you need to make sure that individual jobs (which may be executed simultaneously) do not interfere with each other by e.g. writing to the same files.
The options tell the sbatch command how to behave: job name, use of main memory and run time, parallelization method, etc.
There are two ways of supplying these options to the sbatch command:
Method 1:
You may add the options directly to the sbatch command line, like:
sbatch --job-name=job_name --ntasks=number_of_tasks --cpus-per-task=number_of_cpus_per_task --mem-per-cpu=memory_per_cpu job_script.slurm [ argument ... ]
Method 2 (recommended):
Add the sbatch options to the beginning of your job_script, one option per line.
Note that the lines prefixed with #SBATCH are parsed by the sbatch command, but are treated as comments by the shell.
Taking the above example, the contents of job_script.slurm would look like:
#!/bin/bash
#SBATCH --job-name=job_name
#SBATCH --ntasks=number_of_tasks
#SBATCH --cpus-per-task=number_of_cpus_per_task
#SBATCH --mem-per-cpu=memory_per_cpu
./your_commands
If you give conflicting options both in the job file and on the sbatch command line, the command line options take precedence. So you can use options in the job script to supply defaults that may be overridden on the sbatch command line.
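That the shell treats #SBATCH lines as plain comments can be checked without submitting anything. The sketch below writes a small job script to a temporary file and runs it directly with Bash; the script contents and the job name demo are made up for illustration, and the sbatch call is only shown as a comment.

```shell
#!/bin/bash
# Write a small job script to a temporary file (name chosen by mktemp).
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
echo "hello from ${SLURM_JOB_NAME:-a plain shell}"
EOF

# Run it directly: the shell skips the #SBATCH lines as comments.
local_output=$(bash "$job_script")
echo "$local_output"

# On the cluster you would submit it instead. An option given on the
# command line overrides the value set in the script, for example:
#   sbatch --job-name=other_name "$job_script"
rm -f "$job_script"
```

Run outside of Slurm, the variable SLURM_JOB_NAME is unset, so the fallback text is printed.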
Overview of commonly used options to sbatch
Job Name, Input, Output, Working Directory
--job-name=name
Name of the job. Default: file name of the job script. The job name is used in the default output of squeue (see below) and may be used in the filename pattern of the input, output and error files.
--output=filename_pattern
Standard output of the job script will be connected to the file specified by filename_pattern. By default, both standard output and standard error are directed to the same file. For normal jobs the default file name is slurm-%j.out, where "%j" is replaced by the job ID. For job arrays the default file name is slurm-%A_%a.out, where "%A" is replaced by the job ID and "%a" by the array index. The working directory of the job script is the current working directory (where sbatch was called) unless the --chdir argument is given. Filename patterns may use the following placeholders (for a full list see the documentation of sbatch):
- %j: job ID
- %A: job ID of the job array
- %a: index of the task within the job array
- %x: job name
Please note: two jobs using the same output file name will clobber each other's output. Use the default or make sure that your filename_pattern includes %j.
--error=filename_pattern
Standard error of the job script will be connected to the file specified by filename_pattern as described above.
--input=filename_pattern
Standard input of the job script will be connected to the file specified by filename_pattern. By default, "/dev/null" is connected to the script's standard input.
--chdir=directory
Execute the job in the specified working directory. Input/output file names are relative to this directory. Default: current working directory of the sbatch command.
Notifications
--mail-user=email_address
Notifications will be sent to this email address. Default is to send mails to the local user submitting the job.
--mail-type=[TYPE|ALL|NONE]
Send notifications for the specified type of events (default: NONE). Possible values for TYPE are BEGIN, END, FAIL, REQUEUE, STAGE_OUT, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50, ARRAY_TASKS. Multiple types may be specified as a comma-separated list. ALL is equivalent to BEGIN,END,FAIL,REQUEUE,STAGE_OUT. TIME_LIMIT sends a notification when the time limit of the running job is reached. TIME_LIMIT_XX sends one when XX percent of the limit is reached. When ARRAY_TASKS is specified, BEGIN, END and FAIL apply to each task in the job array (we strongly advise against using this!). Without it, these messages are sent for the array as a whole.
Time Limits
--time=time
Set a limit on the run time (wall-clock time from start to termination) of the job. The default depends on the partition used. When the time limit is reached, each task in each job step receives a TERM signal followed (after 30 seconds) by a KILL signal, so "trapping" the TERM signal and gracefully shutting down the script is possible. Times may be specified as minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes or days-hours:minutes:seconds. If you know the run time of your job beforehand, it is a good idea to specify it with this option, as this helps the scheduler plan its resources and may result in an earlier start of your job.
Memory Allocation
--mem=size[K|M|G|T]
Specify the memory required per node.
--mem-per-cpu=size[K|M|G|T]
Specify the memory required per CPU.
--mem-per-gpu=size[K|M|G|T]
Specify the memory required per GPU.
Nodes, Tasks, and CPUs
--ntasks=ntasks
Request resources for a total number of ntasks tasks. Without further options (see below) the tasks are placed on free resources on any node (nodes are "filled up"). For MPI jobs, tasks map to MPI ranks. Slurm will set the environment variable $SLURM_NTASKS to the number ntasks that you requested.
--nodes=n[-m] --ntasks-per-node=ntasks
Request at least n and up to m nodes with ntasks tasks each. If only one number is given (and not a range), it is interpreted as exactly this number of nodes. Please note: unless you have a good reason to explicitly control the placement of tasks, do not use these options; for best results let the system decide.
--cpus-per-task=ncpus
Tell Slurm that each task will require ncpus CPUs. Default is one CPU per task. This is the level at which multithreading (e.g. POSIX threads or OpenMP threads) is specified. Slurm will set the environment variable $SLURM_CPUS_PER_TASK to the number ncpus that you requested. MPI + OpenMP hybrid jobs are natively supported by simultaneously setting ntasks and ncpus to values greater than 1.
GPUs
--gpus=[type:]number
Request number GPUs, optionally of type type. On LEO5, type is one of a30 or a40. Each GPU node has two GPUs installed.
Job Arrays
--array=m-n[:step][%maxrunning]
Trivial parallelisation using a job array. Without a step, this starts n-m+1 independent instances of your job (so-called "array tasks") with task IDs ranging from m to n inclusive. At run time, each task has environment variables set that identify it, including SLURM_ARRAY_JOB_ID (the job ID of the array) and SLURM_ARRAY_TASK_ID (the ID of this task). Appending %maxrunning to the array specification allows you to specify a maximum number of simultaneously running tasks. Instead of a range of IDs you can also give a comma-separated list of values. The minimum task ID is 0, the maximum is 75000. If the number of your job instances is substantially higher than about 10, please do not use ARRAY_TASKS in --mail-type (see above).
Job script validation and start estimate
--test-only
The job script is validated but not submitted. Additionally, an estimate is shown of when the job would be scheduled to run with the current settings given in the job script and on the command line.
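Inside a job array, each task usually uses its task ID to pick its own share of the work. The following sketch assumes a hypothetical naming scheme (input_001.dat, input_002.dat, ...) and a placeholder program; only the #SBATCH options and the variable SLURM_ARRAY_TASK_ID come from Slurm itself.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-20%5
#SBATCH --ntasks=1

# Outside of Slurm the variable is unset, so default to 1. This allows
# the script to be tried locally before it is submitted.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# Hypothetical naming scheme for per-task input and output files.
input_file=$(printf 'input_%03d.dat' "$task_id")
output_file=$(printf 'result_%03d.out' "$task_id")

echo "task ${task_id}: would process ${input_file} -> ${output_file}"
# Placeholder for the real work:
# ./your_program "$input_file" > "$output_file"
```

With --array=1-20%5 as above, 20 tasks are created and at most 5 of them run at the same time.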
A note on maximum stack size
Our systems are configured so that the maximum allowed stack size of your programs is unlimited (unlike the default on most Linux systems, where it is limited to 8 MB). Most programs do not need this, but some benefit from it.
There are edge cases in which programs (possibly some Fortran programs) do not work with an unlimited stack size. In that case, please limit the stack size in your job script before calling that program. For example,
ulimit -s 80000
sets the limit to about 80 MB (80000 kB). This works because, as a user, you are allowed to lower the limit at any time.
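If only one program in your job script needs a limited stack, the limit can be set in a subshell so that the rest of the script keeps its original setting. This is a minimal sketch; run_with_stack_limit is a hypothetical helper name, and the value 4096 kB is chosen only so that the example also works on systems with the usual 8 MB default.

```shell
#!/bin/bash
# Run a command with a lowered stack limit without changing the limit of
# the calling shell. run_with_stack_limit is a hypothetical helper name.
run_with_stack_limit() {
    limit_kb=$1
    shift
    # The parentheses start a subshell; the limit ends with it.
    ( ulimit -s "$limit_kb" && "$@" )
}

before=$(ulimit -s)
inside=$(run_with_stack_limit 4096 sh -c 'ulimit -s')
after=$(ulimit -s)

echo "limit before: ${before}, inside: ${inside}, after: ${after}"
```

In a job script you would call your own program in place of the sh -c test command.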
Job submission examples
Parallel MPI job
The contents of your job script may look like this:
(If you copy and paste this example, please watch out for line breaks and special characters.)
#!/bin/bash
# Name of your job.
#SBATCH --job-name=name
# Send status information to this email address.
#SBATCH --mail-user=Karl.Mustermann@xxx.com
# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL
# Start an MPI job with 80 single-threaded tasks.
#SBATCH --ntasks=80
#SBATCH --cpus-per-task=1
# In this example we allocate resources for 80 MPI processes/tasks.
# The commented-out lines below would instead place exactly 10 tasks
# on each of 8 separate nodes:
## #SBATCH --ntasks-per-node=10
## #SBATCH --nodes=8
## #SBATCH --cpus-per-task=1
# Do this only when you have a good reason to explicitly control
# task placement.
# Specify the amount of memory given to each MPI process
# in the job.
#SBATCH --mem-per-cpu=1G
module purge
module load openmpi/xx.yy.zz
mpirun -n $SLURM_NTASKS ./your_mpi_executable [extra arguments]
Parallel OpenMP jobs
The contents of your job script may look like this:
(If you copy and paste this example, please watch out for line breaks and special characters.)
#!/bin/bash
# Name of your job.
#SBATCH --job-name=name
# Send status information to this email address.
#SBATCH --mail-user=Karla.Musterfrau@xxx.com
# Send an e-mail when the job has finished or failed.
#SBATCH --mail-type=END,FAIL
# Allocate one task on one node and six CPUs for this task.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
# Allocate 12 gigabytes for the whole node/task.
#SBATCH --mem=12G
# Tell OpenMP how many threads to start.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_openmp_executable
Important:
We have configured Slurm to use control groups in order to limit the access of your job to memory and CPUs.
If your job uses shared memory parallelization other than OpenMP, you should check that the number of CPU-intensive threads is consistent with the number of CPUs assigned to the job.
If you start more threads than you requested with the --cpus-per-task option, they will be forced to run on the requested number of cores and will interfere with each other, possibly degrading the overall efficiency of your job.
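Job that shuts down gracefully at the time limit
As described for the --time option, a TERM signal is sent before the final KILL, which gives a script the chance to save its state. The following is only a sketch of the general pattern: the work loop and its three steps are hypothetical, and outside of Slurm (when SLURM_JOB_ID is unset) the script sends the TERM signal to itself so that the handler can be observed.

```shell
#!/bin/bash
#SBATCH --job-name=graceful_stop
#SBATCH --time=00:10:00
#SBATCH --ntasks=1

stop_requested=0

# Handler for the TERM signal: only set a flag, the main loop reacts to it.
on_term() {
    stop_requested=1
}
trap on_term TERM

# Hypothetical work loop: each iteration stands for one unit of work.
step=0
while [ "$step" -lt 3 ] && [ "$stop_requested" -eq 0 ]; do
    step=$((step + 1))
    echo "working on step ${step}"
    # Demonstration only: outside of Slurm, simulate the TERM signal after step 2.
    if [ -z "${SLURM_JOB_ID:-}" ] && [ "$step" -eq 2 ]; then
        kill -TERM $$
    fi
done

if [ "$stop_requested" -eq 1 ]; then
    echo "TERM received after step ${step}: saving state and exiting"
fi
```

Note that Bash runs a trap handler only after the currently running foreground command has returned, so the work should be split into units that finish well within the 30 seconds before the KILL signal.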
Interactive jobs (srun --pty)
The submission of interactive jobs is useful in situations where a job requires some sort of direct intervention.
This is usually the case for X-Windows applications or in situations in which further processing depends on your interpretation of immediate results. A typical example for both of these cases is a graphical debugging session.
Note: Interactive sessions are also particularly helpful for getting acquainted with the system or when building and testing new programs.
Interactive sessions might be sequential or parallel:
Sequential (one CPU on one node):
srun --pty bash
Parallel, shared memory (n CPUs on one node):
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=n --pty bash
Parallel, distributed memory (n CPUs on each of m nodes), either
srun --ntasks-per-node=n --nodes=m --cpus-per-task=1 --pty bash
or
srun --ntasks=x --pty bash
with x being n * m.
In a multi-node parallel (aka distributed memory) interactive session you can use srun (or, after loading an MPI module, mpiexec) to run programs on all nodes.
For using an X-Windows application, supply --x11 as a further option, e.g. like this:
srun --pty --x11 xterm -ls
Prepare your session as needed, e.g. by loading all necessary modules within the provided xterm and then start your program on the executing node.
Note: Make sure to end your interactive session (logging out or closing the xterm window) as soon as it is no longer needed!
Monitoring jobs (squeue, srun and sacct)
To get information about running or waiting jobs use
squeue [options]
or
sq [options]
The command squeue displays a list of all running or waiting jobs of all users.
The locally implemented command sq displays more fields than squeue does by default.
The (in our opinion) most interesting additional field is START_TIME which for pending jobs shows the date and time when Slurm plans to run this job. It is always possible that a job will start earlier but not (much) later.
Slurm calculates this field only once per minute so it might not contain a meaningful value right after submitting a job.
Since squeue and sq display the jobs of all users, which is not always what you want, we created another shortcut:
squ
which is a shorter way of typing sq -u $USER and thus lists only the jobs belonging to you.
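When you have many jobs, a short summary per job state can be easier to read than the full list. The sketch below counts jobs per state from lines of the form "jobid state", as produced by the squeue format string '%i %T'. The function name count_job_states and the sample data are made up for illustration; on the cluster you would pipe real squeue output into the function as shown in the comment.

```shell
#!/bin/bash
# Count jobs per state. Expects lines of the form "<jobid> <state>" on
# standard input. count_job_states is a hypothetical helper name.
count_job_states() {
    awk '{ n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster:
#   squeue -h -u "$USER" -o '%i %T' | count_job_states

# Made-up sample input so the function can be tried without Slurm.
sample='1001 RUNNING
1002 PENDING
1003 PENDING'
summary=$(printf '%s\n' "$sample" | count_job_states)
echo "$summary"
```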
You can further inspect a running job by "connecting" to it with this command:
srun --jobid=jobid --pty bash
This will open an interactive shell as a job step under an already allocated job. I.e. you will be able to see how your job is "behaving". For distributed memory jobs you will get a shell at the first node used by your job.
To get information about past jobs use
sacct -X [options]
Altering jobs (scontrol update)
You can change the configuration of pending jobs with
scontrol update job jobid SETTING=VALUE [...]
To find out which settings are available we recommend to first run
scontrol show job jobid.
For example, to change the run-time limit of your job to three hours, use
scontrol update job jobid TimeLimit=3:00:00
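Slurm reads a time specification with two colon-separated fields as minutes:seconds and one with three fields as hours:minutes:seconds, so it is easy to request minutes where hours were meant. The following sketch converts a specification to seconds so that a value can be checked before it is used; slurm_time_to_seconds is a hypothetical helper, not a Slurm command, and it does no input validation.

```shell
#!/bin/bash
# Convert a Slurm time specification to seconds. Supported forms:
# minutes, minutes:seconds, hours:minutes:seconds, days-hours,
# days-hours:minutes, days-hours:minutes:seconds.
# slurm_time_to_seconds is a hypothetical helper name.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        spec = $0; days = 0; h = 0; m = 0; s = 0
        dash = index(spec, "-")
        if (dash > 0) {
            # With a day part, the remaining fields start at hours.
            days = substr(spec, 1, dash - 1)
            n = split(substr(spec, dash + 1), t, ":")
            h = t[1]
            if (n >= 2) m = t[2]
            if (n >= 3) s = t[3]
        } else {
            # Without a day part, one or two fields start at minutes.
            n = split(spec, t, ":")
            if (n == 1)      { m = t[1] }
            else if (n == 2) { m = t[1]; s = t[2] }
            else             { h = t[1]; m = t[2]; s = t[3] }
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

for spec in 90 3:00 3:00:00 1-12; do
    echo "${spec} = $(slurm_time_to_seconds "$spec") seconds"
done
```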
Some adaptations might require you to change more than one setting. If, for example, your job is flexible with respect to the number of tasks and nodes used and you want to change those after having submitted the job, you would have to run
scontrol update job jobid NumTasks=xx NumCPUs=xx NumNodes=y
Deleting jobs (scancel)
To delete pending or running jobs you have to look up their numerical job identifiers (job IDs). You can, for example, use squeue or squ and take the value(s) from the JOBID column.
Then execute
scancel job-id [...]
and the corresponding jobs will be removed from the wait queue or stopped if they are running.
You can specify more than one job identifier after scancel.
If you want to cancel all your pending and running jobs without being asked for confirmation, you may use
squ -h -o %i | xargs scancel
These options tell squeue to output only the JOBID column (-o %i) and to omit the column header (-h).