SGE tutorial

Submitting jobs (qsub)

To use the compute nodes in our HPC clusters, you submit jobs to the batch job scheduling system SGE (formerly Sun Grid Engine, now its open source successor Son of Grid Engine). As a user you may want to become familiar with the following commands: qsub, qstat, qdel, qhost and qsh, which are briefly described in the following sections.

For more information see the respective man pages of these commands. Detailed information may be found at the Son of Grid Engine project page.

The command qsub is used to submit jobs to the batch-system. qsub uses the following syntax:

qsub [ -q std.q ] [options] job_script [ job_script_arguments ...]

where job_script represents the path to either a binary (in which case the qsub option -b y is required) or (preferably) a simple shell script containing the commands to be run on the remote cluster nodes.

The -q option specifies the name of the queue to be used. As of December 2016, std.q ist the default. You may list the available queues with qconf -sql.

The SGE will start your job on any nodes that have the necessary resources available or put your job in a queue until requested resources become available. SGE uses recent consumption of resources in deciding which queued jobs to start first ("fair share" scheduling). After submitting your job, you may continue to work or log out - job scheduling is completely independent of interactive work. The only way to stop a job after it has been submitted is to use the qdel command described below.

If you submit more than one job at the same time, you need to make sure that individual jobs (which may be executed simultaneously) do not interfere with each other, e.g. by writing to the same files.

Note that we currently only support Bash (/bin/bash). Your script may use common Bash functionality such as I/O redirection using the < and > characters, loops, case constructs etc., but please keep it simple. If your setup uses a different shell or needs a complex script, simply call your script from within the batch job script.

The options tell the qsub command how to behave: job name, where output is written, use of main memory and run time, parallelization method, etc.

There are two ways to supply options to the qsub command:

Method 1:

You may add the options directly to the qsub command line, like:

qsub -cwd -q std.q -N job_name job_script [ argument ... ]

Method 2 (recommended):

Add the qsub options to the beginning of your job_script, one option per line. These are automatically added to qsub to the qsub command line

qsub job_script [ argument ... ]

Note that the lines prefixed with #$ are parsed by the qsub command, but are treated as comments by the shell.

Taking above example, the contents of job_script would look like:

#!/bin/bash

#$ -q std.q
#$ -cwd
#$ -N job_name

./your_commands

If you give conflicting options both in the job file and the qsub command line, the command line options take precedence (exception: "-M" options accumulate). So you can use options in the job script to supply defaults that may be overriden in the qsub command line.

Overview of commonly used options to qsub

Queue

-q queuename

Select queue.
Get a listing of queues with qconf -sql,
display queue parameters with qconf -sq queuename.
Currently the following queues are defined:

std.q: This is the default. General purpose queue. Default/maximum runtime: 240 hours.
short.q: For small test jobs. Limited number of CPU slots. Default/maximum runtime: 10 hours.
bigmem.q: LEO3e and LEO4 only. Jobs with high main memory requirements. Will run on the nodes equipped with 512GB of memory. Default/maximum runtime: 240 hours.
mem3000.q: LEO4 only. Jobs with very high memory requirements. Will run on the 3TB node, which has 80 cores installed. This also means that you have 38 GB available per job slot.
gpustd.q: LEO4 only. Run the job on the GPU queue. Default/maximum run time: 96 hours. For details, see the GPU Node Documentation.

Job Name, Input, Output

-N name

Name the job.
Default: File name of script.
The job name is also reflected in the default file names for standard output and standard error (see below). We recommend to use a unique name, which makes cleanup of output files much easier.

The following two options (output redirection) have sensible defaults:
-o opath
-e epath

Standard output/error will be appended to file opath or epath, respectively.
Using these options is not recommended, in particular if you plan to start more than one job (interleaving and possible loss out output data).
Default:
name.ojob_id and name.ejob_id
Here, name is the job's name (see above), and the unique job_id is automatically created by the system for each job.

-j yes|no

Join standard error to standard output (yes or no; default: no)

-i path

Standard input file

-cwd

execute job in current working directory. If you omit this option, you job will execute in $HOME, which is usually a bad idea. Input/output file names are relative to this directory.

Notification

-M email.address@uibk.ac.at

Notifications will be sent to this email address. Note that external mail servers may reject these messages because they originate from the compute nodes, whose addresses cannot be resolved from outside the UIBK network.

-m [b|e|a|s|n]

notifications for any combination of the following events:
begin, end, abort, suspend, no mail (default)
Do not forget to specify an email address (with -M) if you want to get these notifications.
Mails sent by SGE may contain diagnostic information useful in the testing phase. Disable before submitting large numbers of production jobs.

Resources

-l h_rt=[hours:minutes:]seconds

requested real time (wallclock time from start to termination; the default (=maximum) depends on the system and, if applicable, the specified queue.

-l h_vmem=size[M|G]

request a per slot memory limit of size bytes / megabytes / gigabytes.
I.e. the requested memory in total is size multiplied by the number of requested slots. See the description of parallel environments below.

-l h_stack=size[m|g]

request a per slot stack size limit of size bytes / megabytes / gigabytes. This parameter is typically needed if your programs allocate large amounts of memory on the stack (e.g. large dynamically sized local variables in Fortran programs).

-hold_jid job-id

start job only if the job with the job id job-id has finished

Parallel jobs / parallel environments

-pe parallel-environment number-of-slots

If you run parallelized programs (MPI or shared memory), you need to specify a parallel environment and the number of processes/threads (= SGE slots) on which your parallel (MPI/OpenMP) application should run. By selecting a parallel environment you can also control how jobs are distributed across nodes. For a list of available parallel environments on the system execute:

qconf -spl

If you omit the -pe option, SGE assumes that your job is sequential.

Please note: The -pe option only reserves CPU-cores for your job. You need to make sure that your program actually starts as many processes or threads as you requested.

The following types of parallel-environment are available:

openmpi-Xperhost	Each host gets X processes (number-of-slots must be a multiple of X).
openmpi-fillup	The batchsystem fills up each available host with processes to its host process limit.
openmp	This environment should be chosen when working with threaded applications (e.g. OpenMP).

Job Arrays

-t 1-n
or (to continue a partially completed array)
-t m-n

Trivial parallelisation using a job array. Start n (or n-m+1) almost identical independent instances of your job (e.g. for extensive parameter studies). Individual job instances are started sequentially or concurrently as available resources permit. In your job setup, use the environment variable $SGE_TASK_ID, which is set to a unique integer value from 1 .. n (or m .. n), to distinguish between the individual job instances (use value e.g. as an index into a parameter table, to initialize a random number generator, select an input file, or compute parameter values).
A job array cannot have more than 75000 elements.
If the number of your job instances is substantially higher than about 10 please check that you have disabled mail notification (-m n - see above).

-tc p

optional: limit number of concurrent tasks to p

Other useful options

-w v

check whether the syntax of the job is okay (do not submit the job)

There are differences to consider between the various supported (parallel) programming models. The following examples illustrate the different procedures:

Sequential jobs

qsub job_script

where the contents of job_script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash
#$ -q std.q              # select standard nodes
#$ -N myjob              # output will go to  myjob.ojobid
#$ -l h_rt=10:00:00      # replace by your estimate for run time
#$ -l h_vmem=2G          # allocate 2GB of virtual memory
#$ -j yes                # stdout and stderr will go to  myjob.ojobid
#$ -cwd                  # start job in current working directory

### #$ -M Max.Mustermann@uibk.ac.at     # (debugging) send mail to address...
### #$ -m ea                            # ... at job end and abort

# your modules here
module load mymodule ..

echo STARTED on $(date)

# your commands here
mycommand ....

# display resource consumption
qstat -j $JOB_ID | awk 'NR==1,/^scheduling info:/'

echo FINISHED on $(date)

Parallel MPI jobs

qsub job_script

where the contents of job_scriptjob_script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash
#$ -q std.q              # select standard nodes
#$ -N myjob              # output will go to  myjob.ojobid
#$ -l h_rt=10:00:00      # replace by your estimate for run time
#$ -l h_vmem=2G          # allocate 2GB of virtual memory per MPI process
#$ -pe openmpi-fillup 4  # allocate 4 job slots on any available machines
#$ -j yes                # stdout and stderr will go to  myjob.ojobid
#$ -cwd                  # start job in current working directory

### #$ -M Max.Mustermann@uibk.ac.at     # (debugging) send mail to address...
### #$ -m ea                            # ... at job end and abort

# your modules here
module load openmpi/x.y.z .....

echo STARTED on $(date)

export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1

# your commands here
mpirun -np $NSLOTS mycommand ....

# display resource consumption
qstat -j $JOB_ID | awk 'NR==1,/^scheduling info:/'

echo  FINISHED on $(date)

Parallel OpenMP jobs

qsub job_script

where the contents of job_script may look like this:
(if you just copy&paste this example please be aware of line breaks and special characters)

#!/bin/bash
#$ -q std.q              # select standard nodes
#$ -N myjob              # output will go to  myjob.ojobid
#$ -l h_rt=10:00:00      # replace by your estimate for run time
#$ -l h_vmem=2G          # allocate 2GB per thread
#$ -pe openmp 4          # allocate 4 job slots
#$ -j yes                # stdout and stderr will go to  myjob.ojobid
#$ -cwd                  # start job in current working directory
#
### #$ -M Max.Mustermann@uibk.ac.at     # (debugging) send mail to address...
### #$ -m ea                            # ... at job end and abort

# your modules here
module load .....

echo STARTED on $(date)

export MKL_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS=$NSLOTS

# your commands here
mycommand ....

# display resource consumption
qstat -j $JOB_ID | awk 'NR==1,/^scheduling info:/'

echo FINISHED on $(date)

Important: If your job uses shared memory parallelization other than OpenMP, you will still use the -pe openmp environment, but you need to ensure that the number of CPU-intensive threads is consistent with the number of slots assigned to the job ($NSLOTS). If you start more threads than you requested in the -pe directive, these may interfere with other users' processes, possibly degrading the overall efficiency of large parts of the system. Many parallel programs by default automatically discover the number of cores installed on the system and will start as many threads. You will need to find out how to override this behaviour (quite software-dependent).

Note that there is no use asking for more processes than are available on the largest machines in the cluster. This will result in SGE's failure to ever start the job.

Submitting interactive jobs (qsh)

The submission of interactive jobs is useful in situations where a job requires some sort of direct intervention. This is usually the case for X-Windows applications or in situations in which further processing depends on your interpretation of immediate results. A typical example for both of these cases is a graphical debugging session.

Note: Interactive sessions are particularly helpful for getting acquainted with the system or when building and testing new programs.

The only supported method for interactive sessions on the cluster is currently to start an interactive X-Windows session via the SGE's qsh command. This will bring up an xterm from the executing node with the display directed either to the X-server indicated by your actual DISPLAY environment variable or as specified with the -display option. Try qsh -help for a list of allowable options to qsh. You can also force qsh to use the options specified in an optionfile with qsh -@ optionfile. A valid optionfile might contain the following lines:

# Select queue
-q std.q

# Name your job
-N my_name

# Export some of your current environment variables
-v var1[=val1],var2[=val2],...

# Use the current directory as working directory
-cwd

Interactive jobs are not spooled if the necessary ressources are not available, so either your job is started immediately or you are notified to try again later. Also, interactive jobs will always fail if the implemented transient slot limits (see the section "Slot limitations" in the resource requirements and limitations tutorial for more information) are exceeded. In such cases, submit your interactive session with the option

qsh -now no [...]

Note: Make sure to end your interactive sessions as soon as they are no longer needed!

Interactive sequential jobs

Start an interactive session for a sequential program simply by executing

qsh

Prepare your session as needed, e.g. by loading all necessary modules within the provided xterm and then start your sequential program on the executing node.

Interactive parallel jobs

For a parallel program execute

qsh -pe parallel-environment number-of-slots

with the SGE's parallel environment of your choice (see the list of available parallel environments with qconf -spl) and the number of processes/threads you intend to use. This is not different from submitting a parallel job with qsub.
Start your parallel MPI program as depicted within the script.sh files for parallel MPI batch jobs above. For OpenMP jobs export the OMP_NUM_THREADS variable with export OMP_NUM_THREADS=$NSLOTS and start your job.

Monitoring jobs (qstat)

To get information about running or waiting jobs use

qstat [options]

To shorten the output of qstat execute either qstat|grep -v hqw to filter all pending jobs in hold state, or qstat -s r to display the running jobs only.

Other options of qstat:

-u user Typical usage: -u $USER	Print all jobs of a given user, print all my jobs..
-j job-id	Prints full information of the job with the given job-id.
-f	Prints all queues and jobs.
-help	Prints all possible qstat options.

In case of pending jobs, you might also get some hints on why your job with the job identifier job-id is still waiting in queue, by executing

qalter -w p job-id

You can also verify a submitted job with

qalter -w v job-id

If the previous command delivers the following message, there's something wrong with the job and it will never be able to run:

verification: no suitable queues

Deleting a job (qdel)

To delete a job with the job identifier job-id, execute

qdel job-id