System: Leo3, Leo3e | Mach | VSC3
Operating System: Leo3: CentOS 6.3; Leo3e: CentOS 7.1; Mach: SuSE SLES 11.1; VSC3: Scientific Linux 6.6
Architecture
Leo3, Leo3e: Infiniband Cluster
Leo3:
162 nodes (physical machines)
12 CPUs/node (2 sockets @ 6 cores)
1944 CPUs total
24 GB/node
2 GB/CPU
Leo3e:
45 nodes (physical machines)
20 CPUs/node (2 sockets @ 10 cores)
900 CPUs total
64 (nodes 44+45: 512) GB/node
3.2 (nodes 44+45: 25.6) GB/CPU
Mach: ccNUMA SMP
1 node = 256 vnodes (virtual nodes = processor sockets)
8 CPUs/vnode (8 cores per socket)
2048 CPUs total
8 GB/CPU
64 GB/vnode
16 TB total
VSC3: Infiniband Cluster
approx. 2000 nodes
16 CPUs/node (2 sockets @ 8 cores)
32000 CPUs total
Standard nodes: 64 GB → 4 GB/CPU
Big nodes: 128 or 256 GB → 8/16 GB/CPU
Topology
Leo3 (Leo3e): 7 (2) units with up to 24 nodes each. Blocking factor 1:2 between units.
Mach: Fat Tree. Batch scheduler allocates vnodes on a best-fit basis.
VSC3: 8 islands, each consisting of up to 24 units with 12 nodes each (i.e. up to 2304 nodes total). Blocking factor 1:2 within islands, 1:4 between islands.
File System Architecture
Leo3, Leo3e:
$HOME: Home directory /home/group/user, shared, backup
$SCRATCH: Scratch directory /scratch/user, shared, no backup
Mach:
$HOME: Home directory /home/group/user, backup
$SCRATCH: Scratch directory /scratch/group/user, no backup
VSC3:
$HOME: Home directory /home/project/user, shared, no backup
$SCRATCH: Local scratch directory /scratch, specific to each node, no backup, automatic deletion some time after the job terminates
$GLOBAL: Global scratch directory /global/project/user, shared, no backup
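The scratch areas are intended for job I/O. Below is a minimal sketch (assuming only the environment variables described above; the file names mydata.in, myresult.out and the program mysim are placeholders) of the shell commands in a batch script that stage data to the scratch area, run there, and copy results back before the job ends:

# Stage input data from $HOME to a job-private scratch directory
WORKDIR="$SCRATCH/myjob.$$"              # placeholder directory name
mkdir -p "$WORKDIR"
cp "$HOME/mydata.in" "$WORKDIR/"
cd "$WORKDIR"

# Run the (placeholder) program in the scratch area
"$HOME/bin/mysim" mydata.in > myresult.out

# Copy results back: scratch has no backup, and on VSC3 the node-local
# $SCRATCH is deleted some time after the job terminates
cp myresult.out "$HOME/"
cd "$HOME" && rm -rf "$WORKDIR"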
Job Scheduler
Leo3, Leo3e: SGE
Mach: PBS Pro
VSC3: SLURM
Allocation Granularity
Leo3, Leo3e: 1 CPU.
Leo3: 1 node = 12 CPUs + 24 GB memory
Leo3e: 1 node = 20 CPUs + 64 GB (nodes 44+45: 512 GB) memory
Note that several jobs from different users may run on the same node, so these systems are suitable for the entire range from small sequential programs and multithreaded (e.g. OpenMP) programs with up to 12 (Leo3) or 20 (Leo3e) parallel threads, up to relatively large parallel MPI jobs using hundreds of CPUs. Please specify realistic memory requirements to assist job placement (see the sketch after this block).
Mach: Multiples of 1 vnode (= 8 CPUs, 64 GB memory each)
Mach is a special machine for large parallel jobs with high memory demands. If you need little memory or fewer than 8 CPUs per program run, please consider using another system.
VSC3: Multiples of 1 node (= 16 CPUs + 64 GB, 128 GB or 256 GB memory each)
Each node is assigned to a job exclusively. Jobs can span many nodes, up to very large parallel MPI jobs using many hundreds of CPUs. If your individual programs cannot profit from the minimum degree of parallelism (16 threads or tasks), please consider employing a job farming scheme (description by LRZ; please note that their conventions are different) or using a different system.
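To illustrate the different granularities, here is a sketch of submit commands (jobscript.sh and all numbers are placeholders; the options are the ones documented in the directive sections below):

# Leo3/Leo3e (SGE): 4 CPUs with 2 GB per slot (h_vmem is per slot, so 4 x 2 GB = 8 GB total)
qsub -pe openmp 4 -l h_vmem=2G jobscript.sh

# Mach (PBS Pro): one vnode = 8 CPUs, the smallest sensible request on this machine
qsub -l select=1:ncpus=8 jobscript.sh

# VSC3 (Slurm): one full node = 16 CPUs on the default 64 GB partition
sbatch -N 1 -p mem_0064 jobscript.sh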
Job Submission
Leo3, Leo3e (SGE): qsub scriptfile
Mach (PBS Pro): qsub scriptfile
VSC3 (Slurm): sbatch scriptfile
Query Jobs
Leo3, Leo3e (SGE):
qstat -u $USER
List all my jobs
qstat -j jobid
Detailed information on jobid
Mach (PBS Pro):
qstat [-wTx] -u $USER
List all my jobs [-w wide format, -T estimated start time, -x include finished jobs]
qstat [-x] -f jobid
Detailed information on jobid [-x if finished]
VSC3 (Slurm):
squeue -u $USER
List all my jobs
squeue -j jobid -o '%all'
scontrol [-dd] show job jobid
Detailed information on jobid
Cancel Jobs
Leo3, Leo3e (SGE): qdel jobid
Mach (PBS Pro): qdel jobid
VSC3 (Slurm): scancel jobid
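A typical submit/query/cancel round trip on VSC3 looks as follows (a sketch; the job id 12345 and the file name jobscript.sh are placeholders; on Leo3/Leo3e and Mach the equivalent uses qsub, qstat and qdel as listed above):

sbatch jobscript.sh        # submit; Slurm prints the assigned job id
squeue -u $USER            # list my jobs and their states
scontrol show job 12345    # detailed information on the job (placeholder id)
scancel 12345              # cancel the job if it is no longer needed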
Format of batch script file
All systems permit supplying processing options either as command line parameters of the submit command (qsub/sbatch) or as directives in the batch script file (recommended and described below); see the sketch after this list.
Format of command line options: [ option [parameters] ] [...] - multiple options in one command line
Format of directives: prefix option [parameters] - separate line for each option
The prefix depends on the batch system: #$ (SGE - leo3, leo3e), #PBS (PBS Pro - mach), #SBATCH (Slurm - vsc3)
Options are documented in the man page of the respective submit command (man qsub or man sbatch).
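For example (a sketch; myjob and jobscript.sh are placeholder names), the same job name can be given either on the command line or as a directive inside the script, shown here for SGE (Leo3, Leo3e) and Slurm (VSC3):

# Command line form
qsub -N myjob jobscript.sh       # SGE
sbatch -J myjob jobscript.sh     # Slurm

# Equivalent directive form inside jobscript.sh
#$ -N myjob                      # SGE
#SBATCH -J myjob                 # Slurm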
General Scheduler Directives
Leo3, Leo3e (SGE):
#!/bin/bash
#$ -N jobname (optional)
#$ -o outfile (default: jobname.ojobid)
#$ -e errfile (default: jobname.ejobid)
#$ -j yes|no  (join stderr to stdout)
#$ -cwd  (run job in current directory)
         (default: $HOME)
Mach (PBS Pro):
#!/bin/bash
#PBS -N jobname (optional)
#PBS -o outfile (default: jobid.o)
#PBS -e errfile (default: jobid.e)
#PBS -j oe|eo|n (join stderr to stdout | stdout to stderr | no join)
# after last directive
cd $PBS_O_WORKDIR  (run job in current directory)
                   (default: $HOME)
VSC3 (Slurm):
#!/bin/bash
#SBATCH -J jobname   (optional)
#SBATCH -o outfile   (default: slurm-%j.out)
                     (stderr goes to stdout)
#SBATCH -D directory (run job in specified directory)
                     (default: current directory)
#SBATCH -p partition (optional; selects type of node by required memory in GB)

partition ::= { mem_0064 (default) | mem_0128 | mem_0256 }
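Putting the general directives together, a minimal job script for VSC3 might look like this (a sketch; the job name, output file and program ./myprog are placeholders; the Leo3/Leo3e and Mach variants use the corresponding #$ and #PBS directives above):

#!/bin/bash
#SBATCH -J testjob            # job name (placeholder)
#SBATCH -o testjob-%j.out     # output file, %j is replaced by the job id
#SBATCH -p mem_0064           # default 64 GB nodes

./myprog                      # placeholder for the actual program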
Notification Directives
Leo3, Leo3e (SGE):
#$ -M mail-address
#$ -m b|e|a|s|n  (begin|end|abort|suspend|none)
Mach (PBS Pro):
#PBS -M mail-address
#PBS -m b|e|a|n  (begin|end|abort|none)
VSC3 (Slurm):
#SBATCH --mail-type=BEGIN|END|FAIL|REQUEUE|ALL
#SBATCH --mail-user=user
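For example, to be notified when a job ends or aborts (a sketch; user@example.org is a placeholder address):

# SGE (Leo3, Leo3e)
#$ -M user@example.org
#$ -m ea                             # mail at end and abort

# Slurm (VSC3)
#SBATCH --mail-user=user@example.org
#SBATCH --mail-type=END,FAIL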
Resource Directives
Leo3, Leo3e (SGE):
Run time
#$ -l h_rt=[hh:mm:]ss
Tasks, Threads
See Task Distribution below.
Per slot virtual memory size (bytes)
#$ -l h_vmem=size
Default: 2 GB (Leo3), 1 GB (Leo3e)
Per slot stack size limit (bytes)
#$ -l h_stack=size
Mach (PBS Pro):
Run time
#PBS -l walltime=[hh:mm:]ss
Tasks, Threads
#PBS -l select=ntask:ncpus=nthread
Request ntask times nthread CPUs. See Task Distribution below.
Memory
#PBS -l select=ntask:mem=size{mb|gb}
Request size MB or GB of memory for each of ntask tasks.
VSC3 (Slurm):
Run time
#SBATCH -t mm|mm:ss|hh:mm:ss|days-hh[:mm[:ss]]
Nodes, Tasks, Threads
#SBATCH -N minnodes[-maxnodes]
Request number of nodes for the job.
#SBATCH -n ntasks
Request resources for ntasks tasks.
#SBATCH -c nthreads
Request nthreads CPUs per task (default: 1).
Memory
#SBATCH --mem=MB
Request MB megabytes per node.
#SBATCH --mem-per-cpu=MB
Request MB megabytes per CPU.
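For example, requesting 2 hours of run time plus memory (a sketch with placeholder values):

# SGE (Leo3, Leo3e): 2 h wall clock time, 2 GB virtual memory per slot
#$ -l h_rt=02:00:00
#$ -l h_vmem=2G

# PBS Pro (Mach): 2 h wall clock time, 2 tasks with 8 CPUs and 32 GB each
#PBS -l walltime=02:00:00
#PBS -l select=2:ncpus=8:mem=32gb

# Slurm (VSC3): 2 h wall clock time, 32768 MB per node
#SBATCH -t 02:00:00
#SBATCH --mem=32768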
Task Distribution
Leo3, Leo3e (SGE):
#$ -pe parallel-env nslots
Request a total of nslots CPUs for the parallel environment.

parallel-env is one of:
openmpi-Xperhost
(Leo3: X={1|2|4|6|8|12})
(Leo3e: X={1|2|4|6|8|10|12|14|16|18|20})
openmpi-fillup
openmp
Mach (PBS Pro):
#PBS -l select=ntask:ncpus=1
For MPI jobs running ntask processes.
#PBS -l select=1:ncpus=nthread
For OpenMP or other multithreaded jobs running nthread threads.
VSC3 (Slurm):
#SBATCH -m node-distribution-method[:socket-distribution-method]

where
node-distribution-method ::= {block|cyclic|arbitrary|plane=options}
socket-distribution-method ::= {block|cyclic}

#SBATCH -c cpus-per-task
Hybrid programming: number of threads per MPI task.

For more options and details, see man sbatch.
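Two sketches (the MPI launcher mpirun, the use of srun, and the program names are assumptions that depend on the locally installed software environment): an MPI job on Leo3e using 40 CPUs in fill-up mode, and a hybrid MPI+OpenMP job on VSC3 with 2 nodes, 4 tasks and 8 threads per task:

# Leo3e (SGE): 40 CPUs, nodes filled up one after another
#$ -pe openmpi-fillup 40
mpirun ./mympiprog               # mpirun and ./mympiprog are placeholders

# VSC3 (Slurm): 2 nodes, 4 MPI tasks, 8 OpenMP threads per task (2 x 16 CPUs)
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 8
export OMP_NUM_THREADS=8
srun ./myhybridprog              # ./myhybridprog is a placeholder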

Interactive jobs
Leo3, Leo3e (SGE):
{qsh|qlogin} [ -pe parallel-env np ]
qsh starts an xterm session,
qlogin starts an ssh-like interactive session.
Mach (PBS Pro):
qsub -I [ -l select=ntask:ncpus=nthread ]
Starts an ssh-like interactive session.
VSC3 (Slurm): TBD
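For example, to open an interactive session with 4 CPUs (a sketch; the CPU count is a placeholder):

# Leo3, Leo3e (SGE)
qlogin -pe openmp 4

# Mach (PBS Pro)
qsub -I -l select=1:ncpus=4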

Remarks
Any non-directive (e.g. a command) terminates processing of directives in the script. Slurm options may be parameterized using macros, e.g. %j (job id), %u (user name).