In this document (permanently under construction) you will find various important details on the functionality and usage of Mach.

Using CPU and memory

Mach is a large x86_64 shared memory (ccNUMA) machine with the following characteristics:

  • 2048 CPUs (= cores) on 256 physical processor sockets
  • 16 TB of memory, i.e. 64 GB per socket, 8 GB per CPU.

In principle, every CPU in the system can access the entire 16TB of memory installed in the system. To avoid memory contention, the batch-system PBS allocates CPUs in multiples of so-called memory nodes. Each of the 256 memory nodes consists of one 8-Core processor socket and its local memory, so each node has 8 CPUs (=cores) and 64 GB of memory, totalling 2048 CPUs and 16 TB of memory. Effectively, 62 GB per memory node can be used by your programs.

If you submit a job to the PBS queuing system, PBS will look at your CPU and memory requirements and will allocate a CPU-set containing as many nodes as are necessary to fulfill both of your requirements.

Assigned CPU-sets will always be exclusive for your own use. If your program uses less memory or CPUs than requested, or if your PBS-job requests less than a multiple of 8 CPUs, the unused resources will sit idle. PBS will not assign these unused resources to other tasks.

You can check the static properties of this CPU-set by looking at the output of "qstat -f job-id". PBS always allocates entire memory nodes, so your CPU-sets will consist of multiples of 8 CPUs and 64 GB.

Example: if you request 12 CPUs and 50GB of memory, you will be assigned two memory nodes with a total of 16 CPUs and 128 GB for your exclusive use. If you request 8 CPUs and 100 GB, your CPU-set will consist of two memory nodes to satisfy your memory requirement, totalling 16 CPUs and 128 GB. It is normally a good idea to match your requirements to this granularity, but please request slightly less than an integer multiple of 64 GB of memory (see below).
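For instance, a request matching the two-node granularity of the first example above (16 CPUs and a bit less than 128 GB) might look like this in a job script; the exact resource keywords depend on the local PBS configuration, so treat this as a sketch:

#PBS -l select=1:ncpus=16:mem=120gb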

Consequences of exceeding assigned resources

To avoid memory contention among jobs, we have set up the system such that memory will be shared only within the nodes assigned to a job's CPU-set. So, if your program tries to use more memory than you requested in your PBS-job, it will neither be terminated nor will it use memory of nodes belonging to other jobs, but it will cause paging I/O to the swap area. This may impair the effective or even stable operation of the entire system, so it is imperative that your programs' memory use stays within the limits specified in your qsub command or job script. If you are uncertain about your memory requirements, use a sensible safety margin or observe your program's memory usage.

If the number of CPU-intensive threads in your program exceeds the number of CPUs in your CPU-set, some threads will have to share individual CPUs, generally resulting in poor performance (or in severe cases even instability of the entire system). Since the individual threads of many parallel programs busy-wait on other threads, this performance degradation cannot be detected by looking for asymmetric CPU load: the waiting threads still appear fully busy.

In fact, the amount of memory is slightly (16 MB) less than 64 GB for every other node. So if you request an integer multiple of 64 GB, every second node may be skipped, leading to sub-optimal placement.

Preventing the creation of excessive numbers of threads

Some programs can be told to use all available resources or even do this by default (Java applications are particularly notorious for this). Since most software developers are not aware of CPU-sets, this almost invariably leads to the program trying to use 2048 CPUs and 16 TB of memory, which may be OK on a desktop machine but, on Mach, results in gross oversubscription of your CPU-set. So please never let your program use all of the machine's resources; instead, tell it explicitly to use as many CPUs and as much memory (or rather: slightly less) as you allocated in your qsub statement.

If there is no obvious way (e.g. command line arguments or configuration file) to tell the program the number of active threads to use, it may be tricked into using the correct number of CPUs with the following workaround (example):

module use /apps/uibk/adm/modulefiles    # you may want to add
module load uibk-adm                     # this to $HOME/.profile
sysconfcpus -n $(nproc) ./my-prog arg1 ...   # $(nproc) evaluates to the number of CPUs in your CPU-set

Note: the path to your program must be given explicitly.

Technical background: libsysconfcpus. This library makes it possible to set the number of CPUs returned by the sysconf(3) call explicitly, so the workaround works for programs that use this call to discover the number of CPUs. It has been tested successfully with Java.
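If you want to verify the effect of the wrapper before starting a long job, the following untested sketch may help; it assumes that getconf(1) also obtains the CPU count via sysconf(3), so the overridden value should be visible there:

module load uibk-adm                           # as above
getconf _NPROCESSORS_ONLN                      # value reported without the wrapper
sysconfcpus -n 4 getconf _NPROCESSORS_ONLN     # should now report 4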

Hints on using MPI

The preferred MPI implementation on Mach is SGI's message passing toolkit MPT. This is a highly optimized MPI implementation, which for some tasks may be an order of magnitude faster than e.g. OpenMPI.

Access MPT by activating the module:
module load intel    (if you use Intel compilers - recommended)
module load mpt   

Please avoid using SGI's "mpicc", "mpif90", etc. commands. These are poorly implemented and will in general not work with makefiles as expected, sometimes even causing malfunctions (if the environment variable "$CC" is set to "mpicc", your Makefile will act as a fork bomb; the same applies to "$FC" and "mpif90", etc.).

Instead, directly call the compiler "icc", "ifort", etc., and use the options
-I/opt/sgi/mpt/mpt-2.04/include   for compiling, and
-L/opt/sgi/mpt/mpt-2.04/lib ... -lmpi   for linking.
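For illustration, compiling and linking a hypothetical C source my_mpi_prog.c with these options might look as follows (the MPT path is the one quoted above; adjust it to the MPT version you actually loaded):

icc -I/opt/sgi/mpt/mpt-2.04/include -c my_mpi_prog.c
icc -o my_mpi_prog my_mpi_prog.o -L/opt/sgi/mpt/mpt-2.04/lib -lmpi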

See also the following section on pinning threads to CPUs.

CPU placement (pinning) for optimal performance - dplace

If all of the following is true for your jobs:

  • You use less than 8 GB per process or thread
  • You use SGI's MPT for MPI parallelization

then you may safely ignore the following considerations. Otherwise, please read on.

According to our experiments, performance on the Altix is very sensitive to fluctuations in CPU assignment: if the Linux kernel moves threads or processes around, the performance of parallel jobs may be severely affected.

To prevent this from happening, you need to control the placement of processes or threads onto individual CPUs within a CPU-set. Some toolkits (such as SGI's MPT) do this by default and perform well in the cases mentioned above. Otherwise, you may improve the performance of your code by controlling placement explicitly.

One software-independent way to achieve this is the dplace command. For the time being, please consult UIBK HPC staff on usage of this tool.

Selected use cases follow.

Optimal placement using MPT

SGI's MPT is automatically aware of CPU placement requirements, hence dplace is not needed. For comparison with the following OpenMPI example, we give the pertinent lines from a job script.

#!/bin/bash
# [...]
# adjust as needed
#PBS -l select=16:ncpus=1

module load intel/<version>
module load mpt/<version>

cd $PBS_O_WORKDIR
NSLOTS=`cat $PBS_NODEFILE | wc -l`
mpirun -np $NSLOTS ./your_mpi_executable [extra params] 

Using dplace with OpenMPI

We recommend using the settings shown in the following job fragment:

#!/bin/bash
# [...]
# adjust as needed
#PBS -l select=16:ncpus=1

module load intel/<version>
module load openmpi/<version>

cd $PBS_O_WORKDIR
NSLOTS=`cat $PBS_NODEFILE | wc -l`
dplace -s 1 -c 0-$(($NSLOTS - 1)) mpirun -np $NSLOTS ./your_mpi_executable [extra params] 

Explanation: OpenMPI starts one inactive shepherd process. The "-s 1" (skip 1) option tells dplace to skip this first process in its CPU allocation. (Other MPI implementations have different requirements; e.g. MVAPICH would need "-s 2".) The remaining processes are assigned to the allocated CPUs of the CPU-set in sequence.

Using Intel Math Kernel Library (MKL)

As a result of increasing modularity supporting a growing variety of use cases, the Intel MKL has been split into a large number of interdependent component libraries. Consequently, the traditional "-lmkl" link argument needs to be replaced by a more complex arrangement of arguments depending on the actual requirements (e.g. compiler version, architecture, default integer size, multithreading, MPI implementation).

As an aid for constructing the correct command line for linking programs that use the MKL, Intel provides an online tool, the "Intel Math Kernel Library (Intel MKL) Link Line Advisor". Usage of this tool is recommended for all uses of Intel MKL.

Hint for using the tool: if you revise any of your choices after making your selections, the result may be incorrect. Instead of changing individual choices, hit the "Reset" button and start over.
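For orientation only, a link line produced by the advisor for one common combination (Intel compiler, Intel64 architecture, 32-bit integers, sequential MKL) typically resembles the following; your combination may differ, so always confirm with the advisor. The example assumes that the intel module defines $MKLROOT:

ifort my_prog.f90 -I${MKLROOT}/include \
      -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm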

Observing resources used by your programs

Mach is a huge and tightly coupled system. Although the automatic creation of CPU-sets by the batch system allows a moderate degree of containment of individual jobs, there are job activities that may massively impair operation of the entire system and must be avoided under all circumstances.

Most of the malfunctions that can render the entire system unusable are caused by the following problems of user programs:

  • Trying to use more memory than allocated in the qsub command, leading to excessive page faults, and
  • starting more CPU-intensive threads than specified in the qsub command, leading to process/thread contention.

Among other reasons for degraded performance are:

  • Underusage of requested resources due to imbalanced CPU-load or mismatch of CPU-set size and number of threads/processes,
  • Incorrect or missing CPU-placement of threads,
  • Usage of non-recommended or incorrectly configured MPI implementations (e.g. using TCP/IP instead of shared memory for local communication),
  • Excessive I/O...

It is every user's responsibility to make sure that their programs operate within the limits specified in their qsub commands. In addition to a good understanding of program behaviour, the following tools are helpful in keeping track of your programs' resource usage. Please observe the resource usage of your jobs closely until you are sure that you fully understand your programs' usage patterns, in particular whenever the system appears to be sluggish or unresponsive.

Important Note: The tools ps, top, jkutop, and cpusetinfo described in this section access the /proc filesystem to gather information. E.g., a typical ps invocation will open four files in /proc per process, resulting in on the order of 100,000 or more files being read and parsed in a typical load situation. These tools therefore consume a significant amount of system resources and are not suitable for continuous use. Please use them sparingly and do not forget to terminate top or jkutop when you are done watching your processes.

qstat

qstat -u $USER lists your queued and running jobs.

qstat -f job-id displays details of job job-id. Note that most of the resource-related output of this command refers to static properties of your job, such as allocations by PBS.
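For example, to pick out just the resource-related lines of that output (the job id below is a placeholder), something like the following may be used; Resource_List shows what you requested, resources_used what PBS has accounted so far:

qstat -f 123456 | grep -E 'Resource_List|resources_used'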

top

top -u $USER displays a dynamic listing of all processes running under your user ID. By default, not all relevant parameters are displayed. While top is running, type "h" to display a short help summary or use the following initialization file (copy and paste the following lines into your command line):

cat >$HOME/.toprc <<STOP
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=ABEJWHIOQTKNMcdfgPlrSUvYz{|X
        winflags=128313, sortindx=13, maxtasks=0
        summclr=1, msgsclr=1, headclr=3, taskclr=1
STOP

Most important key strokes in top (case sensitive):

h
    help
H
    toggle display of individual threads
f
    add / remove fields
o
    change display order of fields
F or O
    select sort field
W
    Save changes in configuration file $HOME/.toprc

Things to watch for

(see also "man top"):

S
    Status. All processes or threads that should be doing work should be in status "R" (running).
VIRT, RES
    Virtual and resident memory. For a program using all requested memory, these two numbers should be roughly the same. If RES is much less than VIRT and RES is close to the memory size of your CPU-set, this may be a sign of trouble (memory starvation, paging).
%CPU
    With thread display turned on, each active thread should show around 100%; with thread display off, an active process should show around nthreads*100%.
nFLT
    Number of page faults. This number should be very low (a few dozen at most). Otherwise watch out for memory starvation.
P
    Processor number. Watch out for multiple CPU-intensive threads on the same processor, indicating that your program uses more threads than you requested.

Other useful options

-n n
    runs n cycles and then quits
-b
    Batch mode. Do not format for terminal but write output in a form suitable for editing or piping to another command.

Example

top -u $USER -b -n 1 > top.out

See also: Linux Forums: Using Top More Efficiently

procps-ng: ps, top, yaps

A locally modified version of top(1) and ps(1) can be made available by

module use /apps/uibk/adm/modulefiles    # you may want to add
module load uibk-adm                     # this to $HOME/.profile

When this module is active, use "man top" and "man ps" to obtain documentation. The local addition is these commands' ability to display the "%SYS" attribute of processes.

The procps-ng implementation of top has been completely redesigned, can display more attributes, and is easier to customize than the default top. Its .toprc file is not backwards compatible and may contain characters that cannot easily be edited by hand.

If you have added uibk-adm to your profile, you may copy the provided .toprc file into your HOME directory. These initialisation files cause the procps-ng top and jkutop (see below) to display additional relevant process attributes. You need a wide terminal window for this to work.

The locally written script yaps calls ps with a number of important ps-options.
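If the module is not loaded, a plain ps call along the following lines shows comparable per-process information using standard ps format keywords (nlwp is the number of threads, psr the processor a process last ran on; the local %SYS column is not available this way):

ps -u $USER -o pid,nlwp,psr,stat,pcpu,pmem,rss,vsz,comm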

jkutop

A modified version of "top" has been developed by JKU staff, called "jkutop". Interactive key bindings differ from top; jkutop also accepts mouse events if your terminal emulation forwards these. Hit "h" to get help.

To get relevant information from jkutop, download the file .jkutoprc into your HOME-directory or copy and paste the following contents:

cat >$HOME/.jkutoprc <<STOP
#AUTOMATICALLY GENERATED by jkutop (called as jkutop), careful when editing
#The parser is not very robust, so if you edit this manually, better do it
#right. If you screw up (or to go back to defaults) delete this file
# SWAP does not work always zero dont use
Fields: PID nTHR CPUSET USER PR nMin nMaj dFLT NI VIRT RES S %SYS %CPU %MEM TIME+ COMMAND
STOP

Issue the shell command jkutop -u $USER. This command works similarly to top but uses fewer resources, at the cost of reduced accuracy.

cpusetinfo

A locally developed utility, cpusetinfo, gives a per-job overview of used resources. Usage example:

module use /apps/uibk/adm/modulefiles    # you may want to add
module load uibk-adm                     # this to $HOME/.profile
cpusetinfo | hgrep $USER

All percentages are relative to the resources requested by each job. Here it is easy to observe over- and underuse of resources. In particular, look at CPUs, USED, %CPU, %SYS, GB, %USED, %RSS, MAJFLT, (procs,lwps).

Interactive sessions using screen and qsub -I

When you log on to Mach, your processes will run in the login-CPU-Set called "/user", which has only 16 CPUs and 32 GB of memory. Login sessions should never be used for any work using considerable amounts of resources, lest interactive usage be impaired for all users.

For CPU-intensive work, try to run non-interactively and use the batch system. If your work is necessarily interactive, you may use "qsub -I" to start an interactive batch session. These sessions, however, will be terminated whenever the network connection is interrupted or your local workstation is shut down.

To secure your interactive sessions against this type of interruption, you may use the screen utility on Mach. After logging in with "ssh", issue "screen -D -R". If a previous screen-session exists, it will be reconnected, else, a new session will be created. This does not work for graphical (X-Window-System) sessions because screen does not have an X-proxy.

Starting a robust interactive job

  1. connect to MACH using your favourite SSH-client
  2. screen -D -R
  3. qsub -I [more qsub options]
  4. When your interactive batch job has started, use any shell commands including CPU-intensive programs. By default, your CPU-set will contain 8 CPUs.

This starts an interactive PBS-job inside a screen session.
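Put together, a typical session might look like this (the host name and the resource request are placeholders; adjust them to your needs):

ssh <your-login>@<mach-login-host>     # on your workstation
screen -D -R                           # on Mach: create or re-attach a screen session
qsub -I -l select=1:ncpus=8            # inside screen: start the interactive batch job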

Detaching your interactive job

Hit <ctrl-a> <ctrl-d>

Your screen-session will be detached from your terminal and continue in the background, and you may safely log out. If your ssh-connection is terminated (e.g. due to a network problem or because you close your local window or turn off your local computer), your screen-session will be automatically detached as well.

Re-attaching your screen-session

  1. SSH-connect to MACH as above
  2. screen -D -R

Your existing screen-session will be reconnected. If the session was lost (e.g. due to PBS terminating your job or a system restart), a new screen-session will be created.

The screen-utility offers many options and can even serve as a character-oriented window manager. For more details, see "man screen(1)".

Please note: the batch system assigns resources exclusively to interactive jobs in the same way as with normal jobs. These resources are not available to other users during the entire duration of these jobs. Please use responsibly and make sure you terminate interactive jobs when you are finished.

Matlab and Mathematica: Multithreading and parallelization on Mach

Matlab and Mathematica have similar multithreading capabilities, which can start huge numbers of threads and may lead to serious trouble on Mach unless controlled by the user. The specific methods to prevent this behaviour are described below for each product.

Common features

  • Both products use Java for their GUI, graphics and possibly some other functions. The Java engine has a high startup cost, starts a large (sometimes huge) number of threads, and should generally be avoided in batch jobs.
  • In addition, both products have builtin shared-memory multithreading capabilities that are automatically used for certain functions (such as many Linear Algebra and FFT routines) and are substantially faster than their sequential counterparts. For Matlab, this type of parallelization is independent of the functionality provided by the Parallel Computing Toolbox and often far more efficient.
  • Both types of parallelization need to be controlled by the user on Mach.

Matlab

Prior to R2014a, by default, Matlab will sometimes incorrectly try to grab all CPUs in the system (2048 in the case of Mach, leading to 2048 threads contending for 8 or 16 CPUs). This behaviour is definitely undesirable, so please avoid calling earlier versions of Matlab with default settings.

Fixing computational multithreading misbehaviour with earlier versions of Matlab

Please see also the hints on Parallel / Multithreaded Matlab use.

The following fix is not necessary for Matlab R2014a and onward, but please continue to read the following paragraphs for information on controlling Java multithreading behaviour.

For earlier versions of Matlab, set the number of threads with the Matlab statement

maxNumCompThreads(n)

where n is the number of desired threads, typically the size of the CPU-Set that you are using. You may ignore the warning that this function is deprecated. If you have started your PBS job using the select statement

-l select=1:ncpus=n

PBS sets the environment variable OMP_NUM_THREADS to the correct size of the CPUset, and you may automate adjustment of permitted threads by the Matlab statement

maxNumCompThreads(str2num(getenv('OMP_NUM_THREADS')))

Avoiding unnecessary Java threads

This and the following tips apply to all versions of Matlab. Called with default options, Matlab will start a large number of Java threads to support the GUI and other functions. The desktop-GUI works very slowly over the network and is not recommended. Three options are recommended to control or even turn off the Java/GUI-related components of Matlab.

-nodesktop
    Do not start the MATLAB desktop; the current terminal is used for commands. The Java virtual machine will still be started. Use this option for interactive work. GUI functions such as the online documentation (command: doc) or graphics are available and will open in separate windows.
-nodisplay
    Suppresses all graphical functions. The Java engine is still started. Use this option for non-interactive batch usage if you determine that your jobs need the Java engine.
-nojvm
    Completely turns off the Java engine. In particular, this suppresses the desktop, all graphics functions and any other functions depending on Java. Use this option for non-interactive batch usage unless you need Java for specific functions.

Running Matlab single-threaded

In certain cases (e.g. job farming, i.e. running several Matlab instances independently in the same job) it may be desirable to run Matlab single-threaded. Use the Matlab command line option -singleCompThread, preferably in combination with the other options described above.
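Putting these options together, a minimal non-interactive job fragment might look as follows; the matlab module name and the script name my_script are assumptions, so adjust them to your installation and code:

#!/bin/bash
#PBS -l select=1:ncpus=1
# [...] other PBS options

module load matlab/<version>     # module name assumed; check "module avail" for installed versions
cd $PBS_O_WORKDIR
# no Java, no graphics, one computational thread; my_script.m must be on the Matlab path
matlab -nojvm -singleCompThread -r "my_script; exit" > matlab.log 2>&1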

Using the Parallel Computing Toolbox

In contrast to Matlab's builtin shared-memory multithreading capabilities, the Parallel Computing Toolbox provides OS-independent access to coarse-grained distributed-memory parallelism, using special constructs such as parfor and spmd and special datatypes such as distributed and composite variables.

Before using these functions, please consider the following properties and limitations of the Parallel Computing Toolbox:

  • Work is distributed to so-called worker instances of Matlab in a client-server configuration. The interactive session is the client, the worker instances are the servers. Before using constructs such as parfor, spmd, etc., these worker instances should be created explicitly by a mypool=parpool(n) statement, where n is the desired number of workers (it should equal the number of CPUs in your CPU-set). If you do not create the worker pool explicitly, Matlab will automatically create at most 12 worker instances, even if the CPU-set is larger. A sketch of such an invocation is given after this list.
  • The individual worker instances execute all code sequentially, no multithreading is used.
  • Due to the distributed memory architecture of the Parallel Toolbox, data needed by the workers is automatically transferred to the workers even when they execute on the same host. This may create substantial communication overhead in comparison to the builtin shared memory multithreaded parallelism.
  • Consequently, the Parallel Computing Toolbox should only be used after verifying that your programs cannot sufficiently profit from the builtin multithreading, and that parallelization and communication overheads do not defeat the speed gains of parallel execution.
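As referenced above, a hedged sketch of creating a pool sized to the CPU-set from within a batch job is given below; my_parallel_script is a placeholder for your own code, and note that the Parallel Computing Toolbox needs the Java engine, so -nojvm cannot be used here:

matlab -nodisplay -r "pool=parpool(str2num(getenv('OMP_NUM_THREADS'))); my_parallel_script; delete(pool); exit"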

Mathematica

In the default setup, the number of Java and OMP-threads created by Mathematica 8.0.4 will be equal to the number of CPUs (2048 on Mach) plus some fixed number, which may lead to poor performance and may impair overall system behaviour. Our setup of Mathematica 8.0.4 on Mach avoids the allocation of excessive numbers of threads by setting the environment variables OMP_NUM_THREADS (unless defined) and _JAVA_OPTIONS to sensible values. Specifically, if OMP_NUM_THREADS is not defined, it will be set to 8, else, its current value will be left untouched.

To suppress the GUI, which is useless in batch jobs anyway, the command math may be used to access the Mathematica kernel directly. To start your Mathematica script directly, use math -noprompt -script myjob.m.

If your jobs profit from a larger number of CPUs, you may submit them in a suitably sized cpuset as illustrated in the following job fragment.

#!/bin/bash
#PBS -l select=1:ncpus=n
# selects a CPU set with n CPUs, sets OMP_NUM_THREADS=n
# ... other PBS options

module load mathematica/8.0.4
cd $PBS_O_WORKDIR

math -noprompt -script myjob.m

Note: the sysconfcpus workaround described above causes Mathematica to crash.
