General Purpose GPU Processing On The UIBK Leo Clusters

Performance enhancement using general-purpose GPUs (GPGPU) plays an increasingly important role in HPC. The ZID plans to support this type of processing by gradually adding GPU resources to its HPC infrastructure.

Currently available GPU equipment:

Pilot installation on Leo4

One GPU node with the following hardware:

2 sockets Intel Xeon 6132 with 14 cores each (total usable CPUs: 28)
384 GB main memory
4 datacenter-grade Nvidia Tesla V100 GPUs with 32 GB of memory each and NVLink interconnect

Note: At the current stage, the ZID HPC team is interested in building and improving its GPU know-how. Interested users are highly welcome to contact us and discuss their plans and needs.

The SGE batch scheduler has been configured to allocate any requested number of GPUs (up to the maximum available) exclusively to a job. Several jobs using distinct GPUs may run simultaneously on the same node.

How To Use GPUs

To successfully run a program on a GPU node in a compute cluster, the following conditions must be met:

The program must use GPU libraries or must be written in a language / toolkit that allows generation of GPU code.
The job which starts your program must be submitted to a GPU-enabled queue.

Resource Limitations

Since GPUs cannot be shared, we have set the default/maximum run time limit (h_rt) for gpustd.q to 96 hours (4 days). Please adjust the granularity of your jobs accordingly.

As of June 2020, our SGE setup on Leo4 supports only non-interactive batch jobs. When an interactive job is submitted to the GPU queue, the queue fails and requires manual intervention by the system administrators. If you have accidentally submitted an interactive job, please contact us so that we can correct the situation.

Compiling and running programs using the CUDA toolkit

Code for GPUs can be generated with CUDA and OpenCL. Due to slow adoption of the latter, most available software supports CUDA only. Documentation for CUDA can be found on Nvidia's CUDA homepage.

To find out which CUDA versions are available on the cluster, issue the command

module avail cuda

on the login node. Load the appropriate CUDA module with

module load cuda/version

CUDA programs may be compiled on the login node, but must be tested and run on a node that has GPUs installed.

Python Support for GPUs

Our primary line of support for Python on our Leo clusters is the Anaconda/Miniconda distribution. We have installed several Conda environments; some of them consist of dozens or even hundreds (e.g. python-anaconda) of Python packages. Currently, two environments offer GPU support:

pytorch
tensorflow-gpu

To list all installed Conda environments, issue

module avail Anaconda3

By loading the Anaconda3/miniconda-base module and defining your own Conda environments, you may add your own packages. For more details, see the Anaconda documentation.

Running GPU Batch Jobs

To create a GPU batch job, adapt the following template to your needs:

#!/bin/bash
#$ -N my_gpu_job
#$ -q gpustd.q
#$ -cwd
#$ -l h_rt=1:00:00    # run time
#$ -l h_vmem=12G      # main memory per job slot
#$ -l gpu=1           # number of GPUs
#$ -pe openmp 7       # number of job slots (CPUs)
#$ -j yes             # join stdout and stderr (needed for "set -x" below)

module load cuda/xxxx.yyyy
# module load ###     # uncomment and adapt to load more modules if necessary

# log some details of execution environment
set -x
cat "$0"
env | sort

# for programs that use OpenMP and/or Intel MKL
# replace this by methods appropriate to your program
export MKL_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS=$NSLOTS

# replace the following by your command and its GPU/CPU selection syntax
my_command --ngpus=$NGPUS --ncpus=$NSLOTS

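Request CPUs, GPUs and memory in approximately equal fractions of the total resources available in the GPU node. For example, a job that requests one of the four GPUs should request about a quarter of the 28 CPUs (7 job slots, as in the template) and no more than about a quarter of the main memory. Note that the h_vmem parameter is the amount of memory per job slot, so the total memory requested by the template is 7 x 12G = 84G.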
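As a rough illustration of the proportionality rule, the following Python sketch derives the number of job slots and a ceiling for h_vmem from the number of GPUs requested. The node totals are taken from the hardware list above. The function name proportional_request is hypothetical and not part of any cluster tooling. The memory actually usable by jobs is probably somewhat lower than the installed 384 GB, so treat the result as an upper bound, not as a recommendation.

```python
# Sketch: proportional CPU and memory request for a GPU job.
# Assumption: node totals as listed for the Leo4 GPU node
# (4 GPUs, 28 CPUs, 384 GB). The helper name is hypothetical.

def proportional_request(ngpus, total_gpus=4, total_cpus=28, total_mem_gb=384):
    """Return (job_slots, h_vmem_gb_per_slot) for a job using `ngpus` GPUs."""
    if not 1 <= ngpus <= total_gpus:
        raise ValueError(f"ngpus must be between 1 and {total_gpus}")
    # CPU share proportional to the GPU share (whole job slots).
    slots = total_cpus * ngpus // total_gpus
    # Memory share of the whole job, then divided by the slot count
    # because h_vmem is specified per job slot. Rounded down to whole GB.
    mem_share_gb = total_mem_gb * ngpus // total_gpus
    h_vmem_gb = mem_share_gb // slots
    return slots, h_vmem_gb


if __name__ == "__main__":
    for n in range(1, 5):
        slots, h_vmem = proportional_request(n)
        print(f"-l gpu={n} -pe openmp {slots} -l h_vmem={h_vmem}G")
```

For one GPU this yields 7 job slots and a ceiling of 13 GB per slot; the h_vmem=12G used in the template stays below that ceiling.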

Environment Variables in GPU Jobs

When your GPU program is started on a GPU node, it will automatically discover the GPUs assigned to your job, influenced by the environment variables described below. These are set by SGE for GPU jobs. Please do not change these variables!

NGPUS: The number of GPUs allocated to this job. Corresponds to the -l gpu=ngpu SGE parameter. Please allocate only one GPU unless you know that your program can use multiple GPUs efficiently in parallel. The --ngpus=$NGPUS option in the example above should be replaced by your program's own syntax for choosing the number of devices to be used.
NSLOTS: The number of CPUs (job slots) allocated to this job. Corresponds to the -pe openmp nslots SGE parameter.
Note: Well-written GPU programs can efficiently use multiple CPU cores in addition to GPUs, achieving hybrid performance that exceeds the capabilities of GPUs or CPUs taken separately. Our example uses 7 job slots - modify this parameter to tune the GPU-CPU balance according to the capabilities of your program.
CUDA_VISIBLE_DEVICES: The GPUs selected by SGE for use by your job.
WARNING: Contrary to many suggestions on the web, do not change this variable, or your job will interfere with other GPU jobs running on the same node. The effects range from slowdown to complete failure of program runs (e.g. due to failed memory allocations). This may - and often will - affect other users. Please also make sure that your program does not override this setting.
To choose among the GPUs assigned to your job, use the CUDA call cudaSetDevice(int device) or the equivalent method of your programming environment (e.g. torch.device("cuda:n") in PyTorch or tf.device('/gpu:n') in TensorFlow).
The device number n is relative to the selection and ranges from 0 to $NGPUS - 1.
See also the top answer to the Stack Overflow question "How to choose device when running a CUDA executable?", which starts with "The canonical way...".
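The relative device numbering can be illustrated with a small Python sketch that reads NGPUS and builds the job-relative device names without modifying CUDA_VISIBLE_DEVICES. The helper name job_device_names is hypothetical. The resulting strings have the "cuda:n" form mentioned above for torch.device in PyTorch; the sketch itself does not import PyTorch.

```python
import os

# Sketch: derive job-relative device names from the variables set by SGE.
# The helper name is hypothetical. CUDA_VISIBLE_DEVICES is only read for
# logging and is never changed.

def job_device_names(environ=os.environ):
    """Return ['cuda:0', ..., 'cuda:<NGPUS-1>'] for the current job."""
    ngpus = int(environ.get("NGPUS", "0"))  # 0 if not running in a GPU job
    return [f"cuda:{n}" for n in range(ngpus)]


if __name__ == "__main__":
    # Log the selection made by the scheduler, but leave it untouched.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    for name in job_device_names():
        print("usable device:", name)
```

In a PyTorch program, each returned name could then be passed to torch.device, so that the program only ever addresses devices 0 to $NGPUS - 1 within its own selection.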

GPU Performance

According to Nvidia's specifications and third-party information, the V100 offers the following theoretical peak performance per GPU:

Category	Metric	Peak value
Performance	Double precision	7.8 TFlop/s
Performance	Single precision	15.7 TFlop/s
Performance	Half precision	31.3 TFlop/s
Performance	Tensor	125 TFlop/s
Interconnect	NVLink	300 GB/s
Memory	Capacity	32 GB
Memory	Bandwidth	900 GB/s
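For jobs that use several GPUs, the per-GPU figures can be scaled to a combined theoretical upper bound. The following Python sketch does this arithmetic with the values from the table. The function name aggregate_peak_tflops is hypothetical, and the result ignores communication and load imbalance, so real multi-GPU performance will be lower.

```python
# Sketch: combined theoretical peak for a multi-GPU job.
# Per-GPU values (TFlop/s) are copied from the table above; the helper
# name is hypothetical.
V100_PEAK_TFLOPS = {"double": 7.8, "single": 15.7, "half": 31.3, "tensor": 125.0}


def aggregate_peak_tflops(ngpus):
    """Return the summed theoretical peak (TFlop/s) for `ngpus` V100 GPUs."""
    if not 1 <= ngpus <= 4:  # the Leo4 GPU node has 4 GPUs
        raise ValueError("ngpus must be between 1 and 4")
    # Simple linear scaling; this is an upper bound, not a prediction.
    return {kind: round(peak * ngpus, 1) for kind, peak in V100_PEAK_TFLOPS.items()}


if __name__ == "__main__":
    for kind, peak in aggregate_peak_tflops(4).items():
        print(f"{kind:>6}: {peak} TFlop/s")
```

For all four GPUs of the node, this gives an upper bound of 31.2 TFlop/s in double precision.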