CUDA

Description

The Compute Unified Device Architecture (CUDA) toolkit provides a C/C++ interface for programming NVIDIA's graphics processing units (GPUs). It also includes a number of GPU-accelerated libraries (such as cuFFT, cuBLAS, etc.), as well as an OpenCL interface. See the NVIDIA documentation for a broader overview and more detailed information.

External documentation: https://docs.nvidia.com/cuda/

Usage

Programming in CUDA (that is, programming GPUs and getting good performance out of them) is rather tricky and, due to the multitude of libraries and programming approaches, beyond the scope of this document. A collection of CUDA C/C++ tutorials can be found here.

The following sections are therefore limited to describing how to reserve GPUs for your jobs, how to compile a very simple test program, and the virtual memory requirements of the CUDA API.

Reserving GPUs for SGE batch jobs

When submitting a job to the SGE batch scheduler, please specify the number of GPUs required by adding the following SGE option:

-l gpu=#REQUIRED_GPUS

Issue the command qhost -F gpu to see how many GPUs are available on each of the system's compute nodes.
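
The gpu option can either be given on the qsub command line or embedded in the job script as an SGE directive. A minimal sketch of the directive form, assuming a (hypothetical) node with two GPUs:

#!/bin/bash
#$ -l gpu=2        # request both GPUs of a two-GPU node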

CURRENTLY, EXCLUSIVE RESERVATION OF GPU DEVICES IS NOT AVAILABLE!

To use GPUs safely on nodes with multiple devices, please always reserve all available GPUs on the node, i.e. the complete node, for your jobs.

Note: Parallel jobs that require GPUs on multiple nodes are currently not supported. Please contact the HPC administration if your application requires such a setup.

Compiling and running a very simple GPU program

Before compiling and/or running a CUDA application, load the appropriate cuda module (get all available cuda modules with module avail cuda):

module load cuda/version
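
For example, assuming a module for a hypothetical CUDA version 11.2 were installed on the system:

module load cuda/11.2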

A very simple CUDA program file simple_saxpy.cu might look like this:

#include <stdio.h>
#include <stdlib.h>

// CUDA saxpy implementation -
//  benchmark optimum for NVIDIA graphics cards due to madd instruction!
__global__
void saxpy_cuda(int n, float a, float *x, float *y)
{
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {

        int    n = 1<<10, i;
        float  c = 2.0, *x, *y, *d_x, *d_y;
        int    sm_blocksize, nblocks;

        // Set SM blocksize - device dependent
        sm_blocksize = 256;
        // Determine number of blocks to be scheduled
        nblocks = (n + sm_blocksize-1) / sm_blocksize;

        // Allocate and initialize vectors in main memory
        x = (float *) malloc(n*sizeof(*x));
        y = (float *) malloc(n*sizeof(*y));
        for (i=0; i<n; i++) {
                x[i] = (i+1)/2.0;
                y[i] = n-i;
        }

        // Allocate and initialize vectors in GPU memory
        cudaMalloc(&d_x, n*sizeof(*d_x));
        cudaMalloc(&d_y, n*sizeof(*d_y));
        cudaMemcpy(d_x, x, n*sizeof(*d_x), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, n*sizeof(*d_y), cudaMemcpyHostToDevice);

        // Invoke saxpy on the GPU
        saxpy_cuda<<<nblocks, sm_blocksize>>>(n, c, d_x, d_y);

        // Copy result back to main memory (synchronizes with the kernel)
        cudaMemcpy(y, d_y, n*sizeof(*d_y), cudaMemcpyDeviceToHost);

        // Verify: with x[i] = (i+1)/2 and y[i] = n-i, every element of
        // the result equals n+1
        for (i=0; i<n; i++)
                if (y[i] != n+1) printf("error at index %d: %f\n", i, y[i]);

        // Release GPU and host memory
        cudaFree(d_x);
        cudaFree(d_y);
        free(x);
        free(y);

        return 0;
}


The CUDA executable simple_saxpy can be compiled from this CUDA program file by invoking the CUDA compiler driver nvcc with the following command line:

nvcc -o simple_saxpy simple_saxpy.cu

Try nvcc --help for more information about usage and available options.

Note: Currently only the GNU compilers are fully supported as host compilers with nvcc on Linux systems.
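
For brevity, the example program above does not check return codes. In real applications you should check every CUDA runtime call and the kernel launch; below is a minimal sketch of such a check. The macro name CUDA_CHECK is our own convention, while cudaError_t, cudaSuccess, cudaGetLastError and cudaGetErrorString are part of the CUDA runtime API:

#include <stdio.h>
#include <stdlib.h>

// Abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                             \
        do {                                                         \
                cudaError_t err = (call);                            \
                if (err != cudaSuccess) {                            \
                        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                                __FILE__, __LINE__,                  \
                                cudaGetErrorString(err));            \
                        exit(EXIT_FAILURE);                          \
                }                                                    \
        } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_x, n*sizeof(*d_x)));
//   saxpy_cuda<<<nblocks, sm_blocksize>>>(n, c, d_x, d_y);
//   CUDA_CHECK(cudaGetLastError());  // catches kernel launch errors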

Virtual memory requirement of the CUDA API

By default, the CUDA API allocates half of the available virtual memory on the system (up to a certain limit). To stay within your virtual memory limits, you therefore need to request at least twice the amount of memory your application actually uses on the host. For example, if your program allocates 5 gigabytes of main memory (e.g. with a malloc call), you need to request at least 10 gigabytes of virtual memory when submitting your job to the SGE batch scheduler. Reserve a bit more than that to account for job overhead.

To reserve 12 gigabytes, add the SGE option -l h_vmem=12G to your job submission, for example:

qsub -l gpu=1 -l h_vmem=12G my_jobscript.sh
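
A matching my_jobscript.sh for the example program might look like this (a sketch; replace version with a module version actually installed on the system):

#!/bin/bash
#$ -N simple_saxpy             # job name (arbitrary)

module load cuda/version
./simple_saxpy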
