EN600.407 Spring 2008, Johns Hopkins University. Copyright Matthew Bolitho 2008.
2/19/2008
[Figure: the CUDA hardware model. A multiprocessor contains scalar processors (SPs) and shared memory; each thread has local memory; constant, texture, and device memory reside off-chip.]
A kernel is executed as a grid
A grid is a collection of thread blocks
A thread block is a collection of threads
Thread blocks and threads are given unique identifiers
Identifiers can be 1D, 2D or 3D
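For example, in a one-dimensional launch each thread can combine its block and thread identifiers into a unique global index (a minimal sketch; the built-in variables used here are covered later in these notes):

    __global__ void Kernel (float* data)
    {
        // Unique global index: this thread is threadIdx.x within
        // block blockIdx.x, and each block holds blockDim.x threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];    // each thread touches one element
    }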
Tasks:
Install the CUDA runtime and CUDA SDK
Get it working!
Write a simple CUDA program: compute the sum, inner and outer products of a large vector (an outer-product sketch follows below)
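As a starting point for the outer product, each thread can compute one element of the n-by-n result (a minimal sketch, not a full solution; the function and parameter names are hypothetical):

    __global__ void OuterProduct (const float* a, const float* b,
                                  float* out, int n)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
        if (i < n && j < n)
            out[i * n + j] = a[i] * b[j];
    }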
Allows programmers to easily move existing code to CUDA
Lessens the learning curve
The extensions fall into two groups: Variables and Functions
__device__
Declares that a global variable is stored on the device
The data resides in global memory
Has the lifetime of the entire application
Accessible to all GPU threads
Accessible to the CPU via the API
__shared__
Declares that a global variable is stored on the device
The data resides in shared memory
Has the lifetime of the thread block
Accessible to all threads, with one copy per thread block
If not declared as volatile, writes made by one thread are not guaranteed to be visible to other threads unless a synchronization barrier is used
Not accessible from the CPU
__constant__
Declares that a global variable is stored on the device
The data resides in constant memory
Has the lifetime of the entire application
Accessible to all GPU threads (read only)
Accessible to the CPU via the API (read-write)
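The three qualifiers might be combined as follows (a minimal sketch; the variable names are hypothetical, and each block is assumed to hold 64 threads):

    __device__   float g_Results[4096];   // global memory; application lifetime
    __constant__ float c_Weights[64];     // constant memory; read-only on the device

    __global__ void Kernel (void)
    {
        __shared__ float s_Buffer[64];    // shared memory; one copy per thread block
        s_Buffer[threadIdx.x] = c_Weights[threadIdx.x];
        __syncthreads();                  // barrier makes the writes visible to all threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        g_Results[i] = s_Buffer[63 - threadIdx.x];
    }

On the host, cudaMemcpyToSymbol can be used to initialize c_Weights and to access g_Results via the API.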
__host__
Declares that a function is compiled for and executes on the host
Callable only from the host
Functions without any CUDA declspec are __host__ by default
Can use __host__ and __device__ together
__global__
Declares that a function is compiled for and executes on the device
Callable from the host
Used as the entry point from host to device
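For example (a minimal sketch; the names are hypothetical):

    // Compiled for both sides; callable from host and device code
    __host__ __device__ float Square (float x) { return x * x; }

    // The kernel: an entry point launched from the host
    __global__ void SquareAll (float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = Square (data[i]);
    }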
CUDA provides a set of built-in vector types: char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4
Can construct a vector type with a special function: make_{typename}(v0, v1, ...)
Can access the elements of a vector type with .x, .y, .z, .w: vecvar.x
dim3 is a special vector type
Same as uint3, except it can be constructed from a scalar to form a vector: (scalar, 1, 1)
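For example (a minimal sketch):

    float4 v   = make_float4 (1.0f, 2.0f, 3.0f, 4.0f);
    float  sum = v.x + v.y + v.z + v.w;   // element access via .x, .y, .z, .w
    dim3   block (256);                   // constructed from a scalar: (256, 1, 1)
    dim3   grid (64, 64);                 // unspecified components default to 1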
CUDA provides four global, built-in variables: threadIdx, blockIdx, blockDim, gridDim
Typed as a dim3 or uint3
Accessible only from device code
Cannot take their address
Cannot assign a value to them
Func<<<GridDim, BlockDim>>>(Arguments, ...)
GridDim is a dim3 typed expression giving the size of the grid (i.e. the problem domain)
BlockDim is a dim3 typed expression giving the size of each thread block
The compiler turns this type of statement into a block of code that configures and launches the kernel
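For example, launching the SquareAll kernel sketched earlier over N elements might look like this (a minimal sketch; N, the block size, and d_Data are assumptions):

    const int N = 1048576;
    dim3 blockDim (256);
    dim3 gridDim ((N + 255) / 256);   // round up so the grid covers all N elements
    SquareAll<<<gridDim, blockDim>>> (d_Data);   // d_Data: a device pointer allocated earlier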
Important Differences:
Runtime Library Functions
Classes, Structs, Unions
Code that runs on the device can't use the normal C/C++ Runtime Library functions
No printf, fread, malloc, etc.
There is no malloc or free function that can be called from device code
How can we allocate memory?
From the host
Using CUDA 1.1 atomics, write a custom allocator
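Allocating from the host typically looks like this (a minimal sketch using the runtime API; the sizes and names are assumptions):

    float* d_Data = 0;
    cudaMalloc ((void**) &d_Data, N * sizeof (float));   // allocate device memory
    cudaMemcpy (d_Data, h_Data, N * sizeof (float),
                cudaMemcpyHostToDevice);                 // copy input to the device
    Kernel<<<gridDim, blockDim>>> (d_Data);              // device code just uses the pointer
    cudaMemcpy (h_Data, d_Data, N * sizeof (float),
                cudaMemcpyDeviceToHost);                 // copy the results back
    cudaFree (d_Data);                                   // release the device memory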
On a CUDA device, there is no stack
By default, all function calls are inlined
Can use __noinline__ to prevent this (CUDA 1.1)
All local variables and function arguments are stored in registers
No function recursion
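For example (a minimal sketch; the function name is hypothetical):

    __device__ __noinline__ float Evaluate (float x)   // kept as a real call (CUDA 1.1)
    {
        return x * x + 1.0f;
    }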
Classes are supported inside .cu source files, but must be host only
Structs/Unions work in device code as in C
No function pointers
CUDA source files end in .cu
They contain a mix of device and host code/data
Compiled by nvcc
nvcc is really a wrapper around a more complex compilation process
Input: normal .c/.cpp source files, and CUDA .cu source files
Output: object/executable code for the host, and .cubin executable code for the device
For .c and .cpp files, nvcc invokes the native C/C++ compiler for the system (e.g. gcc/cl)
For .cu files, it is a little more complicated
[Diagram: the nvcc compilation pipeline. A .cu file is preprocessed (cpp) and split by cudafe into host code (.c, .stub.c) and device code (.gpu.c); nvopencc compiles the device code to .ptx, which ptxas assembles into a .cubin; the host compiler and linker combine the host objects (.o) with the cubin.]
To see the steps performed by nvcc, use the --dryrun and --keep command line options
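For example, nvcc --dryrun --keep program.cu (the file name is hypothetical) prints each step of the pipeline and preserves the intermediate files for inspection.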
How is a .cubin linked with the rest of the program? It can be:
Loaded as a file at runtime
Embedded in the data segment
Embedded as a resource
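Loading at runtime is done through the CUDA driver API (a minimal sketch; the file and kernel names are hypothetical, and cuInit plus a context are assumed to already exist):

    #include <cuda.h>

    CUmodule   module;
    CUfunction function;
    cuModuleLoad (&module, "kernels.cubin");          // load the .cubin from disk
    cuModuleGetFunction (&function, module, "Func");  // look up a kernel by name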
By using a compiler flag, you can emulate by running all the code on the host
Compiler Flag: --device-emulation
Good for most debugging: can use gdb/printf
Not a true emulation: race conditions, memory model differences, etc.
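Under emulation the kernel body runs as ordinary host code, so something like this works (a minimal sketch; __DEVICE_EMULATION__ is the preprocessor symbol nvcc defines in this mode):

    #include <stdio.h>

    __global__ void Kernel (float* data)
    {
        int i = threadIdx.x;
    #ifdef __DEVICE_EMULATION__
        printf ("thread %d sees %f\n", i, data[i]);   // emulation only
    #endif
        data[i] *= 2.0f;
    }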