
EN600.407 Spring 2008, Johns Hopkins University
Copyright Matthew Bolitho 2008
2/19/2008

Lecture 4: CUDA Programming

[Figure: CUDA device architecture. A device contains several multiprocessors, each a collection of stream processors (SPs). Each SP has registers and per-thread local memory; each multiprocessor has shared memory; all multiprocessors access the global, constant, and texture memory that reside in device memory.]

A kernel is executed as a grid
A grid is a collection of thread blocks
A thread block is a collection of threads

Thread blocks and threads are given unique identifiers
Identifiers can be 1D, 2D, or 3D

Used to help identify which part of a problem a thread/block should operate on
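As a minimal sketch (the kernel name and arguments are illustrative, not from the slides), a 1D kernel typically combines these identifiers to find the element it owns:

    // Hypothetical example: each thread scales one element of a vector.
    __global__ void Scale(float* data, float factor, int n)
    {
        // Unique global index of this thread within the grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // the grid may be larger than n
            data[i] *= factor;
    }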


[Figure: a grid of thread blocks B(0,0)...B(2,1) on the device, with thread block (1,1) expanded into its threads T(0,0)...T(2,1).]

The assignment was available last week
Due on 02/27/08 (next week)

If you want access to CUDA hardware, email me: bolitho@cs.jhu.edu

Tasks:
Install the CUDA runtime and CUDA SDK
Get it working!
Write a simple CUDA program: compute the sum, inner product, and outer product of a large vector

CUDA Language
CUDA Compilation Tool Chain
CUDA Debugging Tools

CUDA: Compute Unified Device Architecture
Created by NVIDIA

A way to perform computation on the GPU


Specification for:
A computer architecture
A language
An application programming interface (API)


CUDA defines a language that is similar to C/C++
Allows programmers to easily move existing code to CUDA
Lessens the learning curve

Syntactic extensions:
Declaration Qualifiers
Built-in Variables
Built-in Types
Execution Configuration

How is CUDA C/C++ different from standard C/C++?

Declspec = declaration specifier / declaration qualifier
A modifier applied to declarations of:
Variables
Functions
Examples: const, extern, static

CUDA uses the following declaration qualifiers for variables:
__device__
__shared__
__constant__
These only apply to global variables
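As a brief sketch (the variable and kernel names are made up for illustration), the three qualifiers look like this in a .cu file:

    __device__   float g_results[256];   // global memory, lifetime of the application
    __constant__ float g_coeffs[16];     // constant memory, read-only on the device

    __global__ void Smooth(float* out)
    {
        __shared__ float s_tile[256];    // shared memory, one copy per thread block
        s_tile[threadIdx.x] = g_results[threadIdx.x] * g_coeffs[0];
        __syncthreads();
        out[threadIdx.x] = s_tile[threadIdx.x];
    }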


__device__:
Declares that a global variable is stored on the device
The data resides in global memory
Has the lifetime of the entire application
Accessible to all GPU threads
Accessible to the CPU via the API

__shared__:
Declares that a global variable is stored on the device
The data resides in shared memory
Has the lifetime of the thread block
Accessible to all threads, one copy per thread block
If not declared as volatile, writes made by one thread are not guaranteed to be visible to other threads unless a synchronization barrier is used
Not accessible from the CPU

__constant__:
Declares that a global variable is stored on the device
The data resides in constant memory
Has the lifetime of the entire application
Accessible to all GPU threads (read-only)
Accessible to the CPU via the API (read-write)
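A minimal sketch of host-side API access (the symbol name is illustrative); the runtime call cudaMemcpyToSymbol copies host data into a device symbol such as a __constant__ variable:

    __constant__ float g_coeffs[16];

    void UploadCoeffs(const float* host_coeffs)
    {
        // Copy 16 floats from host memory into constant memory.
        cudaMemcpyToSymbol(g_coeffs, host_coeffs, 16 * sizeof(float));
    }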

CUDA uses the following declaration qualifiers for functions:
__device__
__host__
__global__

__device__:
Declares that a function is compiled for, and executes on, the device
Callable only from another function on the device


__host__:
Declares that a function is compiled for, and executes on, the host
Callable only from the host
Functions without any CUDA qualifier are __host__ by default
Can use __host__ and __device__ together

__global__:
Declares that a function is compiled for, and executes on, the device
Callable from the host
Used as the entry point from host to device
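As a small sketch (the function names are illustrative), the three qualifiers together:

    // Device-only helper: callable only from other device code.
    __device__ float Square(float x) { return x * x; }

    // Compiled for both the host and the device.
    __host__ __device__ float Half(float x) { return 0.5f * x; }

    // Kernel: the entry point, launched from the host.
    __global__ void Apply(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = Half(Square(data[i]));
    }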

CUDA provides a set of built-in vector types: char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4

Can construct a vector type with a special function: make_{typename}(v0, v1, ...)
Can access elements of a vector type with .x, .y, .z, .w: vecvar.x

dim3 is a special vector type
Same as uint3, except it can be constructed from a scalar, which forms the vector (scalar, 1, 1)
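For instance (a minimal sketch):

    float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    float sum = v.x + v.y + v.z + v.w;   // element access via .x/.y/.z/.w

    dim3 grid(64);                       // constructed from a scalar: (64, 1, 1)
    dim3 block(16, 16);                  // unspecified components default to 1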

CUDA provides four global, built-in variables: threadIdx, blockIdx, blockDim, gridDim
Typed as dim3 or uint3
Accessible only from device code
Cannot take their address
Cannot assign values to them
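As a sketch of how these combine in two dimensions (the kernel is illustrative):

    __global__ void Fill(float* image, int width, int height)
    {
        // 2D global coordinates of this thread within the whole grid.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            image[y * width + x] = 1.0f;
    }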


CUDA provides syntactic sugar to launch the execution of kernels


Func<<<GridDim, BlockDim>>>(Arguments, ...)

Func is a __global__ function

GridDim is a dim3 typed expression giving the size of the grid (i.e. the problem domain)

BlockDim is a dim3 typed expression giving the size of a thread block

The compiler turns this type of statement into a block of code that configures and launches the kernel
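Putting it together, a sketch of a complete launch (Scale, d_data, and the sizes are illustrative):

    int n = 4096;
    dim3 block(256);                 // 256 threads per block
    dim3 grid((n + 255) / 256);      // enough blocks to cover all n elements

    Scale<<<grid, block>>>(d_data, 2.0f, n);   // d_data is a device pointer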

Important Differences:
Runtime Library Functions
Classes, Structs, Unions


Code that runs on the device can't use the normal C/C++ Runtime Library functions
No printf, fread, malloc, etc.

There are a number of device-specific functions/intrinsics available:
__syncthreads
__mul24
atomicAdd, atomicCAS, atomicMin, ...

Most math functions have device equivalents
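A small sketch of these intrinsics in use (the kernel is illustrative; atomics require CUDA 1.1):

    // Count the positive elements of data, one atomicAdd per block.
    __global__ void CountPositive(const float* data, int* counter, int n)
    {
        __shared__ float s_tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        s_tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;
        __syncthreads();                 // barrier: the tile is fully loaded

        if (threadIdx.x == 0)
        {
            int count = 0;
            for (int j = 0; j < 256; ++j)
                if (s_tile[j] > 0.0f) ++count;
            atomicAdd(counter, count);   // safe concurrent global update
        }
    }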

There is no malloc or free function that can be called from device code
How can we allocate memory?
From the host
Using CUDA 1.1 atomics, write a custom allocator
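A minimal sketch of host-side allocation with the runtime API (h_data and the sizes are illustrative):

    float* d_data = 0;
    size_t bytes = 4096 * sizeof(float);

    cudaMalloc((void**)&d_data, bytes);                        // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device
    // ... launch kernels that use d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_data);                                          // release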

On a CUDA device, there is no stack:
By default, all function calls are inlined (can use __noinline__ to prevent this, CUDA 1.1)
All local variables and function arguments are stored in registers
No function recursion
No function pointers

CUDA supports some C++ features for device code, e.g. template functions (see the sketch below)
Classes are supported inside .cu source, but must be host-only
Structs/Unions work in device code as per C
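A short sketch of a template function in device code (names are illustrative):

    // The compiler instantiates and inlines a version of Clamp for each type T.
    template <typename T>
    __device__ T Clamp(T value, T lo, T hi)
    {
        return value < lo ? lo : (value > hi ? hi : value);
    }

    __global__ void ClampAll(float* data, float lo, float hi)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = Clamp(data[i], lo, hi);
    }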


CUDA source files end in .cu
Contain a mix of device and host code/data
Compiled by nvcc
nvcc is really a wrapper around a more complex compilation process

Input:
Normal .c, .cpp source files
CUDA .cu source code files
Output:
Object/executable code for the host
.cubin executable code for the device

For .c and .cpp files, nvcc invokes the native C/C++ compiler for the system (eg: gcc/cl) For .cu files, it is a little more complicated

To see the steps performed by nvcc, use the --dryrun and --keep command line options

[Figure: the nvcc tool chain for a .cu file. The source is preprocessed (cpp) and split by cudafe into host and device parts. The host part (.c, plus a generated .stub.c) is preprocessed again, compiled, and linked into the executable; the device part (.gpu.c) is compiled by nvopencc to .ptx, which ptxas assembles into a .cubin.]

How is a .cubin linked with the rest of the program? It can be:
Loaded as a file at runtime
Embedded in a data segment
Embedded as a resource
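For the load-at-runtime case, a minimal sketch using the CUDA driver API (the file and kernel names are illustrative):

    #include <cuda.h>

    int main()
    {
        CUdevice   device;
        CUcontext  context;
        CUmodule   module;
        CUfunction kernel;

        cuInit(0);
        cuDeviceGet(&device, 0);
        cuCtxCreate(&context, 0, device);

        // Load a compiled .cubin from disk and look up a kernel by name.
        cuModuleLoad(&module, "kernels.cubin");
        cuModuleGetFunction(&kernel, module, "Scale");
        return 0;
    }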

Programming is fun, until...
The program crashes
It produces the wrong result

CUDA programming is even less fun:
There is no debugger
There is no printf

But, there are many debugging techniques:
Debugging software (eg: gdb, Visual Studio)
printf


Debugging code on the device is very hard:
Can try to write intermediate results to memory and copy them back to the host to examine
Emulation mode
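A sketch of the write-intermediates-to-memory technique (the debug buffer is illustrative):

    // Record per-thread intermediate values in a global "debug" buffer,
    // then copy the buffer back to the host and print it there.
    __global__ void Compute(float* out, float* debug, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float intermediate = out[i] * 0.5f;   // value we want to inspect
        debug[i] = intermediate;              // stash it for the host
        out[i] = intermediate + 1.0f;
    }

On the host, copy the debug buffer back with cudaMemcpy and print it with ordinary printf.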


By using a compiler flag, you can emulate by running all the code on the host
Compiler Flag: --device-emulation
Good for most debugging: can use gdb/printf
Not a true emulation:
Race conditions, memory model differences, etc.

