EN600.407 Spring 2008, Johns Hopkins University. Copyright Matthew Bolitho 2008.
2/19/2008
[Figure: the CUDA hardware model. A multiprocessor contains scalar processors (SPs) and shared memory; each thread has local memory; constant, texture, and device memory reside off-chip.]
A kernel is executed as a grid
A grid is a collection of thread blocks
A thread block is a collection of threads
Thread blocks and threads are given unique identifiers
Identifiers can be 1D, 2D or 3D
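For example, in a one-dimensional launch each thread can combine its block and thread identifiers into a unique global index (a minimal sketch; the built-in variables used here are covered later in these notes):

    __global__ void Kernel (float* data)
    {
        // Unique global index: this thread is threadIdx.x within
        // block blockIdx.x, and each block holds blockDim.x threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];    // each thread touches one element
    }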
Tasks:
Install the CUDA runtime and CUDA SDK
Get it working!
Write a simple CUDA program: compute the sum, inner and outer products of a large vector (an outer-product sketch follows below)
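As a starting point for the outer product, each thread can compute one element of the n-by-n result (a minimal sketch, not a full solution; the function and parameter names are hypothetical):

    __global__ void OuterProduct (const float* a, const float* b,
                                  float* out, int n)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
        if (i < n && j < n)
            out[i * n + j] = a[i] * b[j];
    }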
Allows programmers to easily move existing code to CUDA
Lessens the learning curve
The extensions fall into two groups: Variables and Functions
__device__
Declares that a global variable is stored on the device
The data resides in global memory
Has the lifetime of the entire application
Accessible to all GPU threads
Accessible to the CPU via the API
__shared__
Declares that a global variable is stored on the device
The data resides in shared memory
Has the lifetime of the thread block
Accessible to all threads, with one copy per thread block
If not declared as volatile, writes made by one thread are not guaranteed to be visible to other threads unless a synchronization barrier is used
Not accessible from the CPU
__constant__
Declares that a global variable is stored on the device
The data resides in constant memory
Has the lifetime of the entire application
Accessible to all GPU threads (read only)
Accessible to the CPU via the API (read-write)
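The three qualifiers might be combined as follows (a minimal sketch; the variable names are hypothetical, and each block is assumed to hold 64 threads):

    __device__   float g_Results[4096];   // global memory; application lifetime
    __constant__ float c_Weights[64];     // constant memory; read-only on the device

    __global__ void Kernel (void)
    {
        __shared__ float s_Buffer[64];    // shared memory; one copy per thread block
        s_Buffer[threadIdx.x] = c_Weights[threadIdx.x];
        __syncthreads();                  // barrier makes the writes visible to all threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        g_Results[i] = s_Buffer[63 - threadIdx.x];
    }

On the host, cudaMemcpyToSymbol can be used to initialize c_Weights and to access g_Results via the API.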
__host__
Declares that a function is compiled for and executes on the host
Callable only from the host
Functions without any CUDA declspec are __host__ by default
Can use __host__ and __device__ together
__global__
Declares that a function is compiled for and executes on the device
Callable from the host
Used as the entry point from host to device
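For example (a minimal sketch; the names are hypothetical):

    // Compiled for both sides; callable from host and device code
    __host__ __device__ float Square (float x) { return x * x; }

    // The kernel: an entry point launched from the host
    __global__ void SquareAll (float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = Square (data[i]);
    }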
CUDA provides a set of built-in vector types: char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4
Can construct a vector type with a special function: make_{typename}(v0, v1, ...)
Can access the elements of a vector type with .x, .y, .z, .w: vecvar.x
dim3 is a special vector type
Same as uint3, except it can be constructed from a scalar to form a vector: (scalar, 1, 1)
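For example (a minimal sketch):

    float4 v   = make_float4 (1.0f, 2.0f, 3.0f, 4.0f);
    float  sum = v.x + v.y + v.z + v.w;   // element access via .x, .y, .z, .w
    dim3   block (256);                   // constructed from a scalar: (256, 1, 1)
    dim3   grid (64, 64);                 // unspecified components default to 1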
CUDA provides four global, built-in variables: threadIdx, blockIdx, blockDim, gridDim
Typed as a dim3 or uint3
Accessible only from device code
Cannot take their address
Cannot assign a value to them
Func<<<GridDim, BlockDim>>>(Arguments, ...)
GridDim is a dim3 typed expression giving the size of the grid (i.e. the problem domain)
BlockDim is a dim3 typed expression giving the size of each thread block
The compiler turns this type of statement into a block of code that configures and launches the kernel
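For example, launching the SquareAll kernel sketched earlier over N elements might look like this (a minimal sketch; N, the block size, and d_Data are assumptions):

    const int N = 1048576;
    dim3 blockDim (256);
    dim3 gridDim ((N + 255) / 256);   // round up so the grid covers all N elements
    SquareAll<<<gridDim, blockDim>>> (d_Data);   // d_Data: a device pointer allocated earlier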
Important Differences:
Runtime Library Functions
Classes, Structs, Unions
Code that runs on the device can't use the normal C/C++ Runtime Library functions
No printf, fread, malloc, etc.
There is no malloc or free function that can be called from device code
How can we allocate memory?
From the host
Using CUDA 1.1 atomics, write a custom allocator
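Allocating from the host typically looks like this (a minimal sketch using the runtime API; the sizes and names are assumptions):

    float* d_Data = 0;
    cudaMalloc ((void**) &d_Data, N * sizeof (float));   // allocate device memory
    cudaMemcpy (d_Data, h_Data, N * sizeof (float),
                cudaMemcpyHostToDevice);                 // copy input to the device
    Kernel<<<gridDim, blockDim>>> (d_Data);              // device code just uses the pointer
    cudaMemcpy (h_Data, d_Data, N * sizeof (float),
                cudaMemcpyDeviceToHost);                 // copy the results back
    cudaFree (d_Data);                                   // release the device memory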
On a CUDA device, there is no stack
By default, all function calls are inlined
Can use __noinline__ to prevent this (CUDA 1.1)
All local variables and function arguments are stored in registers
No function recursion
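For example (a minimal sketch; the function name is hypothetical):

    __device__ __noinline__ float Evaluate (float x)   // kept as a real call (CUDA 1.1)
    {
        return x * x + 1.0f;
    }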
Classes are supported inside .cu source files, but must be host only
Structs/Unions work in device code as in C
No function pointers
CUDA source files end in .cu
They contain a mix of device and host code/data
Compiled by nvcc
nvcc is really a wrapper around a more complex compilation process
Input: normal .c/.cpp source files, and CUDA .cu source files
Output: object/executable code for the host, and .cubin executable code for the device
For .c and .cpp files, nvcc invokes the native C/C++ compiler for the system (e.g. gcc/cl)
For .cu files, it is a little more complicated
[Diagram: the nvcc compilation pipeline. A .cu file is preprocessed (cpp) and split by cudafe into host code (.c, .stub.c) and device code (.gpu.c); nvopencc compiles the device code to .ptx, which ptxas assembles into a .cubin; the host compiler and linker combine the host objects (.o) with the cubin.]
To see the steps performed by nvcc, use the --dryrun and --keep command line options
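For example, nvcc --dryrun --keep program.cu (the file name is hypothetical) prints each step of the pipeline and preserves the intermediate files for inspection.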
How is a .cubin linked with the rest of the program? It can be:
Loaded as a file at runtime
Embedded in the data segment
Embedded as a resource
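Loading at runtime is done through the CUDA driver API (a minimal sketch; the file and kernel names are hypothetical, and cuInit plus a context are assumed to already exist):

    #include <cuda.h>

    CUmodule   module;
    CUfunction function;
    cuModuleLoad (&module, "kernels.cubin");          // load the .cubin from disk
    cuModuleGetFunction (&function, module, "Func");  // look up a kernel by name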
By using a compiler flag, you can emulate by running all the code on the host
Compiler Flag: --device-emulation
Good for most debugging: can use gdb/printf
Not a true emulation: race conditions, memory model differences, etc.
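Under emulation the kernel body runs as ordinary host code, so something like this works (a minimal sketch; __DEVICE_EMULATION__ is the preprocessor symbol nvcc defines in this mode):

    #include <stdio.h>

    __global__ void Kernel (float* data)
    {
        int i = threadIdx.x;
    #ifdef __DEVICE_EMULATION__
        printf ("thread %d sees %f\n", i, data[i]);   // emulation only
    #endif
        data[i] *= 2.0f;
    }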