
CUDA Programming Basics

Outline
• First CUDA Program
• Execution Configuration
• Kernel Launch
• Massively Parallel Hardware
• Parallel Execution Model
CUDA Programming Model vs. C Programming Model
[Diagram: in the C model a program runs only on the host (CPU, with RAM as host memory and user I/O); in the CUDA model the host is connected to the device on the graphics card (GPU, with DRAM as device memory), and arrays are copied between host memory and device memory.]
First CUDA Program
• Problem:
– Write a program in CUDA to find the squares of the first
500 whole numbers stored in an array.
– Serial implementation:
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int *a, i, N = 500;
    a = (int*) malloc(sizeof(int) * N);
    for(i = 0; i < N; i++) a[i] = i;
    for(i = 0; i < N; i++) a[i] = a[i] * a[i];
    for(i = 0; i < N; i++)
        printf("Square of %d = %d\n", i, a[i]);
    free(a);
    return 0;
}
Parallel Implementation
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void find_square(int *ad, int N);   // kernel, defined under "Kernel Code" below

int main()
{
    int *ad, *ah, i, N = 500;

    //allocate memory on host (ah) and device (ad)
    ah = (int*) malloc(sizeof(int) * N);
    cudaMalloc((void**) &ad, sizeof(int) * N);

    for(i = 0; i < N; i++) ah[i] = i;

    //copy data from host to device
    cudaMemcpy(ad, ah, sizeof(int) * N, cudaMemcpyHostToDevice);

    //launch CUDA kernel: one block of N threads
    find_square <<< 1, N >>> (ad, N);

    //copy data from device to host
    cudaMemcpy(ah, ad, sizeof(int) * N, cudaMemcpyDeviceToHost);

    for(i = 0; i < N; i++)
        printf("Square of %d = %d\n", i, ah[i]);

    cudaFree(ad);
    free(ah);
    return 0;
}
[Diagram: the host array ah lives in RAM (host memory); the device array ad lives in DRAM (device memory); cudaMemcpy moves the data between them.]
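Every CUDA runtime call in the program above returns an error code that the slide code ignores. A minimal sketch of how the same calls could be checked; the CUDA_CHECK macro is an assumption added here, not part of the original slides:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper (not in the original slides): abort with a message
// whenever a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            printf("CUDA error: %s\n", cudaGetErrorString(err_));   \
            exit(1);                                                \
        }                                                           \
    } while (0)

int main()
{
    int *ad, *ah, i, N = 500;
    ah = (int*) malloc(sizeof(int) * N);

    CUDA_CHECK(cudaMalloc((void**) &ad, sizeof(int) * N));
    for(i = 0; i < N; i++) ah[i] = i;
    CUDA_CHECK(cudaMemcpy(ad, ah, sizeof(int) * N, cudaMemcpyHostToDevice));

    // ... kernel launch goes here; cudaGetLastError() and
    // cudaDeviceSynchronize() can be checked the same way ...

    CUDA_CHECK(cudaMemcpy(ah, ad, sizeof(int) * N, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(ad));
    free(ah);
    return 0;
}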
Execution Configuration
find_square <<< 1, N >>> (ad, N);

The first parameter gives the dimension and size of the grid of blocks: (1, 1). The second gives the dimension and size of each block of threads: (N, 1, 1).

For this launch the grid contains a single block of N threads, so every thread sees:
threadIdx.x = 0, 1, 2, 3, 4, ..., N-2, N-1
threadIdx.y = 0, threadIdx.z = 0
blockIdx.x = 0, blockIdx.y = 0
Kernel Code
__global__ void find_square(int *ad, int N)
{
    int index = threadIdx.x;
    if(index < N)
        ad[index] = ad[index] * ad[index];
}
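A single thread block is limited in size (512 threads on the older compute-capability-1.x cards mentioned later, 1024 on newer GPUs), so the <<< 1, N >>> launch above stops working for large N. A minimal sketch, not part of the original slides, of the usual multi-block variant:

// Sketch (assumed variant): the same kernel written so that N can exceed
// the per-block thread limit by spreading the work over several blocks.
__global__ void find_square_blocks(int *ad, int N)
{
    // global thread index across all blocks of the grid
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index < N)
        ad[index] = ad[index] * ad[index];
}

// Launch with enough 256-thread blocks to cover all N elements:
// find_square_blocks <<< (N + 255) / 256, 256 >>> (ad, N);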

Function type qualifiers: __host__ code runs on the CPU; __device__ and __global__ code runs on the GPU (__global__ marks a kernel, which is launched from the host).
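A small sketch of the three qualifiers; the square() helper and find_square_v2() are assumed examples, not taken from the slides:

__host__ __device__ int square(int x)           // compiled for both CPU and GPU
{
    return x * x;
}

__global__ void find_square_v2(int *ad, int N)  // kernel: runs on the GPU, launched from the host
{
    int index = threadIdx.x;
    if(index < N)
        ad[index] = square(ad[index]);          // call made on the device
}

// An unqualified (or __host__) function such as main() is ordinary CPU code;
// it can call square() directly and launch find_square_v2 with <<< ... >>>.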
More on Execution Configuration
dim3 grid_spec(4, 3);
dim3 block_spec(2, 2, 1);
my_function <<< grid_spec, block_spec >>> ();

[Diagram: a grid of blocks with gridDim.x = 4 and gridDim.y = 3; each block of threads with blockDim.x = 2 and blockDim.y = 2.]
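A hedged sketch of how a thread could recover its 2-D position under this configuration; the kernel body is an assumption, since the slides leave my_function empty:

// grid_spec(4, 3)     -> gridDim.x = 4, gridDim.y = 3            (12 blocks)
// block_spec(2, 2, 1) -> blockDim.x = 2, blockDim.y = 2, blockDim.z = 1 (4 threads per block)
__global__ void my_function()
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 7
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. 5
    // 12 blocks * 4 threads = 48 threads in total; a real kernel
    // would use (x, y) to index a 2-D data set.
    (void)x; (void)y;
}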


Massively Parallel Hardware

[Diagram: the GPU contains multiprocessors MP0 ... MPn; each multiprocessor holds 8 streaming processors (SP), special function units (SFU), and a shared memory.]

GeForce 9600 GT has 8 multiprocessors (MP) = 8 * 8 = 64 cores (SP).
GeForce GTX 260 has 24 multiprocessors (MP) = 24 * 8 = 192 cores (SP).
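The multiprocessor count differs from card to card; a small sketch (not part of the slides) that queries it at run time through the CUDA runtime API:

#include <stdio.h>
#include <cuda.h>

// Query device 0 and print its multiprocessor count instead of hard-coding it.
int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    printf("%s: %d multiprocessors, warp size %d, max %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.warpSize, prop.maxThreadsPerBlock);
    return 0;
}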
Parallel Execution Model
• Threads are executed by thread processors (streaming processors / cores, SP).
• Thread blocks are executed on multiprocessors (MP). The smallest scheduling unit is a warp.
• A kernel is launched as a grid of thread blocks, which is distributed across the whole GPU.
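A worked example (my own arithmetic, assuming the warp size of 32 used by NVIDIA GPUs): the launch find_square <<< 1, N >>> with N = 500 puts 500 threads in one block, which the multiprocessor schedules as ceil(500 / 32) = 16 warps; the last warp has only 500 - 15 * 32 = 20 active threads.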
Queries ???
