
Lecture 2.

Kernel-based Parallel Programming - Control Divergence

Objective

To understand the implications of warp scheduling on control flow:
- Nature of control flow
- Warp partitioning
- Control divergence

Warps as Scheduling Units


Block 1 Warps

t0 t1 t2 t31

2 Warps
t0Block
t1 t2 t31

Block 1 Warps

t0 t1 t2 t31

Each block is divided into 32-thread warps
- An implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- Threads in a warp execute in SIMD fashion
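The warp a thread belongs to is not exposed directly by the programming model, but for a 1-D block it follows from integer division by the 32-thread warp size. A minimal sketch, assuming a 1-D block; showWarpMapping, warpId, and laneId are illustrative names, not CUDA built-ins:

#include <cstdio>

// Each warp-leader thread reports where its warp starts within the block.
__global__ void showWarpMapping(void)
{
    int warpId = threadIdx.x / 32;   // which warp of the block this thread is in
    int laneId = threadIdx.x % 32;   // position of the thread within that warp
    if (laneId == 0)
        printf("block %d: warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main(void)
{
    showWarpMapping<<<1, 128>>>();   // one block of 128 threads = 4 warps
    cudaDeviceSynchronize();         // wait so device-side printf output is flushed
    return 0;
}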

Back to Von-Neumann

Every instruction needs to be fetched from memory, decoded, then executed.
Instructions come in three flavors: Operate, Data Transfer, and Program Control Flow.
An example instruction cycle is the following:
Fetch | Decode | Execute | Memory

Control Flow Operations

Example of a control flow instruction:

BRp #-4
If the condition is positive, jump back four instructions.

How thread blocks are partitioned

Thread blocks are partitioned into warps
- Partitioning is always the same
- Thread IDs within a warp are consecutive and increasing
- Warp 0 starts with Thread ID 0
- Thus you can use this knowledge in control flow
- However, the exact size of warps may change from generation to generation (covered next)

DO NOT rely on any ordering within or between warps
- If there are any dependencies between threads, you must __syncthreads() to get correct results (more later); see the sketch below.
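As an illustration of the last two points, the sketch below (not from the lecture; the kernel and its names are hypothetical) reverses one tile of data through shared memory. Threads in different warps may run in any order, so without the __syncthreads() barrier a thread could read a slot that a thread in another warp has not yet written.

#define BLOCK_SIZE 256   // launch with exactly BLOCK_SIZE threads per block

// Reverses each block-sized tile of d_in into d_out via shared memory.
__global__ void reverseTile(int *d_out, const int *d_in)
{
    __shared__ int tile[BLOCK_SIZE];

    int t = threadIdx.x;
    tile[t] = d_in[blockIdx.x * blockDim.x + t];

    // Wait until the whole tile is written before any thread reads
    // an element written by a thread in another warp.
    __syncthreads();

    d_out[blockIdx.x * blockDim.x + t] = tile[BLOCK_SIZE - 1 - t];
}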

Control Flow Instructions

Main performance concern with branching is divergence
- Threads within a single warp take different paths
- Different execution paths are serialized in current GPUs

The control paths taken by the threads in a warp are traversed one at a time until there are no more.

Control Divergence Examples

Divergence can arise only when the branch condition is a function of thread indices.

Example with divergence:
if (threadIdx.x > 2) { }
- This creates two different control paths for threads in a block
- Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp

Example without divergence:
if (blockIdx.x > 2) { }
- Branch granularity is a multiple of block size; all threads in any given warp follow the same path
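Written out as complete (if trivial) kernels, the contrast looks as follows; the kernel names and the branch bodies are placeholders, not part of the lecture:

// Divergent: within the first warp of each block, threads 0-2 take one path
// and threads 3-31 take the other, so that warp runs both paths serially.
__global__ void divergentKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x > 2) {
        data[i] *= 2.0f;   // threads 3 .. blockDim.x-1
    } else {
        data[i] = 0.0f;    // threads 0, 1, 2
    }
}

// Non-divergent: the condition depends only on blockIdx.x, so every thread
// in a given warp (indeed in a given block) evaluates it the same way.
__global__ void uniformKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x > 2) {
        data[i] *= 2.0f;
    }
}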

Example: Vector Addition Kernel

Device Code

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
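The analysis below assumes a 1-D launch with 256 threads per block, so 1,000 elements need 4 blocks covering thread indices 0-1023. A possible host-side launch under that assumption (allocation and copies omitted):

Host Code (sketch)

int n = 1000;
int threadsPerBlock = 256;                                  // assumed block size
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // = 4
vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);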

Analysis for vector size of 1,000 elements

All threads in Blocks 0, 1, and 2 are within the valid range
- i values from 0 to 767
Block 3 will have control divergence
- 1st group: i values from 768 to 999
- 2nd group: i values from 1000 to 1023
The performance penalty to blocks with no divergence is very small
- Only the last block will have divergence
- The overall performance penalty is small as long as there is a large number of elements.
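Tracking the boundary at warp granularity (still assuming 256-thread blocks) shows that only the warp straddling i = 1000 mixes valid and out-of-range indices; the small host-side check below is purely illustrative:

#include <cstdio>

int main(void)
{
    const int n = 1000, threadsPerBlock = 256, warpSize = 32;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // 4

    // A warp diverges on "if (i < n)" only if it contains both valid and
    // out-of-range global indices, i.e. it straddles the boundary at n.
    for (int w = 0; w < blocks * threadsPerBlock / warpSize; ++w) {
        int first = w * warpSize, last = first + warpSize - 1;
        if (first < n && last >= n)
            printf("warp %d (i = %d..%d) diverges\n", w, first, last);
    }
    return 0;
}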

TO LEARN MORE, READ SECTION 4.3
