
The “New” Moore’s Law

• Computers no longer get faster, just wider

• You must re-think your algorithms to be parallel !

• Data-parallel computing is most scalable solution


The “New” Moore’s Law
Enter the GPU
• Massive economies of scale

• Massively parallel
Graphical processors
• The graphics processing unit (GPU) on commodity
video cards has evolved into an extremely flexible
and powerful processor
 Programmability
 Precision
 Power
• GPGPU: an emerging field seeking to harness GPUs
for general-purpose computation

Parallel Computing on a GPU

• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
 Available in laptops, desktops, and clusters

• GPU parallelism is doubling every year

• Programming model scales transparently

• Multithreaded SPMD model uses application data parallelism and thread parallelism

[Product images: GeForce 8800, Tesla D870, Tesla S870]
Computational Power
• GPUs are fast…
 3.0 GHz dual-core Pentium4: 24.6 GFLOPS
 NVIDIA GeForceFX 7800: 165 GFLOPS
 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s
 ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
 CPUs: 1.4× annual growth
 GPUs: 1.7×(pixels) to 2.3× (vertices) annual growth

CPU vs GPU
Flexible and Precise
• Modern GPUs are deeply programmable
 Programmable pixel, vertex, video engines
 Solidifying high-level language support
• Modern GPUs support high precision
 32 bit floating point throughout the pipeline
 High enough for many (not all) applications

GPU for graphics
• GPUs designed for & driven by video games
 Programming model unusual
 Programming idioms tied to computer graphics
 Programming environment tightly constrained
• Underlying architectures are:
 Inherently parallel
 Rapidly evolving (even in basic feature set!)
 Largely secret

General purpose GPUs
• The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
• Example applications range from in-game physics
simulation to conventional computational science
• Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor

Previous GPGPU Constraints
• Dealing with the graphics API
 Working with the corner cases of the graphics API
• Addressing modes
 Limited texture size/dimension
• Shader capabilities
 Limited outputs
• Instruction sets
 Lack of integer & bit ops
• Communication limited
 Between pixels

[Figure: fragment-program execution model — per-thread Input, Temp, and Output Registers; per-shader Texture; per-context Constants; output to FB Memory]
Enter CUDA
• Scalable parallel programming model

• Minimal extensions to familiar C/C++ environment

• Heterogeneous serial-parallel computing


Sound Bite

GPUs + CUDA
=
The Democratization of Parallel Computing

Massively parallel computing has become a commodity technology
MOTIVATION

Example speedups over CPU implementations:
 146X – Interactive visualization of volumetric white matter connectivity
 36X – Ionic placement for molecular dynamics simulation on GPU
 19X – Transcoding HD video stream to H.264
 17X – Fluid mechanics in Matlab using .mex file CUDA function
 100X – Astrophysics N-body simulation
 149X – Financial simulation of LIBOR model with swaptions
 47X – GLAME@lab: an M-script API for GPU linear algebra
 20X – Ultrasound medical imaging for cancer diagnostics
 24X – Highly optimized object oriented molecular dynamics
 30X – Cmatch exact string matching to find similar proteins and gene sequences

[Figure: CPU-only vs. heterogeneous (CPU + Tesla GPU) runtimes for Computational Chemistry, Neurological Modeling, Cell Phone RF Simulation, and 3D CT Ultrasound — days/hours (4.6 days, 2.7 days, 8 hours, 3 hours) reduced to minutes (13–30 minutes)]

GPUs: Turning Point in Supercomputing

Desktop beats Cluster: 4 GPUs vs. 256 CPUs

 CalcUA cluster (256 CPUs): $5 Million
 Tesla Personal Supercomputer (4 GPUs): $10,000

Source: University of Antwerp, Belgium

CUDA: ‘C’ FOR PARALLELISM
void saxpy_serial(int n, float a, float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
(Standard C code)

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
(Parallel C code)
So far, today..
• GPU – powerful coprocessors
• CUDA – programming model for GPU
• Easier to parallelize on GPUs
• CUDA extends GPU to general purpose computing

• Now we shall look at thread programming and the memory structure on the GPU
Hierarchy of concurrent threads
• Parallel kernels composed of many threads
 all threads execute the same sequential program

• Threads are grouped into thread blocks
 threads in the same block can cooperate

• Threads/blocks have unique IDs

[Figure: Thread t; Block b containing threads t0 t1 … tB; Kernel foo() launched as a grid of blocks]
Hierarchical organization

[Figure: each thread has per-thread local memory; each block has per-block shared memory with a local barrier; kernels (Kernel 0, Kernel 1) share per-device global memory, with a global barrier between dependent kernels]
Heterogeneous Programming
• CUDA = serial program with parallel kernels, all in C
 Serial C code executes in a CPU thread
 Parallel kernel C code executes in thread blocks across multiple processing elements

[Figure: alternating phases — Serial Code, then Parallel Kernel foo<<<nBlk, nTid>>>(args);, then Serial Code, then Parallel Kernel bar<<<nBlk, nTid>>>(args);]
What is a thread?
• Independent thread of execution
 has its own PC, variables (registers), processor state,
etc.
 no implication about how threads are scheduled
• CUDA threads might be physical threads
 as on NVIDIA GPUs
• CUDA threads might be virtual threads
 might pick 1 block = 1 physical thread on multicore
CPU
What is a thread block?
• Thread block = virtualized multiprocessor
 freely choose processors to fit data
 freely customize for each kernel launch

• Thread block = a (data) parallel task


 all blocks in kernel have the same entry point
 but may execute any code they want

• Thread blocks of kernel must be independent tasks


 program valid for any interleaving of block executions
Blocks must be independent
• Any possible interleaving of blocks should be valid
 presumed to run to completion without pre-emption
 can run in any order
 can run concurrently OR sequentially

• Blocks may coordinate but not synchronize


 shared queue pointer: OK
 shared lock: BAD … can easily deadlock

• Independence requirement gives scalability


Levels of parallelism
• Thread parallelism
 each thread is an independent thread of execution

• Data parallelism
 across threads in a block
 across blocks in a kernel

• Task parallelism
 different blocks are independent
 independent kernels
Block = virtualized multiprocessor
• Provides programmer flexibility
 freely choose processors to fit data
 freely customize for each kernel launch

• Thread block = a (data) parallel task


 all blocks in kernel have the same entry point
 but may execute any code they want

• Thread blocks of kernel must be independent tasks


 program valid for any interleaving of block executions
Scalable Execution Model

Kernel launched by host → blocks run on multiprocessors

[Figure: a grid of thread blocks distributed across multiprocessors; each multiprocessor has a multithreaded instruction unit (MT IU), scalar processors (SP), and shared memory; all multiprocessors access device memory]
Synchronization & Cooperation
• Threads within a block may synchronize with barriers
 … Step 1 …
 __syncthreads();
 … Step 2 …
• Blocks coordinate via atomic memory operations
 e.g., increment a shared queue pointer with atomicInc() (see the sketch below)
• Implicit barrier between dependent kernels
 vec_minus<<<nblocks, blksize>>>(a, b, c);
 vec_dot<<<nblocks, blksize>>>(c, c);
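As an illustration of block-level coordination via atomics (not from the original slides; the kernel name, the queue counter, and the per-item work are hypothetical):

// Global work-queue pointer, shared by all blocks
__device__ unsigned int queueHead = 0;

__global__ void workQueueKernel(int numItems)
{
    __shared__ unsigned int myItem;
    while (true) {
        // one thread per block grabs the next item index
        if (threadIdx.x == 0)
            myItem = atomicInc(&queueHead, 0xFFFFFFFFu);  // returns the old value
        __syncthreads();
        if (myItem >= numItems) break;        // queue exhausted: all threads exit together
        // ... all threads of the block cooperate on item 'myItem' here ...
        __syncthreads();                      // finish this item before grabbing the next
    }
}

Blocks never wait on each other; they only draw work from the shared counter, which keeps the independence requirement intact.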
CUDA Memories
G80 Implementation of CUDA Memories
• Each thread can:
 Read/write per-thread registers
 Read/write per-thread local memory
 Read/write per-block shared memory
 Read/write per-grid global memory
 Read-only per-grid constant memory
• The host can R/W global, constant, and texture memories

[Figure: a grid with Block (0,0) and Block (1,0); each block has shared memory and per-thread registers; global and constant memory are shared by all blocks and accessible from the host]
[Figure: per-thread local memory and per-block shared memory; grids (Grid 0, Grid 1, …) execute sequentially in time and share global memory]
Memory model

[Figure: host memory connected to the device memories of Device 0 and Device 1]
A Common Programming Strategy
• Global memory resides in device memory (DRAM) - much
slower access than shared memory
• So, a profitable way of performing computation on the device is
to tile data to take advantage of fast shared memory:
 Partition data into subsets that fit into shared memory
 Handle each data subset with one thread block by:
 Loading the subset from global memory to shared memory, using
multiple threads to exploit memory-level parallelism
 Performing the computation on the subset from shared memory; each
thread can efficiently multi-pass over any data element
 Copying results from shared memory to global memory
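A minimal sketch of this pattern (illustrative, not from the slides): each block stages its subset of the input in shared memory, computes on it there, and writes one result back to global memory. BLOCK_SIZE, blockSum, and partialSums are assumed names.

#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE elements of 'in' and writes its partial sum
// to partialSums[blockIdx.x]; launch with (n + BLOCK_SIZE - 1) / BLOCK_SIZE blocks.
__global__ void blockSum(const float *in, float *partialSums, int n)
{
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    // 1. Load the block's subset from global memory into shared memory
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // 2. Compute on the subset in shared memory (tree reduction)
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    // 3. Copy the block's result back to global memory
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = tile[0];
}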

A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory
(DRAM) - much slower access than shared memory
 But… cached!
 Highly efficient access for read-only data
• Carefully divide data according to access patterns
 R/Only  constant memory (very fast if in cache)
 R/W shared within Block  shared memory (very fast)
 R/W within each thread  registers (very fast)
 R/W inputs/results  global memory (very slow)
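A hedged illustration of the read-only case (the array and kernel names are hypothetical): small read-only coefficients that every thread uses can live in constant memory, declared with __constant__ and filled from the host with cudaMemcpyToSymbol().

// Read-only coefficients, cached in constant memory
__constant__ float c_coeff[16];

__global__ void scaleByCoeff(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * c_coeff[i % 16];   // every thread reads the cached constants
}

// Host side, before the launch:
//   float h_coeff[16] = { ... };
//   cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));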

Is that all??
• No!!
• Memory Coalescing
• Bank conflicts
Memory Coalescing
• When accessing global memory, peak performance utilization occurs when all threads access contiguous memory locations.

[Figure: two threads traversing matrices Md and Nd of width WIDTH — one access pattern is not coalesced, the other is coalesced]
Memory Layout of a Matrix in C

[Figures: a 4×4 matrix M stored in row-major order (M0,0 M1,0 M2,0 M3,0 M0,1 … M3,3), with the kernel's access direction drawn over it. Two access patterns are contrasted across time periods for threads T1–T4: in one, the threads touch locations that are far apart in the linear layout (not coalesced); in the other, they touch adjacent locations (coalesced).]
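A minimal sketch of the two patterns for a row-major matrix of width 'width' (kernel names are illustrative): letting consecutive threads read consecutive addresses in each iteration is coalesced; letting each thread walk its own row is not.

// Coalesced: in iteration k, neighbouring threads read neighbouring floats.
__global__ void sumColumns(const float *m, float *colSum, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float s = 0.0f;
    for (int k = 0; k < width; ++k)
        s += m[k * width + col];     // adjacent threads -> adjacent addresses
    colSum[col] = s;
}

// Not coalesced: in iteration k, neighbouring threads are a full row apart.
__global__ void sumRows(const float *m, float *rowSum, int width)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= width) return;
    float s = 0.0f;
    for (int k = 0; k < width; ++k)
        s += m[row * width + k];     // adjacent threads -> addresses 'width' apart
    rowSum[row] = s;
}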
Parallel Memory Architecture for Shared Memory
• In a parallel machine, many threads access memory
 Therefore, memory is divided into banks
 Essential to achieve high bandwidth
• Each bank can service one address per cycle
 A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
 Conflicting accesses are serialized

[Figure: shared memory organized as Bank 0 … Bank 15]
Bank Addressing Examples
• No bank conflicts
 Linear addressing, stride == 1 (thread i → bank i)
• No bank conflicts
 Random 1:1 permutation (each thread maps to a distinct bank)

[Figure: Threads 0–15 mapping onto Banks 0–15 in both cases]
Bank Addressing Examples
• 2-way bank conflicts
 Linear addressing, stride == 2 (two threads map to each active bank)
• 8-way bank conflicts
 Linear addressing, stride == 8 (eight threads map to each active bank)

[Figure: Threads 0–15 mapping onto Banks 0–15 with stride 2 and stride 8]
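A hedged sketch of how stride shows up in shared-memory code (the kernel and the padding trick are illustrative): with 16 banks of 32-bit words, a stride-1 access within a half-warp is conflict-free, a stride-16 access puts all 16 threads into the same bank, and padding each row by one element removes the conflict.

// Launch with a 16x16 thread block.
__global__ void bankConflictDemo(float *out)
{
    __shared__ float tile[16][16];       // column reads below have stride 16 words
    __shared__ float tilePadded[16][17]; // padding shifts rows onto different banks

    int tx = threadIdx.x, ty = threadIdx.y;
    tile[ty][tx] = tx + ty;              // stride-1 write: no conflicts
    tilePadded[ty][tx] = tx + ty;
    __syncthreads();

    float a = tile[tx][ty];              // stride-16 read: 16-way bank conflict
    float b = tilePadded[tx][ty];        // padded read: conflict-free
    out[ty * 16 + tx] = a + b;
}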
Summary of CUDA programming tips
• Divide the overall task between concurrent non-communicating threads
• Design coalesced accesses to global memory
• Avoid bank conflicts when accessing shared memory
Programming on CUDA
Basic steps
• Transfer data from CPU to GPU
• Explicitly launch the designed GPU kernel
 CUDA will implicitly assign threads to each
multiprocessor, and assign resources for computations
• Transfer results back from GPU to CPU
CPU vs GPU
• CPU – operation intensive
 Goal: reduce number of operations performed at the
expense of additional memory access
• GPU – memory intensive
 Goal: reduce the number of memory accesses at the
expense of additional operations.
Memory model

[Figure: host memory connected to the device memories of Device 0 and Device 1 via cudaMemcpy()]
CUDA: Minimal extensions to C/C++
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel callable from host
__device__ void DeviceFunc(...); // function callable on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // in per-block shared memory

• Extend function invocation syntax for parallel kernel launch


KernelFunc<<<500, 128>>>(...); // 500 blocks, 128 threads each

• Special variables for thread identification in kernels


dim3 threadIdx; dim3 blockIdx; dim3 blockDim;

• Intrinsics that expose specific operations in kernel code


__syncthreads(); // barrier synchronization
CUDA: Features available on GPU
• Standard mathematical functions
 sinf,
 powf,
 atanf,
 ceil,
 min,
 sqrtf,
 expf
 erfc
 And many more standard mathematical functions
CUDA: Runtime support
• Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()

• Explicit memory copy for host ↔ device, device ↔ device


cudaMemcpy(), cudaMemcpy2D(), ...

• Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...

• OpenGL & DirectX interoperability


cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}

int main()
{
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
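The slide stops after the kernel launch; a minimal, hedged sketch of the remaining host-side steps (assuming a host result buffer h_C) would be:

// copy the result back to the host
float *h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// free device memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);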
CUDA libraries – CUFFT & CUBLAS
CUBLAS and CUFFT
• Standard libraries with development kit
• CUBLAS
 CUDA version of BLAS
 Available for single- and double- precision for real and
complex numbers
 Double version – slower
• CUFFT
 FFT & IFFT on CUDA
 Faster than the fastest CPU algorithm
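As a minimal CUFFT sketch (illustrative only; a 1-D single-precision complex-to-complex transform on data already resident in device memory, with no error checking):

#include <cufft.h>

// d_signal: N cufftComplex values already in device memory
void fft_roundtrip(cufftComplex *d_signal, int N)
{
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // one transform of length N
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_INVERSE);  // inverse FFT (unnormalized)
    cufftDestroy(plan);
}

Note that the inverse transform is unnormalized, so the round trip returns the input scaled by N.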
CUBLAS
• 3 classes
1. Vector operations
 Vector addition, norm, dot-product, etc
2. Matrix-vector operations
 Matrix-vector product for symmetric and normal
matrices, etc
3. Matrix-matrix operations
 Matrix multiplication, etc
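A minimal sketch of a class-1 (vector) operation using the original handle-free CUBLAS API of that era (illustrative; error checking omitted):

#include <cublas.h>

// Dot product of two length-n host vectors computed on the GPU
float gpu_dot(const float *h_x, const float *h_y, int n)
{
    float *d_x, *d_y;
    cublasInit();
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);   // host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
    float result = cublasSdot(n, d_x, 1, d_y, 1);        // class 1: vector-vector
    cublasFree(d_x); cublasFree(d_y);
    cublasShutdown();
    return result;
}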
Advantage
• Highly optimized design
• Usable as standard C/C++/Fortran libraries

• Caters to the needs of many scientific computing tasks


OpenCL

[Figure: CUDA as an architecture for massively parallel computing, alongside ATI’s compute “solution”]

OpenCL vs. C for CUDA
 C for CUDA – entry point for developers who prefer high-level C
 OpenCL – entry point for developers who want a low-level API
 Both share the same back-end compiler & optimization technology, generating PTX for the GPU
Recall: GPU and CUDA
• GPU – developed for accelerating graphics
• CUDA – developed to harness the power of GPUs for
general purpose applications
 Like C in syntax
• GPU – not a panacea
 Used in a master-slave scenario with CPU (host) as
master
Recall: GPU memories
• Each thread can:
 Read/write per-thread registers
 Read/write per-thread local memory
 Read/write per-block shared memory
 Read/write per-grid global memory
 Read-only per-grid constant memory
• Divide data according to access patterns
 R/Only  constant memory (very fast if in cache)
 R/W shared within Block  shared memory (very fast)
 R/W within each thread  registers (very fast)
 R/W inputs/results  global memory (very slow)

[Figure: grid of blocks with per-block shared memory and per-thread registers; global and constant memory accessible from the host]
Recall: Thread organization

[Figure: per-thread local memory and per-block shared memory; grids (Grid 0, Grid 1, …) execute sequentially in time and share global memory]
Recall: Heterogeneous programming
// CPU code
cudaMalloc()                   // allocate memory on the device
cudaMemcpy()                   // transfer input data to the device
Kernel<<<blocks, threads>>>()  // call CUDA kernels
                               // kernels are functions evaluated on a single thread
cudaMemcpy()                   // transfer results from the device

Keywords: __global__, __shared__, __device__
Special math functions: sinf, expf, min, etc.
Case Study: Matrix Multiplication
Matrix Multiplication Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column index of Pd and N
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

Pd[Row*Width+Col] = Pvalue;
}
How about performance on G80?
• All threads access global memory for their input matrix elements
 Two memory accesses (8 bytes) per floating-point multiply-add
 4 B of memory bandwidth per FLOP
 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Figure: grid of blocks with shared memory and registers; every operand is fetched from global memory]
Use Shared Memory to reuse global memory data
• Each input element is read by Width threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
 Tiled algorithms

[Figure: matrices M, N, and P of width WIDTH; thread (tx, ty) of P reuses a row of M and a column of N]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH × TILE_WIDTH tiles; block (bx, by), thread (tx, ty) computes one element of the sub-matrix Pdsub]
A Small Example

[Figure: a 4×4 Pd computed from Md (Md0,0 … Md3,1) and Nd (Nd0,0 … Nd1,3); the 2×2 tile Pd0,0, Pd1,0, Pd0,1, Pd1,1 is highlighted]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
Every Md and Nd element is used exactly twice in generating a 2×2 tile of P

Access order (each column is one thread):
 P0,0 (thread0,0): M0,0*N0,0, M1,0*N0,1, M2,0*N0,2, M3,0*N0,3
 P1,0 (thread1,0): M0,0*N1,0, M1,0*N1,1, M2,0*N1,2, M3,0*N1,3
 P0,1 (thread0,1): M0,1*N0,0, M1,1*N0,1, M2,1*N0,2, M3,1*N0,3
 P1,1 (thread1,1): M0,1*N1,0, M1,1*N1,1, M2,1*N1,2, M3,1*N1,3
Breaking Md and Nd into Tiles

[Figure: the same 4×4 example with Md, Nd, and Pd partitioned into 2×2 tiles]
First-order Size Considerations in G80
• Each thread block should have many threads
 TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
 A 1024*1024 Pd gives 64*64 = 4096 thread blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
 Memory bandwidth is no longer a limiting factor
CUDA Code – Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width+Col] = Pvalue;
}
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: block (bx, by) iterates over the m tiles of Md and Nd along index k, accumulating its Pdsub tile]
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
 SM size is implementation dependent!
 For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
 Can potentially have up to 8 thread blocks actively executing
 This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
 The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
 The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Tiling Size Effects

[Figure: GFLOPS (0–100) achieved by the untiled kernel vs. tiled and tiled & unrolled kernels for 4x4, 8x8, 12x12, and 16x16 tiles]
Typical Structure of a CUDA Program
• Global variables declaration
 __host__
 __device__... __global__, __constant__, __texture__
• Function prototypes
 __global__ void kernelOne(…)
 float handyFunction(…)
• Main()
 allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
 transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
 execution configuration setup
 kernel call – kernelOne<<<execution configuration>>>( args… );
 transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
 (repeat the above steps as needed)
 optional: compare against golden (host computed) solution
• Kernel – void kernelOne(type args,…)
 variables declaration - __local__, __shared__
 automatic variables transparently assigned to registers or local memory
 __syncthreads()…
• Other functions
 float handyFunction(int inVar…);
GPU for Machine learning
Machine learning
• With improved sensors, the amount of data available has increased severalfold over the past decade.
• Also, more robust and sophisticated learning
algorithms have been developed to extract
meaningful information from the data
• This has resulted in the application of these
algorithms in many areas:
 Geostatistics, astronomical predictions, weather data
assimilations, computational finance.
Extracting information from the data
• “Extracting information from the data” means converting the
raw data to an interpretable version
 For example, given a face image, it would be desirable to extract the identity of the person, the face pose, etc.
• Information extraction categories
 Regression – [fitting a continuous function]
 Classification – [classify into one of the predefined classes]
 Density estimation – [evaluating the class membership]
 Ranking – [preference relationships between classes]
• Bottom-line: Infer the relationships based on the data
 Build the relationship model from the data
Relationship modeling
• There are two primary categories of the models
 Parametric
 Non-parametric
• Parametric model:
 Assumes a known parametric form of the “relationship”
 Estimates the parameters of this “form” from the data
• Non-parametric model
 Does not make any assumptions on the form of the underlying function.
 “Letting the data speak for itself”
Kernel methods
• A class of robust non-parametric learning methods
• Projects the data into a higher dimensional space
• Formulates the problem such that only the inner product of the higher
dimension features are required
• The inner-products are given by the kernel functions
• For example, the Gaussian kernel is given by K(x, y) = exp( − Σ_k (x_k − y_k)² / h_k² ), with per-dimension bandwidths h_k (the form used in the GPU code later)
Scalable learning methods
• Most of these kernel-based learning approaches scale as O(N²) or O(N³) in time with respect to the data size

• There is also an O(N²) memory requirement in many of these


• This is undesirable for very large datasets
• We would like to develop a parallelized version on GPU
Kernel methods on GPUs
• There are problems where summations of kernel
functions need to be evaluated
 Algorithm must map summation to multiple threads
• Some problems require the solution of linear system
involving kernel matrices
 Possibly use the kernel summation before with popular
iterative approaches like conjugate gradient
• There also exist problems where popular matrix decompositions like LU need to be performed on kernel matrices
 A number of approaches already exist on GPUs
Solving Ky=b
• Can use iterative methods
 Conjugate Gradient
• Each iteration requires evaluating the matrix-vector product Kx

• We will discuss the matrix-vector product now (a host-side solver sketch follows below)
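A host-side sketch of the solver (illustrative only; the MatVec callback stands in for the GPU kernel-summation routine developed below, and the plain loops could be replaced by CUBLAS calls):

#include <stdlib.h>

/* Apply the kernel matrix K (N x N, symmetric positive definite) to a vector. */
typedef void (*MatVec)(const float *v, float *Kv, int N, void *ctx);

/* Solve (K + sigma2*I) y = b with plain conjugate gradient. */
void conjugate_gradient(MatVec A, void *ctx, const float *b, float *y,
                        int N, float sigma2, int maxIter, float tol)
{
    float *r  = (float*) malloc(N * sizeof(float));
    float *p  = (float*) malloc(N * sizeof(float));
    float *Ap = (float*) malloc(N * sizeof(float));
    float rsold = 0.0f;
    for (int i = 0; i < N; i++) { y[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; rsold += r[i]*r[i]; }

    for (int it = 0; it < maxIter && rsold > tol*tol; it++) {
        A(p, Ap, N, ctx);                                    /* Ap = K p (GPU call) */
        for (int i = 0; i < N; i++) Ap[i] += sigma2 * p[i];  /* regularization term */
        float pAp = 0.0f;
        for (int i = 0; i < N; i++) pAp += p[i] * Ap[i];
        float alpha = rsold / pAp, rsnew = 0.0f;
        for (int i = 0; i < N; i++) { y[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; rsnew += r[i]*r[i]; }
        for (int i = 0; i < N; i++) p[i] = r[i] + (rsnew/rsold) * p[i];
        rsold = rsnew;
    }
    free(r); free(p); free(Ap);
}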


Kernel matrix – special structure
• O(N) dependence
 N x N matrix depends on O(N)-length vector

• Need only O(N) space

• Need to exploit this to minimize space requirements


Kernel summation on GPU
• Data:
 Source points xi, i=1,…,N
 Evaluation points yj, j=1,…,M
• Each thread evaluates the sum corresponding to one evaluation point
• Algorithm:
 1. Load the evaluation point corresponding to the current thread into a local register.
 2. Load the first chunk of source data into shared memory.
 3. Evaluate the part of the kernel sum corresponding to the source data in shared memory.
 4. Store the result in a local register.
 5. If all the source points have not been processed yet, load the next chunk and go to Step 3.
 6. Write the sum in the local register to global memory.
Gaussian kernel on GPU
// Kernel-summation kernel. The signature and the data-loading code marked
// below are reconstructed; the original slide showed only the computation.
__global__ void gaussianKernelSum(const float *X, const float *q, const float *y,
                                  const float *h, float *f, int N, int M)
{
  int j = blockIdx.x*BLOCK_SIZE + threadIdx.x;   // evaluation point for this thread
  float sum = 0.0f;
  __shared__ float hs[DIM];
  float yr[DIM];

  // load 'h' (bandwidths) into shared memory  [reconstructed]
  for (int k = threadIdx.x; k < DIM; k += BLOCK_SIZE) hs[k] = h[k];

  // load this thread's evaluation point into registers
  for (int k = 0; k < DIM; k++)
    yr[k] = (j < M) ? y[k + j*DIM] : 0.0f;

  for (int b = 0; b < N; b += BLOCK_SIZE) {
    __shared__ float qs[BLOCK_SIZE];
    __shared__ float Xs[BLOCK_SIZE][DIM];

    // load X & q: one source point per thread of the block  [reconstructed]
    if (b + threadIdx.x < N) {
      qs[threadIdx.x] = q[b + threadIdx.x];
      for (int k = 0; k < DIM; k++) Xs[threadIdx.x][k] = X[(b + threadIdx.x)*DIM + k];
    }
    __syncthreads();

    for (int i = 0; i < BLOCK_SIZE; i++) {
      if ((b + i) < N) {
        float dist = 0.0f;
        for (int k = 0; k < DIM; k++) {
          float tempDiff = yr[k] - Xs[i][k];
          dist += (tempDiff*tempDiff)/(hs[k]*hs[k]);
        }
        sum += __expf(-dist)*qs[i];
      }
    }
    __syncthreads();   // finish this chunk before the next one is loaded
  }
  if (j < M) f[j] = sum;
}
Kernels tested:
 Gaussian
 Matern
 Periodic
 Epanechnikov

Raw speedups across dimension

[Figure: speedup of the GPU kernel summation over the CPU version across data dimensionality]
Applications
• Kernel density estimation
• Gaussian process regression
• Meanshift clustering
• Ranking
• And many more…
Kernel Density Estimation
• Non-parametric way of estimating probability density
function of a random variable

• Two popular kernels: Gaussian and Epanechnikov

• Accelerated with GPU based algorithm:


 Speed up ~ 450X
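For reference (the standard definition, consistent with the kernel summation above; not from the slides), the density estimate from samples x_1, …, x_N with kernel K_h is

\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i)

Evaluating \hat{p} at M query points is exactly the O(NM) kernel summation that the GPU code above parallelizes, which is where the reported speedup comes from.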
Results on standard distributions
• Performed KDE on 15 normal mixture densities from [1]

[1] J. S. Marron and M. P. Wand, "Exact Mean Integrated Squared Error", The Annals of Statistics, 1992, Vol. 20, No. 2, pp. 712–736.
Gaussian Process Regression
• Non-parametric regression
 Kernel regression
 Robust in non-linear modeling
• For y = f(x) + ε; given y and x, need to model ‘f’
• Given
 Data D = {xi, yi}, i=1..N
 Test point x*, need to find f(x*) or f*
• GPR model  f* = k*ᵀ (K + σ²I)⁻¹ y
 K = kernel matrix of the training data
 k* = kernel vector of the test point w.r.t. all training data
Gaussian Process Regression
• GPR model  f* = k*ᵀ (K + σ²I)⁻¹ y
• Complexity – O(N³) due to the inversion of the kernel matrix (can be made O(N²) using Conjugate Gradient)
 A popular iterative Krylov algorithm
• Popular kernels that are used in GPR are:
 Gaussian
 Matern
 Periodic
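Tying this to the conjugate-gradient sketch given earlier (names are illustrative): solve (K + σ²I) α = y once with CG, then each prediction is just a dot product of k* with α.

/* Predict f* at a test point, given alpha = (K + sigma^2 I)^{-1} y
   computed once with conjugate_gradient(). */
float gpr_predict(const float *kstar, const float *alpha, int N)
{
    float fstar = 0.0f;
    for (int i = 0; i < N; i++)
        fstar += kstar[i] * alpha[i];    /* f* = k*^T alpha */
    return fstar;
}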
Gaussian Process Regression

[Figure: performance of Gaussian Process Regression with a Gaussian kernel — time taken (log scale, 10⁻² to 10²) vs. data size (10² to 10⁵) for CPU and GPU implementations]
GPR on standard datasets
GPU based kernel summation
• Still O(N2)!
• A linear approximation algorithm can beat this
beyond “some” N
• FMM – based Gaussian kernel (FIGTREE) vs GPU
version
[Figures: FIGTREE vs. GPU comparisons (1, 2, 3)]
Further:
• More interesting: FMM on GPU
• Issues on data structures
• Need to consider many factors.

• Will be discussed next class by Dr. Nail Gumerov


Quotes
GPUs have evolved to the point where many real world
applications are easily implemented on them and run
significantly faster than on multi-core systems.

Future computing architectures will be hybrid systems with


parallel-core GPUs working in tandem with
multi-core CPUs.

Jack Dongarra
Professor, University of Tennessee
Author of Linpack
We’ve all heard ‘desktop supercomputer’ claims in the past,
but this time it’s for real: NVIDIA and its partners will be
delivering outstanding performance and broad applicability
to the mainstream marketplace.

Heterogeneous computing is what makes such a


breakthrough possible.

Burton Smith
Technical Fellow, Microsoft
Formerly, Chief Scientist at Cray
