Graphics Processors
• Massively parallel
• The graphics processing unit (GPU) on commodity
video cards has evolved into an extremely flexible
and powerful processor
Programmability
Precision
Power
• GPGPU: an emerging field seeking to harness GPUs
for general-purpose computation
Parallel Computing on a GPU
[Image: NVIDIA Tesla S870]
Computational Power
• GPUs are fast…
3.0 GHz dual-core Pentium4: 24.6 GFLOPS
NVIDIA GeForceFX 7800: 165 GFLOPS
1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
CPUs: 1.4× annual growth
GPUs: 1.7×(pixels) to 2.3× (vertices) annual growth
CPU vs GPU
Flexible and Precise
• Modern GPUs are deeply programmable
Programmable pixel, vertex, video engines
Solidifying high-level language support
• Modern GPUs support high precision
32 bit floating point throughout the pipeline
High enough for many (not all) applications
GPU for graphics
• GPUs designed for & driven by video games
Programming model unusual
Programming idioms tied to computer graphics
Programming environment tightly constrained
• Underlying architectures are:
Inherently parallel
Rapidly evolving (even in basic feature set!)
Largely secret
General purpose GPUs
• The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
• Example applications range from in-game physics
simulation to conventional computational science
• Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor
Previous GPGPU Constraints
• Dealing with the graphics API
Working with the corner cases of the graphics API
• Addressing modes
• Shader capabilities
Limited outputs
[Figure: the fragment-program model: Input Registers (per thread, per shader, per context), Fragment Program, Texture, Temp Registers, Output Registers]
GPUs + CUDA = The Democratization of Parallel Computing
Example applications:
• Interactive visualization of volumetric white matter connectivity
• Ionic placement for molecular dynamics simulation on GPU
• Transcoding HD video stream to H.264
• Fluid mechanics in Matlab using .mex file CUDA function
• Astrophysics N-body simulation
• Financial simulation of LIBOR model with swaptions
• GLAME@lab: an M-script API for GPU linear algebra
• Ultrasound medical imaging for cancer diagnostics
• Highly optimized object-oriented molecular dynamics
• Cmatch exact string matching to find similar proteins and gene sequences
[Figure: representative runtimes: tasks taking 4.6 days, 2.7 days, 8 hours, and 3 hours reduced to 30, 27, 16, and 13 minutes]
4 GPUs vs 256 CPUs
CalcUA ($5 million) vs Tesla Personal Supercomputer ($10,000)
Hierarchical organization
• Thread: per-thread local memory
• Block: per-block shared memory, with a local barrier among the block's threads
• Kernel 0, Kernel 1, …: per-device global memory, with a global barrier between dependent kernels
[Figure: thread / block / kernel hierarchy and the memories associated with each level]
Heterogeneous Programming
• CUDA = serial program with parallel kernels, all in C
Serial C code executes in a CPU thread
Parallel kernel C code executes in thread blocks
across multiple processing elements
Serial Code ...
Parallel Kernel: foo<<<nBlk, nTid>>>(args);
Serial Code ...
Parallel Kernel: bar<<<nBlk, nTid>>>(args);
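As a concrete illustration of this structure (the body of foo and the array setup are a sketch of ours, not from the slides):

// Hypothetical element-wise kernel: each thread scales one array element.
__global__ void foo(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= s;                         // guard against extra threads
}

int main()
{
    const int n = 1 << 20, nTid = 256, nBlk = (n + nTid - 1) / nTid;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));          // serial C code runs on the CPU
    cudaMemset(d_data, 0, n * sizeof(float));

    foo<<<nBlk, nTid>>>(d_data, 2.0f, n);            // parallel kernel runs on the GPU
    cudaDeviceSynchronize();                         // wait before more serial work

    cudaFree(d_data);
    return 0;
}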
What is a thread?
• Independent thread of execution
has its own PC, variables (registers), processor state,
etc.
no implication about how threads are scheduled
• CUDA threads might be physical threads
as on NVIDIA GPUs
• CUDA threads might be virtual threads
might pick 1 block = 1 physical thread on multicore
CPU
What is a thread block?
• Thread block = virtualized multiprocessor
freely choose processors to fit data
freely customize for each kernel launch
• Data parallelism
across threads in a block
across blocks in a kernel
• Task parallelism
different blocks are independent
independent kernels
Block = virtualized multiprocessor
• Provides programmer flexibility
freely choose processors to fit data
freely customize for each kernel launch
[Figure: GPU as an array of multiprocessors, each with a multithreaded instruction unit (MT IU), scalar processors (SP), and shared memory, all backed by device memory]
Synchronization & Cooperation
• Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …
• Blocks coordinate via atomic memory operations
e.g., increment shared queue pointer with
atomicInc()
• Implicit barrier between dependent kernels
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
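A minimal sketch combining both mechanisms (the two-step computation and the block-id queue are illustrative, not from the slides):

__global__ void two_step(float *out, const float *in,
                         unsigned int *queue_tail, int *queue)
{
    __shared__ float tile[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x; // assumes the launch covers the array

    tile[threadIdx.x] = in[i];                     // Step 1: each thread stages one element
    __syncthreads();                               // barrier: Step 1 finished block-wide

    int next = (threadIdx.x + 1) % blockDim.x;     // Step 2 reads a neighbour's element,
    out[i] = tile[threadIdx.x] + tile[next];       // which is only safe after the barrier

    if (threadIdx.x == 0)                          // blocks cooperate via atomics:
        queue[atomicInc(queue_tail, gridDim.x)] = blockIdx.x;   // append this block's id
}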
CUDA Memories
G80 Implementation of CUDA Memories
• Each thread can:
Read/write per-thread registers
Read/write per-thread local memory
Read/write per-block shared memory
Read/write per-grid global memory
[Figure: grid of blocks (Block (0,0), Block (1,0)), each with registers and shared memory, backed by global memory]
[Figure: memory hierarchy: per-thread local memory, per-block shared memory, and per-device global memory shared by sequential grids (Grid 0, Grid 1) in time]
Memory model
[Figure: host memory and the per-device memories (Device 0 memory, Device 1 memory)]
A Common Programming Strategy
• Global memory resides in device memory (DRAM) - much
slower access than shared memory
• So, a profitable way of performing computation on the device is
to tile data to take advantage of fast shared memory:
Partition data into subsets that fit into shared memory
Handle each data subset with one thread block by:
Loading the subset from global memory to shared memory, using
multiple threads to exploit memory-level parallelism
Performing the computation on the subset from shared memory; each
thread can efficiently multi-pass over any data element
Copying results from shared memory to global memory
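A minimal sketch of this load / compute / write-back pattern (the kernel name, tile size, and the trivial per-tile computation are illustrative):

#define TILE 256

__global__ void process_tiles(float *out, const float *in, int n)
{
    __shared__ float tile[TILE];                   // fast on-chip staging area
    int i = blockIdx.x * TILE + threadIdx.x;       // this block's subset of the data

    // 1. Load the subset from global to shared memory, one element per thread.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // subset fully staged

    // 2. Compute on the subset from shared memory; any thread may revisit any
    //    element of the tile cheaply (here: subtract the tile's first value).
    float result = tile[threadIdx.x] - tile[0];
    __syncthreads();                               // needed if the tile were reloaded for another phase

    // 3. Copy results back to global memory.
    if (i < n) out[i] = result;
}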
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory
(DRAM) - much slower access than shared memory
But… cached!
Highly efficient access for read-only data
• Carefully divide data according to access patterns
R/Only → constant memory (very fast if in cache)
R/W, shared within a block → shared memory (very fast)
R/W within each thread → registers (very fast)
R/W inputs/results → global memory (very slow)
Is that all??
• No!!
• Memory Coalescing
• Bank conflicts
Memory Coalescing
• When accessing global memory, peak performance utilization occurs when all
threads access contiguous memory locations.
[Figure: threads 1 and 2 traversing Md and Nd (WIDTH × WIDTH matrices)]
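A small sketch of the difference (kernel names are illustrative; on G80-class hardware the first pattern lets a half-warp's loads be combined into one wide transaction):

// Coalesced: consecutive threads touch consecutive addresses.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Uncoalesced: consecutive threads stride through memory,
// forcing many separate, narrower transactions.
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];
}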
Memory Layout of a Matrix in C
[Figure: kernel access direction vs row-major storage: M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, then M0,2 M1,2 M2,2 M3,2, then M0,3 M1,3 M2,3 M3,3, laid out consecutively in memory]
Parallel Memory Architecture for Shared Memory
• In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
• Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Figure: shared memory divided into banks 0 through 15]
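A small sketch of a conflict-free and a conflicting access pattern (illustrative; assumes the G80 layout of 16 banks of 32-bit words):

__global__ void bank_demo(float *out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;

    buf[t] = (float)t;
    __syncthreads();

    float ok  = buf[t];              // stride 1: each thread in a half-warp hits a different bank
    float bad = buf[(t * 16) % 256]; // stride 16: all 16 threads of a half-warp hit bank 0,
                                     // so the accesses are serialized (16-way conflict)
    out[t] = ok + bad;
}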
Bank Addressing Examples
[Figure: shared-memory bank addressing examples]

[Figure: cudaMemcpy() transfers between host memory and the per-device memories (Device 0, Device 1)]
CUDA: Minimal extensions to C/C++
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel callable from host
__device__ void DeviceFunc(...); // function callable on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // in per-block shared memory
• Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;
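The rest of this slide did not survive extraction; a sketch of how the host code typically continues (the device pointer names and the cleanup are our assumption, following the launch shown above):

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, N * sizeof(float));
cudaMalloc(&d_B, N * sizeof(float));
cudaMalloc(&d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy the result back and release device memory
float *h_C = (float *)malloc(N * sizeof(float));
cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);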
Recall: GPU and CUDA
• GPU – developed for accelerating graphics
• CUDA – developed to harness the power of GPUs for
general purpose applications
Like C in syntax
• GPU – not a panacea
Used in a master-slave scenario with CPU (host) as
master
Recall: GPU memories
• Each thread can:
Read/write per-thread registers
Read/write per-thread local memory
Read/write per-block shared memory (very fast)
Read/write per-grid global memory (slow)
Read-only per-grid constant memory (very fast if in cache)
[Figure: grid of blocks (Block (0,0), Block (1,0)) with registers and shared memory; host and per-device global memory; sequential grids (Grid 0, Grid 1) in time]
Recall: Heterogeneous programming
// CPU code
cudaMalloc()                    // allocate memories on the device
cudaMemcpy()                    // transfer input data to the device
Kernel<<<blocks, threads>>>()   // call CUDA kernels
                                // kernels are functions, each instance of which runs on a single thread
cudaMemcpy()                    // transfer results from the device
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column index of Pd and N
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
Pd[Row*Width+Col] = Pvalue;
}
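A hedged sketch of the host-side wrapper that would launch this kernel (the function name is ours; Width is assumed to be a multiple of TILE_WIDTH):

void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc(&Md, size);  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Pd, size);

    // one thread per Pd element, organized as TILE_WIDTH x TILE_WIDTH blocks
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}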
How about performance on G80?
• All threads access global memory for their input matrix elements
Two memory accesses (8 bytes) per floating-point multiply-add
That is 4 bytes of memory bandwidth per FLOP
4 × 346.5 GFLOPS = 1386 GB/s required to achieve the peak FLOP rating
The actual 86.4 GB/s limits the code to 21.6 GFLOPS
[Figure: grid of two blocks, each with shared memory, registers, and threads (Thread (0,0), Thread (1,0)), reading the input matrices from global memory]
• Each input element is read by WIDTH threads
• Load each element into shared memory once and have several threads use the local copy: this is the idea behind tiled algorithms
[Figure: M and P matrices, WIDTH × WIDTH]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH × TILE_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) select the Pdsub tile of Pd]
A Small Example
[Figure: small tiled-multiply example with elements Md0,0–Md3,1, Nd0,0–Nd1,3, and Pd0,0–Pd3,3, shown once per tile phase]
First-order Size Considerations in G80
• Each thread block should have many threads
TILE_WIDTH of 16 gives 16*16 = 256 threads
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH × TILE_WIDTH
• Each thread computes one element of Pdsub
[Figure: tiled multiply indexing: block indices (bx, by), thread indices (tx, ty), tile phase m, and inner index k over Md, Nd, and Pd]
G80 Shared Memory and Threading
• Each SM in G80 has 16 KB of shared memory
Shared memory size is implementation dependent!
For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory
So each SM can potentially have up to 8 thread blocks actively executing
This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16×16 tiling, we reduce the accesses to global memory by a factor of 16
The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
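A sketch of the tiled kernel the preceding slides describe (the standard shared-memory formulation; boundary handling is omitted and Width is assumed to be a multiple of TILE_WIDTH):

#define TILE_WIDTH 16

__global__ void MatrixMulTiled(float *Md, float *Nd, float *Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];   // tile of Md
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];   // tile of Nd

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {  // loop over tile phases
        // each thread loads one element of each input tile
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                            // both tiles fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)        // each loaded element is reused TILE_WIDTH times
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();                            // finished before the next tiles overwrite
    }
    Pd[Row * Width + Col] = Pvalue;
}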
Tiling Size Effects
[Figure: measured GFLOPS (0 to 100) for tiled-only and tiled & unrolled kernels at 4×4, 8×8, 12×12, and 16×16 tiles, compared with the untiled version]
Typical Structure of a CUDA Program
• Global variable declarations
__host__, __device__, __global__, __constant__, __texture__
• Function prototypes
__global__ void kernelOne(…)
float handyFunction(…)
• main()
allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
execution configuration setup
kernel call – kernelOne<<<execution configuration>>>(args…)   (repeat as needed)
transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
optional: compare against a golden (host-computed) solution
• Kernel – void kernelOne(type args, …)
variable declarations – __local__, __shared__
automatic variables transparently assigned to registers or local memory
__syncthreads()…
• Other functions
float handyFunction(int inVar, …);
GPU for Machine learning
Machine learning
• With improved sensors, the amount of available data has increased severalfold over the past decade.
• Also, more robust and sophisticated learning
algorithms have been developed to extract
meaningful information from the data
• This has resulted in the application of these
algorithms in many areas:
Geostatistics, astronomical prediction, weather data assimilation, computational finance
Extracting information from the data
• “Extracting information from the data” means converting the
raw data to an interpretable version
For example, given a face image, it would be desirable to extract the identity of the person, the face pose, etc.
• Information extraction categories
Regression – [fitting a continuous function]
Classification – [classify into one of the predefined classes]
Density estimation – [evaluating the class membership]
Ranking – [preference relationships between classes]
• Bottom-line: Infer the relationships based on the data
Build the relationship model from the data
Relationship modeling
• There are two primary categories of the models
Parametric
Non-parametric
• Parametric model:
Assumes a known parametric form of the “relationship”
Estimates the parameters of this “form” from the data
• Non-parametric model
Does not make any assumptions about the form of the underlying function
“Letting the data speak for itself”
Kernel methods
• A class of robust non-parametric learning methods
• Projects the data into a higher-dimensional space
• Formulates the problem so that only inner products of the higher-dimensional features are required
• These inner products are given by kernel functions
• For example, the Gaussian kernel is given by:
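The equation itself did not survive extraction; one common form (normalization conventions vary; $h$ is the bandwidth) is

$$K(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{h^2}\right)$$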
Scalable learning methods
• Most of these kernel-based learning approaches scale as O(N²) or O(N³) in time with respect to the data size
[Figure: example kernel functions: Gaussian, Matérn, Periodic, Epanechnikov]
[Figure: raw speedups across dimension]
Applications
• Kernel density estimation
• Gaussian process regression
• Meanshift clustering
• Ranking
• And many more…
Kernel Density Estimation
• Non-parametric way of estimating probability density
function of a random variable
[Figure: time taken (seconds, log scale, 10⁻² to 10¹) vs size of data (10² to 10⁵)]
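For reference, the standard kernel density estimate from samples $x_1, \dots, x_N$ with kernel $K$ and bandwidth $h$ (a textbook formula, not taken from these slides) is

$$\hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),$$

so evaluating it at $M$ query points is a direct $O(NM)$ kernel summation.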
GPR on standard datasets
GPU based kernel summation
• Still O(N²)!
• A linear-time approximation algorithm can beat this beyond "some" N
• FMM-based Gaussian kernel summation (FIGTREE) vs the GPU version
[Figures: FIGTREE vs GPU comparisons, parts 1 through 3]
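A hedged sketch of the direct O(N²) Gaussian kernel summation on the GPU that these comparisons refer to (one thread per evaluation point; the names and the one-dimensional layout are illustrative):

// f[j] = sum_i q[i] * exp(-(y[j] - x[i])^2 / h^2), computed directly.
__global__ void gauss_sum(float *f, const float *y, const float *x,
                          const float *q, int N, int M, float h)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one evaluation point per thread
    if (j >= M) return;

    float yj = y[j], acc = 0.0f;
    for (int i = 0; i < N; ++i) {                    // full pass over all N source points
        float d = yj - x[i];
        acc += q[i] * __expf(-d * d / (h * h));
    }
    f[j] = acc;
}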
Further:
• More interesting: FMM on GPU
• Issues on data structures
• Need to consider many factors.
Jack Dongarra
Professor, University of Tennessee
Author of LINPACK

"We've all heard 'desktop supercomputer' claims in the past, but this time it's for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace."

Burton Smith
Technical Fellow, Microsoft
Formerly Chief Scientist at Cray