[Figure: NVIDIA GPU generations, G80 and GT200.]
CUDA Computing with Tesla T10
240 SP processors at 1.44 GHz: 1 TFLOPS peak
30 DP processors at 1.44 GHz: 86 GFLOPS peak
128 threads per processor: 30,720 threads total
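These peak figures follow from the numbers above (a back-of-the-envelope check, assuming the usual dual-issue MAD + MUL per SP per cycle and one fused multiply-add per DP unit per cycle):
240 SP × 1.44 GHz × 3 flops/cycle ≈ 1.04 TFLOPS single precision
30 DP × 1.44 GHz × 2 flops/cycle ≈ 86 GFLOPS double precision
240 SP × 128 threads/SP = 30,720 resident threads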
[Figure: Tesla T10 block diagram. The host side (Host CPU, Bridge, System Memory) feeds a Work Distribution unit and ten SM clusters (SMCs); each cluster holds three SMs (per SM: I-Cache, MT Issue, C-Cache, 8 SPs, 2 SFUs, one DP unit, Shared Memory) plus a Texture Unit with a Tex L1 cache; an Interconnection Network links the clusters to eight ROP/L2 partitions.]
Supports standard languages and APIs:
C
OpenCL
DX Compute
Fortran (PGI)

Supported on common operating systems:
Windows
Mac OS X
Linux
float x = input[threadID];
float y = func(x);
output[threadID] = y;
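In CUDA C this per-thread fragment would sit inside a __global__ kernel. A minimal sketch, assuming func stands for an arbitrary device function and that threadID is derived from the built-in block and thread indices (all names beyond the fragment's are illustrative):

__device__ float func(float x) { return 2.0f * x; }        // placeholder for the slide's func

__global__ void apply_func(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (threadID < n) {                                     // guard against running past the array
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}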
Example: Increment Array Elements
Increment N-element vector a by scalar b
CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    ..
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(ad, bd, N);
}
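The main() bodies above elide the setup. A hedged sketch of what the host side of the CUDA version might look like, reusing increment_gpu from above and assuming ad is the device copy of the host array a, b is the scalar increment (bd on the slide), and blocksize is chosen by the programmer; everything beyond the slide's launch call is illustrative:

#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

int main(void)
{
    int N = 1000000;
    int blocksize = 256;
    float b = 2.5f;                                  // scalar increment (bd on the slide)
    size_t size = N * sizeof(float);

    float *a = (float *)malloc(size);                // host array
    for (int i = 0; i < N; i++) a[i] = (float)i;

    float *ad;
    cudaMalloc((void **)&ad, size);                  // device array
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice); // copy input to the GPU

    dim3 dimBlock(blocksize);
    dim3 dimGrid((int)ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(ad, b, N);  // one thread per element

    cudaMemcpy(a, ad, size, cudaMemcpyDeviceToHost); // copy result back
    cudaFree(ad);
    free(a);
    return 0;
}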
7
Pervasive Parallel Computing with CUDA
[Figure: typical CUDA application flow. Init and Alloc on the CPU, then Operations 1, 2, and 3 executed across CPU and GPU.]
Applications
[Chart: Folding@home performance comparison, GPU vs. CPU.]
Heterogeneous Computing
CUDA