Академический Документы
Профессиональный Документы
Культура Документы
(Device) Grid
Host
Block (0, 0)
Block (1, 0)
Shared Memory
Registers
Registers
Shared Memory
Registers
Registers
Local
Memory
Local
Memory
Local
Memory
Local
Memory
Global
Memory
Constant
Memory
Texture
Memory
Thread
Local Memory:
Local Memory
Shared
Memory
Block
per-thread
Grid 0
...
Global
Memory
Grid 1
...
Sequential
Grids
in Time
SM Memory Architecture
t0 t1 t2 tm
SM 0 SM 1
MT IU
SP
t0 t1 t2 tm
MT IU
Blocks
SP
Blocks
Shared
Memory
Courtesy:
John Nicols, NVIDIA
Shared
Memory
TF
Texture L1
L2
Memory
SM Register File
I$
L1
Multithreaded
Instruction Buffer
R
F
C$
L1
Shared
Mem
Operand Select
MAD
SFU
4blocks
3blocks
This is an implementation
decision, not part of CUDA
Registers are dynamically
partitioned across all blocks
assigned to the SM
Once assigned to a block, the
register is NOT accessible by
threads in other blocks
Each thread in the same block
only access registers assigned
to itself
M
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
10
Memory Coalescing
When accessing global memory, peak performance utilization
occurs when all threads in a half warp access continuous
memory locations.
Notcoalesced
Md
coalesced
Nd
WIDTH
Thread1
Thread2
WIDTH
11
TimePeriod1
TimePeriod2
T1 T2 T3 T4
T1 T2 T3 T4
M
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
12
Access
directionin
Kernel
code
TimePeriod2
T1
T2
T3
T4
TimePeriod1
T1
T2
T3
T4
M
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
13
Constants
Multithreaded
Instruction Buffer
L1 per SM
I$
L1
R
F
Shared
Mem
Operand Select
MAD
C$
L1
SFU
14
Shared Memory
Each SM has 16 KB of Shared Memory
I$
L1
Multithreaded
Instruction Buffer
R
F
C$
L1
Shared
Mem
Operand Select
MAD
SFU
15
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Bank 15
16
No Bank Conflicts
Linear addressing
stride == 1
No Bank Conflicts
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Thread 6
Thread 7
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Thread 6
Thread 7
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Thread 15
Bank 15
Thread 15
Bank 15
17
Linear addressing
stride == 2
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 8
Thread 9
Thread 10
Thread 11
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Thread 6
Thread 7
Bank 15
Thread 15
Linear addressing
stride == 8
x8
x8
Bank 0
Bank 1
Bank 2
Bank 7
Bank 8
Bank 9
Bank 15
18
19
Bank Conflict: multiple threads in the same half-warp access the same
bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
20
Linear Addressing
Given:
s=1
Thread 0
Thread 1
Bank 0
Bank 1
Thread 2
Thread 3
Bank 2
Bank 3
Thread 4
Bank 4
Thread 5
Thread 6
Bank 5
Bank 6
Thread 7
Bank 7
Thread 15
Bank 15
s=3
Thread 0
Thread 1
Bank 0
Bank 1
Thread 2
Thread 3
Bank 2
Bank 3
Thread 4
Bank 4
Thread 5
Thread 6
Bank 5
Bank 6
Thread 7
Bank 7
Thread 15
Bank 15
21