Kunle Olukotun
The Hydra Team
Computer Systems Laboratory
Stanford University
Technology Architecture
Moore's law: 100 million transistors by 2000
Wires get slower relative to transistors; long cross-chip wires are especially slow
Plenty of room for innovation
Single-cycle communication requires localized blocks of logic
High communication bandwidth across the chip is easier to achieve than low latency
Architectural implications
[Figure: parallelism at the thread and loop levels]
Hydra Approach
A single-chip multiprocessor architecture composed of simple, fast processors
Multiple threads of control
Outline
Base Hydra architecture
Performance of base architecture
Speculative thread support
Speculative thread performance
Improving speculative thread performance
Hydra prototype design
Conclusions
[Figure: Hydra base architecture — four CPUs, each with separate L1 instruction and data caches, sharing an on-chip L2 cache and the I/O devices]
Single-chip multiprocessor with four processors
Separate primary caches; write-through data caches to maintain coherence
Shared second-level cache
Low-latency interprocessor communication (10 cycles)
Separate read and write buses
[Figure: speedup of 4-processor Hydra vs. a superscalar (SS) on compress, MPEG2, applu, swim, apsi, tomcatv, eqntott, m88ksim, OLTP, and pmake]
ILP only: SS is 30-50% better than a single Hydra processor
ILP & fine-grained threads: SS and Hydra are comparable
ILP & coarse-grained threads: Hydra is 1.5-2× better
("The Case for a Single-Chip Multiprocessor," ASPLOS '96)
Problem: threads have data dependencies, which force synchronization
Pointer disambiguation is difficult and expensive; compile-time analysis is too conservative
Goal: remove the need for pointer disambiguation and allow the compiler to be aggressive
Loads and stores follow the original sequential semantics
Speculation hardware ensures correctness; synchronization is added only for performance
Loop parallelization is now easily automated
Code can be broken into arbitrary threads (e.g., speculative subroutines)
Parallel execution with sequential commits (cf. Wisconsin Multiscalar)
Hydra provides low-overhead support for speculation on a CMP
Forward data between parallel threads
Detect violations when reads occur too early
[Figure: timeline of speculative threads — a write to X after a premature read is detected as a violation; bad state is trashed, while correctly speculated state becomes permanent]
Safely discard bad state after a violation
Correctly retire speculative state
[Figure: Hydra with speculation support — each CPU gains a CP2 speculation coprocessor, and per-CPU speculative write buffers (#0-#3) sit between the write bus and the shared on-chip L2 cache and I/O devices]
Write bus and L2 buffers provide forwarding
Read L1 tag bits detect violations
Dirty L1 tag bits and write buffers provide backup
Write buffers reorder and retire speculative state
Separate L1 caches with pre-invalidation and smart L2 forwarding maintain each CPU's view of memory
Speculation coprocessors (CP2) control the threads
Speculative Reads
[Figure: a speculative read on CPU #i — the non-speculative head CPU and the earlier speculative CPUs each hold a write buffer (priority A-D) in front of the L2 cache]
On an L1 hit, the line's read bits are set
On an L1 miss, the L2 and the write buffers of all earlier CPUs are checked in parallel
The newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
Speculative Writes
A CPU writes to its L1 cache and write buffer
Earlier CPUs invalidate our L1 and cause RAW hazard checks
Later CPUs just pre-invalidate our L1
The non-speculative write buffer drains into the L2
Software Handlers
Control speculative threads through the CP2 interface
Track the order of all speculative threads
Exception routines recover from data dependency violations
Adds more overhead to speculation than pure hardware, but is more flexible and simpler to implement
Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99)
Speculative loops
for and while loop iterations
Typically one speculative thread per iteration

Speculative procedures
Execute the code after a procedure call speculatively
Procedure calls generate a speculative thread

Pfor, pwhile
Analyze the loop body and globalize any local variables that could cause loop-carried dependencies
[Figure: base speculative speedup (0-3.5×) on m88ksim, ijpeg, cholesky, simplex, sparse1.3, mpeg2, eqntott, alvin, compress, grep, ear, and wc]
Entire applications compiled with GCC 2.7.2 -O2
Four single-issue processors
Accurate modeling of all aspects of the Hydra architecture and the real runtime system
Threads are not created sequentially, so dynamic thread scheduling is necessary
Base RTS overheads: 75 cycles at the start and end of a loop, 80 cycles at the end of an iteration
The best-performing speculative applications use loops; procedure speculation often lowers performance
Optimize the RTS for the common case: 25 cycles at the start and end of a loop, 12 cycles at the end of an iteration (almost a factor of 7 better)
Limit procedure speculation to specific procedures
Performance
[Figure: speedup with the optimized RTS on m88ksim, ijpeg, cholesky, sparse1.3, compress, simplex, mpeg2, eqntott, alvin, grep, ear, and wc]
Improves performance of all applications
Most improvement for applications with fine-grained threads
eqntott uses procedure speculation
No explicit data movement, so communication latency is 100+ cycles
Need to optimize for data locality: look at cache misses (MemSpy, Flashpoint)
No explicit data independence, so frequent dependence violations limit performance
Need to reduce the frequency and impact of violations; dependence prediction can help
Look at violation statistics (requires some hardware support)
Speculative threads

Feedback tool
Collects violation statistics (PCs, frequency, work lost)
Correlates read and write PC values with source code

Synchronization
Synchronize frequently occurring violations
Use non-violating loads

Code motion
Find dependent load-store pairs
Move loads down in the thread; move stores up in the thread
Code Motion
Rearrange reads and writes to increase parallelism
Delay reads and advance writes
Create local copies to allow earlier data forwarding
[Figure: iterations i and i+1 before and after code motion — advancing the write of x and delaying the reads shortens the window in which iteration i+1 can read x too early]
[Figure: speedup on cholesky, simplex, sparse1.3, mpeg2, eqntott, alvin, compress, m88ksim, ijpeg, grep, ear, and wc under three configurations: base performance; optimized RTS with no manual intervention; violation statistics used to manually transform code]
[Figure: maximum speculative write-buffer occupancy per application, in cache lines]
The maximum speculative state determines the write buffer size needed for maximum performance
A non-head processor stalls when its write buffer fills up
Small write buffers (< 64 lines) will achieve good performance
Hydra Prototype
Design based on the Integrated Device Technology (IDT) RC32364
88 mm² in 0.25 µm, with 8 KB L1 instruction and data caches and a 128 KB L2
Conclusions
A single-chip MP exploits parallelism at all levels
Low-overhead support for speculative parallelism
Provides high performance on applications with medium- to large-grained parallelism
Allows a performance-optimization migration path for difficult-to-parallelize, fine-grained applications

Prototype Implementation
Work out implementation details
Provide a platform for application and compiler development
Realistic performance evaluation
Hydra Team
Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim, and Maciek Kozyrczak (IDT)
URL
http://www-hydra.stanford.edu