Kunle Olukotun
The Hydra Team
Computer Systems Laboratory
Stanford University
Technology Architecture
Moore's law: 100 million transistors by 2000
Wires get slower relative to transistors; long cross-chip wires are especially slow
Plenty of room for innovation
Single-cycle communication requires localized blocks of logic
High communication bandwidth across the chip is easier to achieve than low latency
Architectural implications
[Figure: parallelism at the thread and loop levels]
Hydra Approach
A single-chip multiprocessor architecture composed of simple, fast processors
Multiple threads of control
Outline
Base Hydra architecture
Performance of base architecture
Speculative thread support
Speculative thread performance
Improving speculative thread performance
Hydra prototype design
Conclusions
[Figure: Hydra base architecture — four CPUs, each with separate L1 instruction and data caches, sharing an on-chip L2 cache and the I/O devices]
Single-chip multiprocessor with four processors
Separate primary caches; write-through data caches to maintain coherence
Shared second-level cache
Low-latency interprocessor communication (10 cycles)
Separate read and write buses
[Figure: speedup of 4-processor Hydra vs. a superscalar (SS) on compress, MPEG2, applu, swim, apsi, tomcatv, eqntott, m88ksim, OLTP, and pmake]
ILP only: SS is 30-50% better than a single Hydra processor
ILP & fine-grained threads: SS and Hydra are comparable
ILP & coarse-grained threads: Hydra is 1.5-2× better
("The Case for a Single-Chip Multiprocessor," ASPLOS '96)
Problem: threads have data dependencies, which force synchronization
Pointer disambiguation is difficult and expensive; compile-time analysis is too conservative
Goal: remove the need for pointer disambiguation and allow the compiler to be aggressive
Loads and stores follow the original sequential semantics
Speculation hardware ensures correctness; synchronization is added only for performance
Loop parallelization is now easily automated
Code can be broken into arbitrary threads (e.g., speculative subroutines)
Parallel execution with sequential commits (cf. Wisconsin Multiscalar)
Hydra provides low-overhead support for speculation on a CMP
Forward data between parallel threads
Detect violations when reads occur too early
[Figure: timeline of speculative threads — a write to X after a premature read is detected as a violation; bad state is trashed, while correctly speculated state becomes permanent]
Safely discard bad state after a violation
Correctly retire speculative state
[Figure: Hydra with speculation support — each CPU gains a CP2 speculation coprocessor, and per-CPU speculative write buffers (#0-#3) sit between the write bus and the shared on-chip L2 cache and I/O devices]
Write bus and L2 buffers provide forwarding
Read L1 tag bits detect violations
Dirty L1 tag bits and write buffers provide backup
Write buffers reorder and retire speculative state
Separate L1 caches with pre-invalidation and smart L2 forwarding maintain each CPU's view of memory
Speculation coprocessors (CP2) control the threads
Speculative Reads
[Figure: a speculative read on CPU #i — the non-speculative head CPU and the earlier speculative CPUs each hold a write buffer (priority A-D) in front of the L2 cache]
On an L1 hit, the line's read bits are set
On an L1 miss, the L2 and the write buffers of all earlier CPUs are checked in parallel
The newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
Speculative Writes
A CPU writes to its L1 cache and write buffer
Earlier CPUs invalidate our L1 and cause RAW hazard checks
Later CPUs just pre-invalidate our L1
The non-speculative write buffer drains into the L2
Software Handlers
Control speculative threads through the CP2 interface
Track the order of all speculative threads
Exception routines recover from data dependency violations
Adds more overhead to speculation than pure hardware, but is more flexible and simpler to implement
Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99)
Speculative loops
for and while loop iterations
Typically one speculative thread per iteration

Speculative procedures
Execute the code after a procedure call speculatively
Procedure calls generate a speculative thread

Pfor, pwhile
Analyze the loop body and globalize any local variables that could cause loop-carried dependencies
[Figure: base speculative speedup (0-3.5×) on m88ksim, ijpeg, cholesky, simplex, sparse1.3, mpeg2, eqntott, alvin, compress, grep, ear, and wc]
Entire applications compiled with GCC 2.7.2 -O2
Four single-issue processors
Accurate modeling of all aspects of the Hydra architecture and the real runtime system
Threads are not created sequentially, so dynamic thread scheduling is necessary
Base RTS overheads: 75 cycles at the start and end of a loop, 80 cycles at the end of an iteration
The best-performing speculative applications use loops; procedure speculation often lowers performance
Optimize the RTS for the common case: 25 cycles at the start and end of a loop, 12 cycles at the end of an iteration (almost a factor of 7 better)
Limit procedure speculation to specific procedures
Performance
[Figure: speedup with the optimized RTS on m88ksim, ijpeg, cholesky, sparse1.3, compress, simplex, mpeg2, eqntott, alvin, grep, ear, and wc]
Improves performance of all applications
Most improvement for applications with fine-grained threads
eqntott uses procedure speculation
No explicit data movement, so communication latency is 100+ cycles
Need to optimize for data locality: look at cache misses (MemSpy, Flashpoint)
No explicit data independence, so frequent dependence violations limit performance
Need to reduce the frequency and impact of violations; dependence prediction can help
Look at violation statistics (requires some hardware support)
Speculative threads

Feedback tool
Collects violation statistics (PCs, frequency, work lost)
Correlates read and write PC values with source code

Synchronization
Synchronize frequently occurring violations
Use non-violating loads

Code motion
Find dependent load-store pairs
Move loads down in the thread; move stores up in the thread
Code Motion
Rearrange reads and writes to increase parallelism
Delay reads and advance writes
Create local copies to allow earlier data forwarding
[Figure: iterations i and i+1 before and after code motion — advancing the write of x and delaying the reads shortens the window in which iteration i+1 can read x too early]
[Figure: speedup on cholesky, simplex, sparse1.3, mpeg2, eqntott, alvin, compress, m88ksim, ijpeg, grep, ear, and wc under three configurations: base performance; optimized RTS with no manual intervention; violation statistics used to manually transform code]
[Figure: maximum speculative write-buffer occupancy per application, in cache lines]
The maximum speculative state determines the write buffer size needed for maximum performance
A non-head processor stalls when its write buffer fills up
Small write buffers (< 64 lines) will achieve good performance
Hydra Prototype
Design based on the Integrated Device Technology (IDT) RC32364
88 mm² in 0.25 µm, with 8 KB L1 instruction and data caches and a 128 KB L2
Conclusions
A single-chip MP exploits parallelism at all levels
Low-overhead support for speculative parallelism
Provides high performance on applications with medium- to large-grained parallelism
Allows a performance-optimization migration path for difficult-to-parallelize, fine-grained applications

Prototype Implementation
Work out implementation details
Provide a platform for application and compiler development
Realistic performance evaluation
Hydra Team
Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim, and Maciek Kozyrczak (IDT)
URL
http://www-hydra.stanford.edu