
The Stanford Hydra Chip Multiprocessor

Kunle Olukotun and the Hydra Team
Computer Systems Laboratory, Stanford University

Technology → Architecture

- Transistors are cheap, plentiful and fast
  - Moore's law: 100 million transistors by 2000
- Wires are cheap, plentiful and slow
  - Wires get slower relative to transistors
  - Long cross-chip wires are especially slow
- Architectural implications
  - Plenty of room for innovation
  - Single-cycle communication requires localized blocks of logic
  - High communication bandwidth across the chip is easier to achieve than low latency

Exploiting Program Parallelism


[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions]

Hydra Approach

- A single-chip multiprocessor architecture composed of simple, fast processors
- Multiple threads of control
  - Exploits parallelism at all levels
- Memory renaming and thread-level speculation
  - Makes it easy to develop parallel programs
- Keep the design simple by taking advantage of the single-chip implementation

Outline

- Base Hydra architecture
- Performance of the base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions

The Base Hydra Design


[Block diagram: four CPUs, each with its own L1 instruction cache, L1 data cache, and memory controller, connected by a 64-bit write-through bus and a 256-bit read/replace bus to a shared on-chip L2 cache; centralized bus arbitration mechanisms, a Rambus memory interface to DRAM main memory, and an I/O bus interface to I/O devices]

- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared second-level cache
- Low-latency interprocessor communication (10 cycles)
- Separate read and write buses

Hydra vs. Superscalar


[Chart: speedup (0 to 4) of Hydra (4 x 2-way issue) versus a 6-way issue superscalar on compress, MPEG2, applu, swim, apsi, tomcatv, eqntott, m88ksim, OLTP, and pmake]

- ILP only: the superscalar is 30-50% better than a single Hydra processor
- ILP & fine-grained threads: the superscalar and Hydra are comparable
- ILP & coarse-grained threads: Hydra is 1.5-2x better

"The Case for a Single-Chip Multiprocessor," ASPLOS '96

Problem: Parallel Software

- Parallel software is limited
  - Hand-parallelized applications
  - Auto-parallelized dense-matrix FORTRAN applications
- Traditional auto-parallelization of C programs is very difficult
  - Threads have data dependencies → synchronization
  - Pointer disambiguation is difficult and expensive
  - Compile-time analysis is too conservative
- How can hardware help?
  - Remove the need for pointer disambiguation
  - Allow the compiler to be aggressive

Solution: Data Speculation

- Data speculation enables parallelization without regard for data dependencies
  - Loads and stores follow the original sequential semantics
  - Speculation hardware ensures correctness
  - Add synchronization only for performance
  - Loop parallelization is now easily automated (see the sketch below)
- Other ways to parallelize code
  - Break code into arbitrary threads (e.g., speculative subroutines)
  - Parallel execution with sequential commits
- Data speculation support
  - Wisconsin Multiscalar
  - Hydra provides low-overhead support for a CMP
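To make this concrete, here is a small example of my own (not from the slides): a loop that a C compiler cannot safely parallelize, because the two pointers might alias, but that data speculation can run in parallel, squashing and re-executing only the iterations where a dependency actually materializes.

    /* Hypothetical example. If `dst` and `src` may alias, static
     * analysis must assume a loop-carried dependency and keep this
     * loop sequential. Under thread-level speculation, iterations
     * run in parallel; the hardware detects a real read-too-early
     * at run time and restarts only the affected iteration. */
    void scale(int *dst, int *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2;    /* one speculative thread per iteration */
    }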


Data Speculation Requirements I

- Forward data between parallel threads
- Detect violations when reads occur too early (see the sketch below)
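As a rough software model of these two requirements (the names are invented by me, not Hydra's actual structures): per-cache-line "read" bits record which threads consumed a line, and a store by a logically earlier thread to a line already read by a later thread signals a violation.

    #include <stdbool.h>

    #define NUM_CPUS 4

    /* Simplified per-cache-line speculation state. */
    typedef struct {
        bool read_by[NUM_CPUS];  /* set when CPU t speculatively reads the line */
    } line_state;

    /* A speculative load records that this thread consumed the line. */
    void spec_load(line_state *l, int cpu) { l->read_by[cpu] = true; }

    /* A store by `writer` violates any logically later thread that has
     * already read the line: that thread used a stale value and must
     * restart. Returns true if any thread must restart. */
    bool spec_store(line_state *l, int writer)
    {
        bool violation = false;
        for (int t = writer + 1; t < NUM_CPUS; t++)
            if (l->read_by[t])
                violation = true;
        return violation;
    }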

Data Speculation Requirements II


[Figure: two timelines. "Writes after violations": iteration i's writes of A and B straddle a too-early read of X in iteration i+1, so iteration i+1's speculative state is trashed. "Writes after successful iterations": successive writes of X retire in sequential order into permanent state]

- Safely discard bad state after a violation
- Correctly retire speculative state

Data Speculation Requirements III

- Maintain multiple views of memory

Hydra Speculation Support


[Block diagram: the base Hydra design extended with a CP2 speculation coprocessor on each CPU, speculation bits in each L1 data cache, and speculation write buffers (#0-#3) with retire logic in front of the shared on-chip L2 cache]

- The write bus and L2 buffers provide forwarding
- "Read" L1 tag bits detect violations
- "Dirty" L1 tag bits and write buffers provide backup
- Write buffers reorder and retire speculative state
- Separate L1 caches with pre-invalidation and smart L2 forwarding maintain each CPU's view of memory
- Speculation coprocessors control the threads

Speculative Reads
[Figure: the reading CPU #i ("me") alongside the non-speculative head CPU #i-2, a speculative earlier CPU #i-1, and a speculative later CPU #i+1; each CPU has a write buffer (labeled A for CPU #i's own buffer back through C for the head's) in front of the L2 cache]

- L1 hit: the data is returned and the line's read bits are set
- L1 miss: the L2 and the write buffers are checked in parallel; the newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D), as sketched below
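A sketch of that per-byte priority merge, under simplified assumptions of my own (the array layout and names are invented): bufs[0] is the reading CPU's own write buffer (priority A), followed by the logically earlier CPUs' buffers, with the L2 copy as the fallback for bytes no buffer holds.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES 32

    /* Merge a cache line for a speculative reader. `bufs[0]` has the
     * highest priority (A); `valid[k][b]` means buffer k holds a newer
     * copy of byte b than the L2 does. */
    void merge_line(uint8_t out[LINE_BYTES],
                    const uint8_t l2[LINE_BYTES],
                    const uint8_t *bufs[], const bool *valid[], int nbufs)
    {
        for (int b = 0; b < LINE_BYTES; b++) {
            out[b] = l2[b];                 /* committed state by default */
            for (int k = nbufs - 1; k >= 0; k--)
                if (valid[k][b])
                    out[b] = bufs[k][b];    /* higher-priority buffer overrides */
        }
    }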

Speculative Writes

- A CPU writes to its L1 cache and write buffer
- Earlier CPUs invalidate our L1 and cause RAW hazard checks
- Later CPUs just pre-invalidate our L1
- The non-speculative write buffer drains out into the L2

Speculation Runtime System

Software Handlers

- Control speculative threads through the CP2 interface
- Track the order of all speculative threads
- Exception routines recover from data dependency violations (see the sketch below)
- Adds more overhead to speculation than a pure hardware approach, but is more flexible and simpler to implement
- Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99)
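For flavor only, a minimal sketch of what a software violation handler might do; every name here is hypothetical, and the real handlers are described in the two papers cited above.

    #include <stdio.h>

    typedef struct { int cpu; int restarts; } thread_ctx;

    /* Hypothetical stand-ins for operations issued over the CP2 interface. */
    static void discard_write_buffer(int cpu)   { printf("cpu%d: drop buffer\n", cpu); }
    static void clear_speculation_bits(int cpu) { printf("cpu%d: clear bits\n", cpu); }

    /* Invoked when the hardware flags a dependency violation: throw away
     * the speculative state, then count a restart so the runtime can
     * re-dispatch the thread from its starting point. */
    static void violation_handler(thread_ctx *t)
    {
        discard_write_buffer(t->cpu);
        clear_speculation_bits(t->cpu);
        t->restarts++;
    }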

Creating Speculative Threads

- Speculative loops
  - for and while loop iterations
  - Typically one speculative thread per iteration
- Speculative procedures
  - Execute the code after a procedure call speculatively
  - Procedure calls generate a speculative thread
- Compiler support
  - C source-to-source translator
  - pfor, pwhile
  - Analyzes the loop body and globalizes any local variables that could cause loop-carried dependencies (see the example below)
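An illustrative before/after pair, with the caveat that this is my guess at the translator's output: pfor is the construct named on the slide, but its exact syntax and the shape of the transformed code are assumptions. A stack local that carries a value across iterations is globalized so the dependency lives in memory, where speculative forwarding and violation detection can see it.

    /* Before: `sum` is a stack local, invisible to the memory system. */
    int total(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* After (sketch): `sum` is globalized, and each pfor iteration
     * becomes one speculative thread. */
    int sum_g;
    int total_spec(const int *a, int n)
    {
        sum_g = 0;
        pfor (int i = 0; i < n; i++)   /* translator construct, not plain C */
            sum_g += a[i];
        return sum_g;
    }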

Base Speculative Thread Performance


[Chart: speedup (0 to 4) of base speculative execution on m88ksim, ijpeg, cholesky, simplex, sparse1.3, mpeg2, eqntott, alvin, compress, grep, ear, and wc]

- Entire applications
- GCC 2.7.2 -O2
- 4 single-issue processors
- Accurate modeling of all aspects of the Hydra architecture and the real runtime system

Improving Speculative Runtime System

- Procedure support adds overhead to loops
  - Threads are not created sequentially, so dynamic thread scheduling is necessary
  - Start and end of loop: 75 cycles
  - End of iteration: 80 cycles
- Performance
  - The best-performing speculative applications use loops
  - Procedure speculation often lowers performance
- Need to optimize the RTS for the common case: lower speculative overheads
  - Start and end of loop: 25 cycles
  - End of iteration: 12 cycles (almost a factor of 7)
  - Limit procedure speculation to specific procedures


Improved Speculative Performance


[Chart: speedup (0 to 4) of the base vs. optimized RTS on m88ksim, ijpeg, cholesky, sparse1.3, compress, simplex, mpeg2, eqntott, alvin, grep, ear, and wc]

- Improves the performance of all applications
- Most improvement for applications with fine-grained threads
- eqntott uses procedure speculation

Optimizing Parallel Performance

- Cache-coherent shared memory
  - No explicit data movement
  - 100+ cycle communication latency
  - Need to optimize for data locality
  - Look at cache misses (MemSpy, Flashpoint)
- Speculative threads
  - No explicit data independence
  - Frequent dependence violations limit performance
  - Need to optimize to reduce the frequency and impact of violations
  - Dependence prediction can help
  - Look at violation statistics (requires some hardware support)

Feedback and Code Transformations

- Feedback tool
  - Collects violation statistics (PCs, frequency, work lost)
  - Correlates read and write PC values with source code
- Synchronization
  - Synchronize frequently occurring violations
  - Use non-violating loads (see the sketch below)
- Code motion
  - Find dependent load-store pairs
  - Move loads down in the thread
  - Move stores up in the thread
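A small sketch of the synchronization transformation, with names of my own: when a particular load almost always violates, it is cheaper to wait explicitly for the producing store than to speculate past it and restart the whole thread.

    /* Iteration i stores x and then sets the flag; iteration i+1 spins
     * on the flag instead of loading x too early and taking a
     * near-certain violation. */
    volatile int x_ready;
    long x;

    long read_x_synchronized(void)
    {
        while (!x_ready)
            ;                /* wait for the earlier thread's store */
        return x;
    }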

Code Motion

- Rearrange reads and writes to increase parallelism
  - Delay reads and advance writes (see the example below)
  - Create local copies to allow earlier data forwarding

[Figure: iterations i and i+1 before and after the transformation; advancing the write of x and delaying the read of x lets iteration i+1 receive the forwarded value in time]
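A concrete sketch of the transformation; the function names are placeholders of mine. Advancing the store of the shared variable to the moment its value is known means later iterations receive the forwarded value before they need it.

    extern long f(long), g(long), slow_work(long);
    long x;   /* shared across iterations: the loop-carried dependency */

    /* Before: x is read at the top and written at the very end, so
     * iteration i+1 almost always reads x too early and violates. */
    void iter_before(long *out, int i)
    {
        long a = f(x);
        out[i] = slow_work(a);   /* long computation in the middle */
        x = g(a);                /* late write */
    }

    /* After code motion: the store is advanced to just after `a` is
     * computed; the long computation uses only local copies. */
    void iter_after(long *out, int i)
    {
        long a = f(x);
        x = g(a);                /* early write, earlier forwarding */
        out[i] = slow_work(a);
    }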

Optimized Speculative Performance


[Chart: speedup (0 to 4) on m88ksim, ijpeg, cholesky, simplex, sparse1.3, mpeg2, eqntott, alvin, compress, grep, ear, and wc, comparing three configurations]

- Base performance
- Optimized RTS with no manual intervention
- Violation statistics used to manually transform the code

Size of Speculative Write State


[Chart: maximum number of 32-byte cache lines of speculative write state for compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg, alvin, cholesky, ear, and simplex; values range from 4 to 158 lines]

- The maximum size determines the write buffer size needed for maximum performance
- A non-head processor stalls when its write buffer fills up
- Small write buffers (< 64 lines) will achieve good performance

Hydra Prototype

- Design based on the Integrated Device Technology (IDT) RC32364
- 88 mm² in a 0.25 µm process, with 8 KB L1 instruction and 8 KB data caches per CPU and a 128 KB L2 cache

Conclusions

- Hydra offers a new way to design microprocessors
  - A single-chip MP exploits parallelism at all levels
  - Low-overhead support for speculative parallelism
  - Provides high performance on applications with medium- to large-grain parallelism
  - Allows a performance-optimization migration path for difficult-to-parallelize, fine-grain applications
- Prototype implementation
  - Work out implementation details
  - Provide a platform for application and compiler development
  - Realistic performance evaluation

Hydra Team

Team: Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim and Maciek Kozyrczak (IDT)

URL: http://www-hydra.stanford.edu
