
Limitations to ILP

Chapters 3.8, 3.9, 3.11-3.13

How much ILP exists?


Set up an experiment that removes all
obstacles to obtaining ILP
Gradually add back obstacles to see how
much each one matters
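
As a minimal sketch of such an experiment (not the simulator behind these slides), the Python model below treats a program as a list of instructions with true data dependences and computes instructions per cycle when nothing else limits issue. The trace format and the names `ideal_ilp` and `deps` are assumptions made for illustration.

```python
def ideal_ilp(deps):
    """Instructions per cycle when only true data dependences limit issue.

    deps[i] lists the earlier instructions whose results instruction i reads.
    Assumes unit latency and unlimited issue width (hypothetical model).
    """
    finish = []  # cycle in which each instruction completes
    for producers in deps:
        start = max((finish[p] for p in producers), default=0)
        finish.append(start + 1)
    cycles = max(finish, default=0)
    return len(deps) / cycles if cycles else 0.0


# Hypothetical traces: four independent instructions vs. a four-long chain.
independent = [[], [], [], []]
chain = [[], [0], [1], [2]]
print(ideal_ilp(independent))  # 4.0
print(ideal_ilp(chain))        # 1.0
```

Adding an obstacle back then means adding one more constraint to this scheduling loop and comparing the result against the ideal number.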

Ideal Machine

Results

[Figure: instructions issued per cycle (axis 0-200) for tomcatv, doduc, fpppp, li, espresso, and gcc]

Observations
First three are floating-point programs
data-intensive, loop-intensive

Last three are integer programs


Integer programs have less ILP than
floating-point programs
Range is 18-150, much greater than current
machines
Current 4-wide machines often reach a
utilization of only 50%. Why? Where can we
improve?

Window Size & Issue Width


Window Size

Issue Width

Limited Window Size


[Figure: instructions issued per cycle (axis 0-160) vs. window size (Infinite, 2K, 512, 128, 32) for tomcatv, doduc, fpppp, espresso, gcc, and li]

Observations

ILP drops dramatically as the window
shrinks
Still well above current issue widths
Current window sizes: 32-126
Remember: branch prediction, caches, etc.
are still perfect
For the rest of the experiments:
2K window size
64 instructions issued per cycle
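
The effect of a finite window and issue width can be sketched with a small scheduling model. This is an illustration under stated assumptions, not the study's simulator: unit latency, perfect branch prediction and caches, and a window that holds up to `window` not-yet-issued instructions in program order. The function name `windowed_ipc` and the trace format are hypothetical.

```python
def windowed_ipc(deps, window, width):
    """Instructions per cycle with a finite window and issue width.

    deps[i] lists earlier instructions that instruction i depends on.
    Assumes unit latency, perfect branch prediction, and perfect caches.
    """
    if window < 1 or width < 1:
        raise ValueError("window and width must be at least 1")
    n = len(deps)
    issue_cycle = [None] * n  # cycle in which each instruction issued
    pending = []              # instructions currently in the window
    next_fetch = 0
    cycle = 0
    issued_total = 0
    while issued_total < n:
        # Refill the window in program order.
        while len(pending) < window and next_fetch < n:
            pending.append(next_fetch)
            next_fetch += 1
        # Ready once every producer issued in an earlier cycle.
        ready = [i for i in pending
                 if all(issue_cycle[p] is not None and issue_cycle[p] < cycle
                        for p in deps[i])]
        for i in ready[:width]:
            issue_cycle[i] = cycle
            pending.remove(i)
            issued_total += 1
        cycle += 1
    return n / cycle if cycle else 0.0


# Hypothetical trace: a 4-long dependence chain followed by 4 independent ops.
trace = [[], [0], [1], [2], [], [], [], []]
print(windowed_ipc(trace, window=8, width=8))  # 2.0
print(windowed_ipc(trace, window=2, width=8))
```

With the large window the independent instructions overlap with the chain; with a window of 2 they stay out of sight until the chain has nearly drained, so the same trace takes more cycles.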

Branch Prediction
[Figure: instructions issued per cycle (axis 0-70) vs. branch predictor (Perfect, Tournament, 2-bit, Static, None) for tomcatv, doduc, fpppp, espresso, gcc, and li]

Observations
No misprediction penalty is charged;
a misprediction only reduces ILP
tomcatv & fpppp

Results
[Figure: branch prediction accuracy (axis 0-100) for gcc, espresso, li, fpppp, doduc, and tomcatv; series: tournament, 2-bit counter, static (profile)]

Observations
tomcatv has extremely high prediction
accuracy
Integer programs are generally lower than
scientific (floating-point) programs
For the rest, use a larger tournament
predictor than shown in this picture (and
2K window size, 64 issue width)
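
To make the "2-bit" predictor concrete, here is a minimal sketch of a single 2-bit saturating counter measured on one branch. It illustrates the general technique and is not the predictor configuration used for the figures; the function name `two_bit_accuracy` and the example branch history are assumptions.

```python
def two_bit_accuracy(outcomes, initial=0):
    """Fraction of outcomes predicted correctly by one 2-bit saturating counter.

    outcomes is a sequence of booleans (True = taken). Counter states 0-1
    predict not taken; states 2-3 predict taken.
    """
    counter = initial
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Saturate at 0 and 3 so one odd outcome does not flip the prediction.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes) if outcomes else 0.0


# Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(two_bit_accuracy(loop_branch))         # 0.88

# A branch that alternates every time is much harder for this scheme.
print(two_bit_accuracy([True, False] * 50))  # 0.5
```

Regular loop branches, common in the floating-point codes, are predicted well by this scheme, while irregular branches defeat it.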

Extra registers for renaming


[Figure: instructions issued per cycle (axis 0-70) vs. number of extra renaming registers (Infinite, 256, 128, 64, 32, None) for tomcatv, doduc, fpppp, espresso, gcc, and li]

Observations
tomcatv & fpppp are more sensitive to the
number of registers. Why?
For rest of results, 256 integer & 256 FP
regs
Alpha 21264: 41 integer & 41 FP (+ 32 of
each in ISA)
Which do you think made more difference:
regs or branch prediction?
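
What renaming buys can be shown with a short sketch. It assumes an unlimited pool of physical registers named `p0`, `p1`, ... and a hypothetical three-instruction sequence; real hardware has a finite pool and must stall when it runs out, which is the limit the figure varies.

```python
from itertools import count


def rename(instructions):
    """Rename (dest, sources) tuples so every write gets a fresh register.

    Assumes an unlimited supply of physical registers p0, p1, ...
    """
    fresh = (f"p{i}" for i in count())
    mapping = {}   # architectural register -> current physical register
    renamed = []
    for dest, sources in instructions:
        # Sources read the latest mapping; unmapped registers keep their name.
        new_sources = tuple(mapping.get(s, s) for s in sources)
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], new_sources))
    return renamed


code = [
    ("r1", ("r2", "r3")),   # r1 = r2 op r3
    ("r4", ("r1", "r5")),   # r4 = r1 op r5  (RAW on r1)
    ("r1", ("r6", "r7")),   # r1 = r6 op r7  (WAW with 1st, WAR with 2nd)
]
for instruction in rename(code):
    print(instruction)
# ('p0', ('r2', 'r3'))
# ('p1', ('p0', 'r5'))
# ('p2', ('r6', 'r7'))
```

After renaming, the third instruction writes `p2` and no longer conflicts with the first two; only the true (RAW) dependence through `p0` remains.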

Alias Analysis
Memory Disambiguation
Global/stack perfect
assumes all heap refs conflict
perfectly predicts global & stack

Inspection: compile time
different constant offsets from same register
pointers assigned to different areas
(heap, stack)
similar to current compilers

None: assume all conflict
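
The models above can be read as progressively weaker answers to the question "might these two memory references conflict?". The sketch below is one possible encoding, not the study's implementation. Assumptions: the `Ref` type and its fields are hypothetical, accesses are the same size and aligned (so equal addresses mean a conflict), the base register is unchanged between the two references, and under "global/stack perfect" any pair involving a heap reference is treated as conflicting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    region: str    # 'global', 'stack', or 'heap'
    base: str      # base register name
    offset: int    # constant offset from the base register
    address: int   # true run-time address, visible only to the oracle models


def may_conflict(a, b, model):
    """Return True if references a and b must be treated as conflicting."""
    if model == "perfect":
        return a.address == b.address
    if model == "global/stack perfect":
        if a.region == "heap" or b.region == "heap":
            return True                    # assumption: heap refs always conflict
        return a.address == b.address      # oracle for global and stack refs
    if model == "inspection":
        if a.base == b.base and a.offset != b.offset:
            return False                   # same register, different offsets
        if a.region != b.region:
            return False                   # pointers into different areas
        return True
    if model == "none":
        return True
    raise ValueError(f"unknown model: {model}")


# Two heap references that never actually touch the same address.
x = Ref("heap", "r1", 0, 0x1000)
y = Ref("heap", "r2", 8, 0x2000)
for model in ("perfect", "global/stack perfect", "inspection", "none"):
    print(model, may_conflict(x, y, model))
```

Only the perfect model lets these two references be reordered; every weaker model serializes them even though they are independent.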

Alias Analysis
Memory Disambiguation
[Figure: instructions issued per cycle (axis 0-60) vs. alias analysis model (Perfect, Global/Stack Perfect, Inspection, None) for tomcatv, doduc, fpppp, espresso, gcc, and li]

Observations
Compiler is not good enough
Scientific programs have few heap
references

Conclusions: perfect & realistic machines

WAR and WAW hazards through memory
Unnecessary dependencies
loop counter
return address storing, restoring in calls

RAW hazards
value prediction

Conclusions: realistic machines


Branch prediction is critical
scientific programs do pretty well
memory disambiguation is hard

Realistic machines
Address value prediction and speculation
predict the address
reorder two memory references if their
predicted addresses don't match

Speculating on multiple paths


exponential fan-out with each branch
Cannot follow both paths of every branch
How do we choose which branches?
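
The fan-out is simple arithmetic: following both directions of each of n unresolved branches means 2^n paths in flight. A tiny sketch, with the hypothetical helper name `speculative_paths`:

```python
def speculative_paths(forked_branches):
    """Paths in flight when both directions of each forked branch are followed."""
    return 2 ** forked_branches


for n in (1, 4, 8, 16):
    print(n, speculative_paths(n))
# 1 2
# 4 16
# 8 256
# 16 65536
```

Forking on only a few selected branches keeps the exponent small, which is why the choice of branches matters.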

An alternate approach
Accept that there will be delays in
processing
Give the processor something else to do
while it is waiting
Let two threads share the same machine
TLP vs ILP

What identifies a thread?


Separate PC
Separate registers
Separate or shared memory

Does this buy us anything?


Still limited by window size
Still limited by # of functional units
Only difference is two separate threads of
execution to choose from.

Advantages
What is the difference between filling the
window with instructions from two threads
and instructions from one thread?

What happens when one thread has a
cache miss?
What does that tell us about how much
speculation we should do?

Different Perspectives
SMT allows one thread to get work done
while the other thread is waiting for
something
SMT is a parallel machine that allows
dynamic resource allocation
SMT is a parallel machine that is
unnecessarily large and complex, and
parallel threads would be better off on
separate, simpler processing elements.

Disadvantages
Interference

Clock rate

Fallacies
Processors with lower CPIs will always be
faster
Processors with faster clock rates will
always be faster
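
Both fallacies follow from the standard relation CPU time = instruction count x CPI / clock rate: neither CPI nor clock rate decides performance alone. The numbers below are hypothetical and chosen only to show each claim failing.

```python
def cpu_time(instructions, cpi, clock_hz):
    """Execution time in seconds: instruction count * CPI / clock rate."""
    return instructions * cpi / clock_hz


n = 1_000_000_000  # same program on every machine (hypothetical count)

a = cpu_time(n, cpi=1.0, clock_hz=1e9)  # lowest CPI, slowest clock
b = cpu_time(n, cpi=1.5, clock_hz=2e9)  # higher CPI, faster clock
c = cpu_time(n, cpi=2.5, clock_hz=2e9)  # fast clock, much higher CPI

print(a, b, c)  # 1.0 0.75 1.25
```

Machine A has the lowest CPI yet loses to B, and machine C has a faster clock than A yet loses to A.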

Pitfall
Improving CPI by increasing width but
sacrificing ______________
Improving only one aspect of a
multiple-issue processor and expecting
overall performance improvement
Sacrificing complexity for space