Limitation of ILP

Limitations to ILP
Chapter 3.8, 3.9, 3.11-3.13
How much ILP exists?

Set up an experiment that removes all
obstacles from obtaining ILP
Gradually add back obstacles to see how
much each one matters
Ideal Machine
Results
[Figure: instructions issued per cycle on the ideal machine for tomcatv, doduc, fpppp, li, espresso, and gcc; axis ticks 50 to 200]
Observations
First three are floating-point programs
data-intensive, loop-intensive
Last three are integer programs

Integer programs have less ILP than floating-point programs
Range is 18-150, much greater than current
machines
Current 4-wide machines often only net a
utilization of 50%. Why? Where can we
improve?
Window Size & Issue Width

Window Size
Issue Width
Limited Window Size
[Figure: results for tomcatv, doduc, fpppp, espresso, gcc, and li at window sizes Infinite, 2K, 512, 128, and 32; axis 0 to 160]
Observations
Dramatic drop very quickly

Still well above current issue widths
Current window sizes 32-126
Remember: Still perfect branch, cache,
etc.
For the rest of the experiments:
2k window size
64 instructions/cycle
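
A minimal sketch of how a limited window cuts ILP, under assumptions that are not from the slides: unit latency, only true (RAW) dependences, unlimited issue width, and a sliding window in which instruction i may enter only after every instruction up to i - window has issued. The function name `ilp` and the toy trace are hypothetical.

```python
def ilp(deps, window=None):
    """Instructions per cycle for a trace under an idealized model.

    deps[i] lists the indices of earlier instructions whose results
    instruction i reads. window=None means an infinite window.
    """
    issue = []        # cycle in which each instruction issues
    oldest_done = 0   # latest issue cycle among instructions that left the window
    for i, producers in enumerate(deps):
        # Earliest cycle allowed by data dependences (unit latency).
        ready = 1 + max((issue[p] for p in producers), default=0)
        if window is not None and i >= window:
            # Instruction i enters only after instructions 0..i-window issued.
            oldest_done = max(oldest_done, issue[i - window])
            ready = max(ready, oldest_done + 1)
        issue.append(ready)
    return len(deps) / max(issue)


# Toy trace: 8 independent loop iterations, each a chain of 4 dependent instructions.
trace = []
for k in range(8):
    for j in range(4):
        trace.append([] if j == 0 else [4 * k + j - 1])

for w in (None, 16, 8, 4):
    print(w, ilp(trace, w))
```

With an infinite window all eight chains overlap; as the window shrinks, fewer independent iterations are visible at once and the measured ILP falls.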
Branch Prediction
[Figure: results for tomcatv, doduc, fpppp, espresso, gcc, and li with Perfect, Tournament, 2-bit, Static, and None branch prediction; axis 0 to 70]
Observations
No misprediction penalty is charged; a
misprediction only reduces ILP
tomcatv & fpppp
Results
[Figure: per-benchmark branch predictor results for gcc, espresso, li, fpppp, doduc, and tomcatv with tournament, 2-bit counter, and static (profile) predictors; axis 0 to 100]
Observations
Tomcatv has insanely high accuracy
integer programs generally lower than
scientific (floating-point)
For the rest, use a larger tournament
predictor than shown in this picture (and
2k window size, 64 issue width)
Extra registers for renaming
[Figure: results for tomcatv, doduc, fpppp, espresso, gcc, and li with Infinite, 256, 128, 64, 32, and None extra registers; axis 0 to 70]
Observations
tomcatv & fpppp more sensitive to number
of registers. Why?
For rest of results, 256 integer & 256 FP
regs
Alpha 21264: 41 integer & 41 FP (+ 32 of
each in ISA)
Which do you think made more difference:
regs or branch prediction?
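
A minimal sketch of what the extra registers are used for, assuming an unlimited pool of physical registers (the experiments above vary that pool size; this sketch does not model running out). The function name `rename` and the `p0, p1, ...` register names are hypothetical.

```python
def rename(instrs):
    """Rename destination registers to fresh physical registers.

    instrs: list of (dest, src1, src2) architectural register names.
    Returns the same instructions with WAR and WAW hazards removed;
    only true (RAW) dependences remain.
    """
    mapping = {}     # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for dest, *srcs in instrs:
        # Sources read the most recent mapping (or the original name if never written).
        new_srcs = [mapping.get(s, s) for s in srcs]
        phys = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = phys
        renamed.append((phys, *new_srcs))
    return renamed


code = [
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),   # RAW on r1
    ("r1", "r6", "r7"),   # WAW with instruction 0, WAR with instruction 1
    ("r8", "r1", "r4"),
]
for line in rename(code):
    print(line)
```

After renaming, the third instruction writes `p2` and no longer has to wait for the first two, so it can issue in parallel with them.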
Alias Analysis
Memory Disambiguation
Global/stack perfect
perfectly predicts global & stack
assumes all heap refs conflict
Inspection: compile time
different constant offsets from same register
pointers assigned to different areas
(heap, stack)
similar to current compilers
None: assume all conflict
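
A minimal sketch of the "inspection" rules listed above, under simplifying assumptions that are not in the slides: the base register is not modified between the two references, accesses at different offsets do not overlap, and each pointer's region is already known. `MemRef` and `may_conflict` are hypothetical names.

```python
from typing import NamedTuple


class MemRef(NamedTuple):
    base: str     # base register name
    offset: int   # constant offset from the base register
    region: str   # "heap", "stack", "global", or "unknown"


def may_conflict(a, b):
    """Conservative answer: True unless inspection proves independence."""
    # Rule 1: different constant offsets from the same register.
    if a.base == b.base and a.offset != b.offset:
        return False
    # Rule 2: pointers known to refer to different areas (e.g. heap vs stack).
    if "unknown" not in (a.region, b.region) and a.region != b.region:
        return False
    # Otherwise assume the references conflict.
    return True


print(may_conflict(MemRef("r1", 0, "stack"), MemRef("r1", 8, "stack")))
print(may_conflict(MemRef("r1", 0, "heap"), MemRef("r2", 0, "stack")))
print(may_conflict(MemRef("r1", 0, "heap"), MemRef("r2", 0, "heap")))
```

The last case shows why heap-heavy integer programs suffer: two heap pointers in different registers cannot be separated by inspection alone.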
Alias Analysis
Memory Disambiguation
[Figure: results for tomcatv, doduc, fpppp, espresso, gcc, and li with Perfect, Global/Stack Perfect, Inspection, and None alias analysis; axis 0 to 60]
Observations
Compiler is not good enough
Scientific programs have few heap
references
Conclusions: Perfect & realistic machines
WAR and WAW hazards through memory
Unnecessary dependencies
loop counter
return address storing, restoring in calls
RAW hazards
value prediction
Conclusions: realistic machines
Branch prediction is critical
scientific programs do pretty well
memory disambiguation is hard
Realistic machines
Address value prediction and speculation
predict the address
reorder if two predicted addresses don't
match
Speculating on multiple paths

exponential fan-out with each branch
Cannot do every branch
How do we choose which branches?
An alternate approach
Accept that there will be delays in processing
Give the processor something else to do while it is waiting
Let two threads share the same machine
TLP vs ILP
What identifies a thread?

Separate PC
Separate registers
Separate or shared memory
Does this buy us anything?

Still limited by window size
Still limited by # of functional units
Only difference is two separate threads of
execution to choose from.
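
A deliberately simplified sketch of the point above: with two threads to choose from, more issue slots get filled, but the issue width still caps each cycle. The per-cycle counts of ready instructions are made up for illustration, and instructions left unissued are not carried over to the next cycle.

```python
def slots_filled(ready_counts, width):
    """Issue slots used when at most `width` instructions issue per cycle."""
    return sum(min(ready, width) for ready in ready_counts)


WIDTH = 4
# Hypothetical number of ready instructions per cycle for each thread.
thread_a = [3, 0, 0, 2, 0, 4]
thread_b = [1, 2, 0, 0, 3, 1]

alone = slots_filled(thread_a, WIDTH)
shared = slots_filled([a + b for a, b in zip(thread_a, thread_b)], WIDTH)
total = WIDTH * len(thread_a)

print(f"thread A alone: {alone}/{total} slots")
print(f"two threads:    {shared}/{total} slots")
```

In the last cycle five instructions are ready but only four issue, and in the third cycle neither thread has work, so sharing helps without removing either limit.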
Advantages
What is the difference between filling the
window with instructions from two threads
and instructions from one thread?
What happens when one thread has a cache miss?
What does that tell us about how much
speculation we should do?
Different Perspectives
SMT allows one thread to get work done
while the other thread is waiting for
something
SMT is a parallel machine that allows
dynamic resource allocation
SMT is a parallel machine that is
unnecessarily large and complex, and
parallel threads would be better off on
separate, simpler processing elements.
Disadvantages
Interference
Clock rate
Fallacies
Processors with lower CPIs will always be
faster
Processors with faster clock rates will
always be faster
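
Both fallacies follow from CPU time = instruction count x CPI / clock rate: neither CPI nor clock rate decides performance alone. A small worked example with made-up numbers (the three machines are hypothetical):

```python
def cpu_time(instructions, cpi, clock_hz):
    """Execution time in seconds: instruction count x CPI / clock rate."""
    return instructions * cpi / clock_hz


N = 1_000_000_000  # same program on every machine

base = cpu_time(N, cpi=1.0, clock_hz=2_000_000_000)        # baseline machine
low_cpi = cpu_time(N, cpi=0.8, clock_hz=1_000_000_000)     # lower CPI, slower clock
fast_clock = cpu_time(N, cpi=2.0, clock_hz=3_000_000_000)  # faster clock, higher CPI

print(base, low_cpi, fast_clock)
```

Here the lower-CPI machine and the faster-clock machine both take longer than the baseline.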
Pitfall
Improving CPI by increasing width but
sacrificing ______________
Improving only one aspect of a multiple-issue processor and expecting overall
performance improvement
Sacrificing complexity for space
