PENTIUM
Overview
http://developer.intel.com/software/products/itc/architec/ia32/pentdown.htm
http://bwrc.eecs.berkeley.edu/CIC/archive/cpu_history.html
http://bwrc.eecs.berkeley.edu/CIC/die_photos/
http://or1cedar.intel.com/media/training/intro_ht_dt_v1/tutorial/index.htm
http://bwrc.eecs.berkeley.edu/CIC/
>10^4 increase in transistor count and clock frequency over 30 years!
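As a quick check on what a >10^4 increase over 30 years means per year, the sketch below computes the constant yearly growth factor and the implied doubling time. It assumes exactly 10^4 over exactly 30 years (the slide only gives a lower bound); `annual_growth` and `doubling_time` are illustrative helper names.

```python
import math


def annual_growth(total_factor: float, years: float) -> float:
    """Constant yearly factor g such that g ** years == total_factor."""
    return total_factor ** (1.0 / years)


def doubling_time(growth: float) -> float:
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(growth)


# Assumption: exactly 10^4 in exactly 30 years.
g = annual_growth(1e4, 30)
print(f"about {100 * (g - 1):.0f}% per year, doubling every {doubling_time(g):.1f} years")
```

Under that assumption the growth is roughly 36% per year, i.e. a doubling about every 2.3 years.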
X86 μP Operating Modes
[Figure: x86 operating modes, 8086 (MIN) and 80286]
How to Improve Processors?
The Heat Problem
[Figure: power density (Watts/cm2, log scale from 1 to 1000) versus process technology (1.5 μm down to 0.07 μm) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4 (Willamette) and Pentium 4 (Prescott), compared with a hot plate, a nuclear reactor and a rocket nozzle]
• Heat sinks in 6XX series Pentium 4s
Increasing Frequency
Microarchitecture Trends
Adapted from Johan De Gelas, Quest for More Processing Power, AnandTech, Feb. 8, 2005.
Larger Caches
On-chip caches ameliorate the growing disparity between processor speed
and memory latency and bandwidth
On-chip caches will continue to increase in size and help mitigate disparities
in computer subsystem performance
[Figure: transistors per die (log scale, 10^9 to 10^11), with memory generations 256M, 512M, 1G, 2G and 4G]
The Personal Computer Architecture
[Figure: personal computer block diagram]
Microarchitecture + ISA (instruction set architecture) = COMPUTER ARCHITECTURE
http://www.tomshardware.com/reviews/intel-cpu-history,1986-6.html
Pentium main features
- 0.8 micron technology
- 60 MHz clock frequency
- 100 MIPS
- 100% compatibility with earlier generations
- 32-bit registers
- 32-bit address bus
- 64-bit data bus
- superscalar architecture
- 2 integer pipelines (u, v), each with its own ALU
- executes 2 simple instructions per clock
- 2 on-chip caches (8 KB code + 8 KB data)
- fast FPU (look-up table)
- branch prediction (BTB)
[Figure: Pentium die photo with numbered functional blocks]
http://bwrc.eecs.berkeley.edu/CIC/die_photos/pentium.gif
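Some of the figures in the feature list can be related with simple arithmetic. Two pipelines that each complete at most one simple instruction per clock give an ideal peak of 2 × 60 MHz = 120 million instructions per second, so the quoted 100 MIPS sits below that peak (real code does not always pair). The sketch also counts cache lines, using the 32-byte line size stated in the bus section of these slides; `peak_mips` and `cache_lines` are hypothetical helper names.

```python
def peak_mips(clock_mhz: int, instr_per_clock: int) -> int:
    """Ideal peak throughput in MIPS: every clock completes `instr_per_clock` instructions."""
    return clock_mhz * instr_per_clock


def cache_lines(cache_bytes: int, line_bytes: int) -> int:
    """Number of lines in a cache of `cache_bytes` bytes with `line_bytes` per line."""
    return cache_bytes // line_bytes


print(peak_mips(60, 2))           # 120: two pipelines (u, v) at 60 MHz
print(cache_lines(8 * 1024, 32))  # 256: one 8 KB cache with 32-byte lines
```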
1. Register Set
- General-purpose registers (EAX, EBX, …, ESI, EDI)
- Segment registers (ES, DS, CS, SS, FS, GS)
- Control registers (CR0-CR4)
- Memory management registers
- Debug registers (DR0-DR7), test registers
- EFLAGS register
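Instruction Pointer (EIP): bits 31-0; bits 15-0 form the 16-bit IP
EFLAGS Register: bits 31-0; bits 15-0 form the 16-bit FLAGS
General-Purpose Registers (bits 31-16 | 15-8 | 7-0):
EAX: AX = AH (bits 15-8) : AL (bits 7-0)
EBX: BX = BH : BL
ECX: CX = CH : CL
EDX: DX = DH : DL
ESI: SI (bits 15-0)
EDI: DI (bits 15-0)
EBP: BP (bits 15-0)
ESP: SP (bits 15-0)
Segment Registers (16 bits, 15-0): CS, SS, DS, ES, FS, GS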
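The overlapping register names in the figure (EAX, AX, AH, AL) are bit-field views of one 32-bit register. A minimal sketch, with a Python integer standing in for the hardware register; `views`, `write_ah` and `write_ax` are hypothetical helper names. The partial writes preserve the remaining bits, which is what 8086-compatible code relies on.

```python
MASK32 = 0xFFFF_FFFF


def views(eax: int) -> dict:
    """Return the 32-, 16- and 8-bit views of a register such as EAX."""
    eax &= MASK32
    ax = eax & 0xFFFF                # bits 15..0
    return {"EAX": eax, "AX": ax, "AH": ax >> 8, "AL": ax & 0xFF}


def write_ah(eax: int, value: int) -> int:
    """Write AH (bits 15..8); all other bits of EAX are preserved."""
    return (eax & ~0xFF00 & MASK32) | ((value & 0xFF) << 8)


def write_ax(eax: int, value: int) -> int:
    """Write AX (bits 15..0); bits 31..16 of EAX are preserved."""
    return (eax & 0xFFFF_0000) | (value & 0xFFFF)


r = views(0x12345678)
print({name: hex(v) for name, v in r.items()})
```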
[Figure: control registers CR0-CR3, debug registers DR0-DR6 and test registers TR6, TR7 and TR12]
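EFLAGS Register layout
Bits 31-22: reserved (0)
Bit 21 ID, bit 20 VIP, bit 19 VIF, bit 18 AC, bit 17 VM, bit 16 RF: X
Bit 15: reserved (0)
Bit 14 NT, bits 13-12 IOPL: X
Bit 11 OF: S
Bit 10 DF: C
Bit 9 IF, bit 8 TF: X
Bit 7 SF, bit 6 ZF: S
Bit 5: reserved (0)
Bit 4 AF: S
Bit 3: reserved (0)
Bit 2 PF: S
Bit 1: reserved (1)
Bit 0 CF: S
S = Status Flag
C = Control Flag
X = System Flag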
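Decoding an EFLAGS value is a matter of testing the bit positions in the layout above. A minimal sketch, assuming the value is available as a plain integer (for example one pushed by PUSHFD); `FLAGS`, `decode_eflags` and `set_flags` are illustrative names, and the S/C/X types follow the legend.

```python
# Single-bit EFLAGS flags: bit position -> (name, type),
# with S = status, C = control, X = system.
FLAGS = {
    0: ("CF", "S"), 2: ("PF", "S"), 4: ("AF", "S"), 6: ("ZF", "S"),
    7: ("SF", "S"), 8: ("TF", "X"), 9: ("IF", "X"), 10: ("DF", "C"),
    11: ("OF", "S"), 14: ("NT", "X"), 16: ("RF", "X"), 17: ("VM", "X"),
    18: ("AC", "X"), 19: ("VIF", "X"), 20: ("VIP", "X"), 21: ("ID", "X"),
}


def decode_eflags(value: int) -> dict:
    """Value (0/1) of every single-bit flag, plus the 2-bit IOPL field (bits 13..12)."""
    flags = {name: (value >> bit) & 1 for bit, (name, _kind) in FLAGS.items()}
    flags["IOPL"] = (value >> 12) & 0b11
    return flags


def set_flags(value: int, kind: str) -> list:
    """Names of the flags of one type ('S', 'C' or 'X') that are set in `value`."""
    return [name for bit, (name, k) in FLAGS.items() if k == kind and (value >> bit) & 1]


print(decode_eflags(0x246))
print(set_flags(0x246, "S"))
```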
[Figure: basic instruction pipeline: the PC selects instructions I-1 to I-4 in program memory (fetch), followed by instruction decode, operand read (op1, op2 from registers), execute (ALU, flags) and register write of the output]
[Figure: timing diagrams for a six-stage pipeline S1-S6, cycles running downward. Without pipelining, I-1 occupies stages S1 to S6 in cycles 1 to 6 before the next instruction starts; with pipelining, I-2 enters S1 in cycle 2 while I-1 moves to S2.]
Problem: the execute stage S4 needs two cycles per instruction, so the instructions behind it stall and I-1 to I-3 complete only in cycle 11.
Solution: S4 is duplicated into two parallel execute units, u and v; consecutive instructions alternate between them, and I-1 to I-4 complete by cycle 10.
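The Problem/Solution timing diagrams can be reproduced with a small timing model. This is a sketch under stated assumptions, not a cycle-accurate Pentium model: stages S1-S3 and S5-S6 take one cycle, S4 takes `exe_cycles` cycles and has `exe_units` identical units, and only the S4 bottleneck is modelled. With one two-cycle execute unit the third instruction leaves S6 in cycle 11; with two units (u and v) the fourth leaves in cycle 10, matching the cycle counts in the diagrams. `completion_cycles` is a hypothetical name.

```python
def completion_cycles(n, exe_cycles=2, exe_units=1, front=3, back=2):
    """Cycle in which each of `n` in-order instructions leaves the last stage (S6).

    Assumptions: the `front` stages (S1-S3) and `back` stages (S5-S6) take one
    cycle each; S4 takes `exe_cycles` cycles and has `exe_units` identical units;
    S5 accepts one instruction per cycle, so completion stays in order.
    """
    unit_free = [1] * exe_units   # first cycle in which each execute unit is idle
    prev_s5 = 0                   # cycle the previous instruction spent in S5
    done = []
    for i in range(n):
        arrive = i + 1 + front    # cycle instruction i reaches S4 if nothing stalls
        unit = min(range(exe_units), key=lambda k: unit_free[k])
        start = max(arrive, unit_free[unit])
        unit_free[unit] = start + exe_cycles
        s5 = max(start + exe_cycles, prev_s5 + 1)
        prev_s5 = s5
        done.append(s5 + back - 1)
    return done


print(completion_cycles(3))                # [7, 9, 11]   Problem: one 2-cycle unit
print(completion_cycles(4, exe_units=2))   # [7, 8, 9, 10] Solution: u and v units
print(completion_cycles(4, exe_cycles=1))  # [6, 7, 8, 9]  ideal: k + n - 1 = 9
```

The last line shows the ideal case: with a one-cycle execute stage, n instructions through a k-stage pipeline finish in k + n - 1 cycles instead of n × k.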
PFetch
• Moves 16 bytes of the instruction stream into the code queue
• Not required every time
– About 5 instructions are fetched at once
– Only useful if the code does not branch
D1
• Determine total instruction length
– Signals the code queue aligner where the next instruction begins
• May require two cycles
– When multiple operands must be decoded
– About 6% of instructions in a “typical” DOS program
D2
• Extract memory displacements / immediate operands
• Compute memory addresses
– Add the base register and, possibly, a scaled index register
• May require two cycles
– If an index register is involved, or if the instruction has both an address displacement and an immediate operand
– Approx. 5% of executed instructions
EX
• Read register operands
• Compute ALU function
• Read or write memory (data cache)
WB
• Update register result/state
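The D2 address computation is base + index × scale + displacement. A minimal sketch, assuming a flat memory model (the segment base is not added) and 32-bit wraparound; `effective_address` and `d2_cycles` are hypothetical names, and `d2_cycles` encodes only the simplified two-cycle rule stated on the slide, not the full Pentium timing rules.

```python
VALID_SCALES = (1, 2, 4, 8)


def effective_address(base: int = 0, index: int = 0, scale: int = 1, disp: int = 0) -> int:
    """32-bit offset = base + index * scale + displacement, wrapping modulo 2**32.

    Assumption: flat memory model, so no segment base is added here.
    """
    if scale not in VALID_SCALES:
        raise ValueError(f"scale must be one of {VALID_SCALES}")
    return (base + index * scale + disp) & 0xFFFF_FFFF


def d2_cycles(has_index: bool, has_disp: bool, has_imm: bool) -> int:
    """Slide's rule: a second D2 cycle if an index register is used,
    or if both a displacement and an immediate operand are present."""
    return 2 if has_index or (has_disp and has_imm) else 1


# mov eax, [ebx + esi*4 + 8] with EBX = 0x1000 and ESI = 3
print(hex(effective_address(base=0x1000, index=3, scale=4, disp=8)))  # 0x1014
```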
- Whether two instructions execute simultaneously (paired in the u and v pipes) or sequentially is decided in the D1 phase
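The D1 pairing decision can be sketched as a check on two consecutive instructions. The real Pentium pairing rules are more detailed; this sketch keeps only two of them, under stated assumptions: both instructions must be "simple" (given as an input flag here), and the second must not read or write a register written by the first. Flags are not modelled. `Instr`, `can_pair` and `issue` are hypothetical names; the example reuses instructions from the BTB program later in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    simple: bool        # assumption: "simple" status is given, not decoded
    reads: frozenset    # registers read
    writes: frozenset   # registers written (flags not modelled)


def can_pair(first: Instr, second: Instr) -> bool:
    """Simplified D1 check: `first` would go to the u pipe, `second` to the v pipe."""
    if not (first.simple and second.simple):
        return False
    if first.writes & second.reads:    # read-after-write dependency
        return False
    if first.writes & second.writes:   # write-after-write dependency
        return False
    return True


def issue(program):
    """Group in-order instructions into (u, v) slots; v is None when unpaired."""
    slots, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            slots.append((program[i].text, program[i + 1].text))
            i += 2
        else:
            slots.append((program[i].text, None))
            i += 1
    return slots


def regs(*names):
    return frozenset(names)


prog = [
    Instr("add ax,cx", True, regs("ax", "cx"), regs("ax")),
    Instr("cmp ax,0", True, regs("ax"), regs()),
    Instr("sub cx,2", True, regs("cx"), regs("cx")),
    Instr("mov dx,bx", True, regs("bx"), regs("dx")),
]
print(issue(prog))
```

Here `cmp ax,0` cannot pair with `add ax,cx` because it reads AX, which the ADD writes, so the pair is split and `cmp` pairs with the following `sub` instead.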
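Pentium Floating Point Pipeline
[Figure: Pentium floating-point pipeline: after D2, the stages WB/X1, X2 and WF; operands come from the register stack ST(0)-ST(7) and feed the FP adder]
• FP instructions generally cannot be paired with each other; the exception is FXCH, which can pair with certain FP instructions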
Branch Target Buffer (BTB)
• 256 entries
- Consulted in the D1 phase for near conditional jump instructions
- For every new jump, the μP stores the jump instruction address and the jump destination address
- The μP searches the BTB for the address of the current jump instruction
- If the address is in the BTB, the jump is predicted to go to the stored destination address
- Only in the execution phase is it known whether the jump is actually taken
- If the jump is taken, the prediction was correct ==> no delay; otherwise the pipelines are flushed and fetching restarts from the correct address
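The lookup/update cycle described above can be sketched as a small table keyed by the jump address. This is an illustration under stated assumptions, not the real Pentium organisation: the slot is chosen by `jump_addr % entries` (direct-mapped), a hit always means "predict taken", a taken jump is (re)inserted, and a not-taken jump that hit is evicted. `BranchTargetBuffer` and the addresses in the example are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each slot holds (jump address, destination) or None."""

    def __init__(self, entries: int = 256):
        self.entries = entries
        self.slots = [None] * entries

    def _slot(self, jump_addr: int) -> int:
        return jump_addr % self.entries

    def predict(self, jump_addr: int):
        """Lookup (D1): predicted destination on a hit, None on a miss."""
        entry = self.slots[self._slot(jump_addr)]
        if entry is not None and entry[0] == jump_addr:
            return entry[1]
        return None

    def resolve(self, jump_addr: int, taken: bool, dest_addr: int) -> bool:
        """Update once the jump is resolved in EX; True if the prediction was right."""
        predicted = self.predict(jump_addr)
        correct = (predicted == dest_addr) if taken else (predicted is None)
        if taken:
            self.slots[self._slot(jump_addr)] = (jump_addr, dest_addr)
        elif predicted is not None:
            self.slots[self._slot(jump_addr)] = None
        return correct


btb = BranchTargetBuffer()
JC_ADDR, ADD1 = 0x0104, 0x0200            # hypothetical addresses for "Jc add1"
print(btb.resolve(JC_ADDR, True, ADD1))    # False: first encounter is a BTB miss
print(btb.predict(JC_ADDR) == ADD1)        # True: the entry is now stored
print(btb.resolve(JC_ADDR, True, ADD1))    # True: correctly predicted, no delay
```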
Need Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch address (if taken)
– Note: the branch match must be checked now, since the wrong branch address cannot be used
[Figure: BTB lookup. The PC of the instruction in FETCH is compared (=?) with the stored Branch PC entries, each holding a Predicted PC and extra prediction state bits. Yes: the instruction is a branch, so use the predicted PC as the next PC. No: the branch is not predicted, so proceed normally (Next PC = PC+4).]
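The "extra prediction state bits" in the figure are commonly a 2-bit saturating counter per entry; the slide does not say which scheme is meant, so this sketch is an assumption. States 0 and 1 predict not taken, states 2 and 3 predict taken, and a single surprise outcome does not flip a strongly biased prediction. `TwoBitCounter` and `count_mispredictions` are hypothetical names; the example is a loop branch taken 9 times and then not taken once.

```python
class TwoBitCounter:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(outcomes, counter: TwoBitCounter) -> int:
    """Predict each outcome, count the wrong guesses, then train the counter."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


loop = [True] * 9 + [False]          # loop branch: taken 9 times, then falls through
c = TwoBitCounter()
print(count_mispredictions(loop, c))  # 3: two warm-up misses plus the loop exit
print(count_mispredictions(loop, c))  # 1: only the loop exit is mispredicted
```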
.Program
. . .
$      Add ax,cx
$+2    Cmp ax, 0
$+4    Jc add1      (jump if C=1)
$+6    Sub cx,2
. . .
BTB
Entry  Jump address    Dest. address
0      $+4             addr.1
1      Jump address 2  Dest. Addr.2
2      Jump address 3  Dest. Addr.3
• Branch Processing
• Update BTB
Homework: study http://www.x86.org/articles/branch/branchprediction.htm#fig3
BRANCH PREDICTION OPTIMIZATION
6. Pentium Buses
[Figure: Pentium block diagram: 64-bit bus interface, prefetch buffers, U-pipe and V-pipe integer ALUs, register set, pipelined floating-point unit (add, multiply, divide) and data cache, connected by internal paths of 32, 64 and 256 bits]
• The Pentium processors have a 64-bit data bus
– The Pentium is nevertheless a 32-bit CPU, because its registers are 32 bits wide.
– A standard single transfer cycle can read or write up to 64 bits at a time (8 bytes, the width of the data bus).
• Burst read and burst write-back cycles are supported by the Pentium processors
– Burst-mode cycles are used for cache operations and transfer 32 bytes in 4 consecutive data transfers (4 × 8 bytes = 4 × 64 bits = 256 bits).
• 32 bytes is the size of the Pentium cache line.
– For the Pentium, all cache operations are burst cycles.
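The burst arithmetic follows from the two sizes on the slide: a 32-byte line moved over an 8-byte bus takes 4 transfers, and any address splits into a line base plus a chunk within the line. A minimal sketch with illustrative helper names; it does not model the order in which the four transfers of a burst are sequenced. The last call shows an 8-byte access that straddles two lines.

```python
LINE_BYTES = 32  # Pentium cache line size, from the slide
BUS_BYTES = 8    # 64-bit data bus


def burst_transfers(line_bytes: int = LINE_BYTES, bus_bytes: int = BUS_BYTES) -> int:
    """Bus transfers needed to move one cache line."""
    return line_bytes // bus_bytes


def line_base(addr: int) -> int:
    """Address of the first byte of the cache line containing `addr`."""
    return addr & ~(LINE_BYTES - 1)


def chunk_index(addr: int) -> int:
    """Which 8-byte chunk (0..3) of its cache line `addr` falls in."""
    return (addr & (LINE_BYTES - 1)) // BUS_BYTES


def lines_touched(addr: int, size: int) -> int:
    """Number of cache lines an access of `size` bytes starting at `addr` touches."""
    return (line_base(addr + size - 1) - line_base(addr)) // LINE_BYTES + 1


print(burst_transfers())         # 4
print(hex(line_base(0x1234)))    # 0x1220
print(chunk_index(0x1234))       # 2
print(lines_touched(0x121C, 8))  # 2
```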
• Prefetch Buffers:
– Four prefetch buffers within the processor work as two independent pairs.
• When instructions are prefetched from the cache, they are placed into one set of prefetch buffers.
• The other set is used when a branch is predicted taken.
– The prefetch buffers send a pair of instructions to the instruction decoder.
• Instruction Decode Unit:
Decoding occurs in two stages: Decode1 (D1) and Decode2 (D2)
- D1 checks whether instructions can be paired
- D2 calculates the address of memory-resident operands
• Control Unit:
- This unit interprets the instruction word and microcode entry point fed to it by the Instruction Decode Unit
- It handles exceptions, breakpoints and interrupts
- It controls the integer pipelines and floating-point sequences
• Microcode ROM:
- Stores microcode sequences