
C1.

PENTIUM

Overview

http://developer.intel.com/software/products/itc/architec/ia32/pentdown.htm
http://bwrc.eecs.berkeley.edu/CIC/archive/cpu_history.html
http://bwrc.eecs.berkeley.edu/CIC/die_photos/
http://or1cedar.intel.com/media/training/intro_ht_dt_v1/tutorial/index.htm
http://bwrc.eecs.berkeley.edu/CIC/

>10^4 increase in transistor count & clock frequency over 30 years!
x86 μP Operating Modes

8086/8088: MIN and MAX hardware modes; Real Mode only (1 MB address space)
80286:     Real Mode (1 MB) and Protected Mode (16 MB)

80386 operating modes (and the processor each mode behaves like):
  Real Mode          - 16 bits - 1 MB  - as an 8086
  Protected Mode     - 16 bits - 16 MB - as an 80286
  Protected Mode     - 32 bits - 4 GB  - native 80386
  Virtual Real Mode  - 16 bits - 1 MB  - as an 8086
HOW TO IMPROVE PROCESSORS?

Means of increasing performance:
- increasing the CLK frequency
- improving the architecture
- extending the instruction set
- widening the data size (16/32/64 bits)
- increasing the cache memory capacity
The Heat Problem

Figure: power density (W/cm2, log scale from 1 to 1000) versus process feature size (1.5 µm down to 0.07 µm). Successive processors (i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4 Willamette and Prescott) climb from a few W/cm2 past the "hot plate" level toward nuclear-reactor and rocket-nozzle power densities.
• Heat sinks required on the 6xx-series Pentium 4s
Increasing Frequency
- Pentium line: from 60 MHz to 3,800 MHz in about 12 years
- Frequency scaling accounted for roughly 80% of the performance increase

Execution Optimization
• More powerful instructions
• Execution optimization: pipelining, branch prediction, superscalar execution of multiple instructions, reordering of the instruction stream, hyper-threading, multi-core, etc.

Microarchitecture Trends

Figure: performance in MIPS (log scale, 10^1 to 10^6) versus year (1980-2010):
- Pentium architecture: superscalar (era of instruction parallelism)
- Pentium Pro architecture: speculative, out-of-order execution
- Pentium 4 architecture: trace cache
- Pentium 4 and Xeon architecture with HT: multi-threaded (era of thread parallelism)
- Multi-threaded, multi-core
Adapted from Johan De Gelas, Quest for More Processing Power, AnandTech, Feb. 8, 2005.
Larger Caches
• On-chip caches ameliorate the growing disparity between processor speed and memory latency and bandwidth
• On-chip caches will continue to increase in size and help mitigate disparities in computer subsystem performance

Figure: transistors per die, 1960-2010 (log scale, 10^0 to 10^11). Memory grows from 1K to 4G bits per die; microprocessors grow from the 4004, 8080, 8086 and 80286 through the i386, i486, Pentium, Pentium II, Pentium III, Pentium 4 and Itanium. Moore's Law still holds. Source: Intel.
New Technologies for Computers

• Low-power processors
• Multi-core chips
  Architecture of dual-core chips:
  - IBM POWER5: shared 1.92 MB L2 cache
  - AMD Opteron: separate 1 MB L2 caches; CPU0 and CPU1 communicate through the SRQ (System Request Queue)
  - Intel Pentium 4 based dual-core: two processors "glued" together in one package
The Personal Computer Architecture

Block diagram: the processor (8086 ... Pentium) and its coprocessor (8087 ... 387), the timer logic (8254), the system ROM and the 640 KB DRAM all sit on the system bus (data, address and control signals). Also attached to the bus: keyboard logic (8255), DMA controller (8237), interrupt logic (8259), expansion logic and the extension slots (video card, disk controller, serial port, keyboard, ...).
x86 IA-32 Features

Microarchitecture + ISA (instruction set architecture) = COMPUTER ARCHITECTURE

Multiple data sizes and addressing methods
• Recent generations optimized for 32-bit mode
Limited number of registers
• Stack-oriented procedure calls and FP instructions
• Programs reference memory heavily (41%)
Variable-length instructions (CISC)
• The first few bytes describe the operation and operands
• The remaining bytes give immediate data and address displacements
• Average instruction length is about 2.5 bytes
x86 instruction classes:
• Data transfer
• Common ALU operations
• Control flow
• String operations
• FP, MMX and SSE instructions
• OS support
• I/O instructions
• "Exotic" instructions (DAA, XLAT, etc.)
x86 - Complex Instruction Set
CISC drawbacks:
• Most instructions are complicated enough that they have to be broken into a sequence of micro-steps
• These micro-steps are called microcode
• Microcode is stored in a ROM inside the processor core
• The microcode ROM costs access time and die area, and requires extra decode logic

RISC: "Less is More"
• RISC = Reduced Instruction Set Computer
• 20/80 rule: 20% of the instructions account for 80% of the execution time
• A sequence of simple instructions often runs faster than a single complex instruction with the same effect
RISC Background
• Reduce the instruction set to simplify decoding
  Smaller instruction set -> simpler logic -> smaller logic -> faster execution
• Eliminate microcode – hardwire all instruction execution
• Pipeline instruction decoding and execution – do more operations in parallel
• Load/store architecture – only the load and store instructions can access memory
  – All other instructions work on the processor's internal registers
  – This is necessary for single-cycle execution – the execution unit can't wait for data to be read or written
• Increase the number of internal registers (required by the load/store architecture)
• Registers are more general purpose and less tied to specific functions
• The compiler is designed along with the RISC processor
  – It has to be aware of the processor architecture to produce code that executes efficiently
PENTIUM – A RISC-Like Architecture

• The Pentium has two execution pipelines:
  - one (the v-pipe) restricted to simple instructions, and
  - one (the u-pipe) that can execute all instructions
• The simple instructions are a hardwired subset, optimized for fast execution
• Complex instructions are executed by the other engine, using microcode

http://www.tomshardware.com/reviews/intel-cpu-history,1986-6.html
Pentium Main Features
- 0.8 µm process technology
- 60 MHz clock frequency
- ~100 MIPS
- 100% compatibility with earlier x86 generations
- 32-bit registers
- 32-bit address bus
- 64-bit data bus
- superscalar architecture: two integer pipelines (u, v) with two ALUs
- executes 2 simple instructions per clock
- two 8 KB caches (code + data)
- fast, pipelined FPU (look-up tables)
- branch prediction (BTB)

Die photo (numbered blocks correspond to the features above):
http://bwrc.eecs.berkeley.edu/CIC/die_photos/pentium.gif
1. Register Set
- General-purpose registers (EAX, EBX, ... ESI, EDI)
- Segment registers (CS, DS, ES, SS, FS, GS)
- Control registers (CR0-CR4)
- Memory-management registers
- Debug registers (DR0-DR7) and test registers
- EFLAGS register
Instruction Pointer and EFLAGS (32-bit; the low 16 bits form the legacy 16-bit registers):
  EIP (IP in bits 15..0), EFLAGS (FLAG in bits 15..0)

General-Purpose Registers (32-bit, with 16-bit and 8-bit sub-registers):
  EAX (AX = AH:AL), EBX (BX = BH:BL), ECX (CX = CH:CL), EDX (DX = DH:DL),
  ESI (SI), EDI (DI), EBP (BP), ESP (SP)

Segment Registers (16-bit): CS, SS, DS, ES, FS, GS
Memory-Management Registers:
  TR   - TSS selector (16-bit), TSS base address (32-bit), TSS limit
  LDTR - LDT selector (16-bit), LDT base address (32-bit), LDT limit
  IDTR - IDT base address (32-bit), IDT limit
  GDTR - GDT base address (32-bit), GDT limit

Control Registers (32-bit): CR0-CR4
Debug Registers (32-bit):   DR0-DR7
Test Registers (32-bit):    TR6, TR7, TR12
EFLAGS bit layout (bit 31 ... bit 0):
  reserved | ID | VIP | VIF | AC | VM | RF | 0 | NT | IOPL (2 bits) | OF | DF | IF | TF | SF | ZF | 0 | AF | 0 | PF | 1 | CF
Bit positions shown as "0" or "1" are Intel reserved.

Flag legend (S = status flag, C = control flag, X = system flag):
  ID   X  ID flag (CPUID support)       DF  C  Direction flag
  VIP  X  Virtual interrupt pending     IF  X  Interrupt enable flag
  VIF  X  Virtual interrupt flag        TF  X  Trap flag
  AC   X  Alignment check               SF  S  Sign flag
  VM   X  Virtual-8086 mode             ZF  S  Zero flag
  RF   X  Resume flag                   AF  S  Auxiliary carry flag
  NT   X  Nested task                   PF  S  Parity flag
  IOPL X  I/O privilege level           CF  S  Carry flag
                                        OF  S  Overflow flag
EFLAGS
80286 and up:
• IOPL (I/O privilege level): holds the privilege level at which code must be running in order to execute any I/O-related instructions. 00 is the highest level.
• NT (Nested Task): set when one system task has invoked another through a CALL instruction in protected mode.
80386 and up:
• RF (Resume flag): used with debugging to selectively mask some exceptions.
• VM (Virtual-8086 mode): when 0, the CPU operates in protected mode, 286-emulation mode or real mode. When set, the CPU behaves as a high-speed 8086. This bit has enormous impact.
80486SX and up:
• AC (Alignment check): enables alignment checking, introduced with the 80486SX.
Pentium and up:
• VIF (Virtual interrupt flag): a virtual copy of the interrupt flag.
• VIP (Virtual interrupt pending): provides information about a pending virtual-mode interrupt.
• ID (Identification): supports the CPUID instruction, which provides version-number and manufacturer information about the microprocessor.
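The CPUID instruction advertised by the ID flag can be exercised from C. Below is a minimal sketch using the <cpuid.h> helper shipped with GCC and Clang (the toolchain is an assumption, not prescribed by the slides); leaf 0 returns the vendor string ("GenuineIntel" on Intel parts) and the highest supported leaf.

    /* Minimal CPUID sketch (assumes GCC or Clang targeting x86). */
    #include <stdio.h>
    #include <string.h>
    #include <cpuid.h>   /* __get_cpuid() */

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
        char vendor[13] = {0};

        /* Leaf 0: EAX = highest supported leaf, EBX:EDX:ECX = vendor string. */
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID not supported (the ID flag cannot be toggled).");
            return 1;
        }
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);

        printf("Vendor:       %s\n", vendor);   /* e.g. "GenuineIntel" */
        printf("Highest leaf: %u\n", eax);
        return 0;
    }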
2. Superscalar Integer Execution Units (ALU)

• Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-operation executed in a dedicated segment (stage) that operates concurrently with all the other segments
• Two almost independent integer pipelines (u and v) plus a floating-point pipeline
• Short execution times, because many instructions are hardwired
• Binary compatibility with complex i386 instructions is preserved through a microprogrammed CISC unit
Figure: a six-stage instruction datapath - fetch from memory (using the PC), load the instruction register, decode, read the operand registers, execute in the ALU, and write back the result and flags.

Timing tables (stages S1-S6 vs. clock cycles): without pipelining, I-1 occupies the datapath for cycles 1-6 and I-2 for cycles 7-12, so each instruction takes 6 cycles. With pipelining, I-2 enters S1 in cycle 2 while I-1 is in S2, and once the pipeline is full one instruction completes every cycle (I-2 finishes in cycle 7 instead of 12).
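The cycle counts in the two timing tables can be reproduced with a few lines of arithmetic. The sketch below (helper names are hypothetical, not from the slides) compares non-pipelined and pipelined execution of N instructions through k stages, assuming one cycle per stage and no stalls.

    /* Pipeline timing sketch: one cycle per stage, no stalls assumed. */
    #include <stdio.h>

    static unsigned long cycles_sequential(unsigned long n, unsigned k)
    {
        return n * k;                        /* each instruction uses the datapath alone */
    }

    static unsigned long cycles_pipelined(unsigned long n, unsigned k)
    {
        return (n == 0) ? 0 : k + (n - 1);   /* fill the pipe once, then 1 instr/cycle */
    }

    int main(void)
    {
        const unsigned k = 6;                /* six stages, as in the figure */
        for (unsigned long n = 1; n <= 4; n++)
            printf("N=%lu: sequential=%lu cycles, pipelined=%lu cycles\n",
                   n, cycles_sequential(n, k), cycles_pipelined(n, k));
        /* For N=2 and k=6: 12 cycles sequential vs. 7 cycles pipelined,
           matching the tables above. */
        return 0;
    }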
 Superscalar Architecture

Problem: with a single pipeline, an instruction that needs several cycles in the execute stage (S4) stalls all of the instructions behind it, creating bubbles.
Solution: the execute stage S4 is duplicated into the u and v pipes, so two instructions can be in execution at the same time and throughput approaches two instructions per cycle.
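In the same spirit, the benefit of the second (v) pipe can be approximated by counting issue slots. The sketch below is a deliberate simplification: it assumes every adjacent pair of instructions is pairable, whereas real code pairs less often (see the pairing rules later in this chapter).

    /* Dual-issue approximation: up to 2 instructions enter the u/v pipes per cycle. */
    #include <stdio.h>

    static unsigned long cycles_superscalar(unsigned long n, unsigned k)
    {
        unsigned long issue_cycles = (n + 1) / 2;        /* 2 instructions issued per cycle */
        return (n == 0) ? 0 : (k - 1) + issue_cycles;    /* pipeline fill + issue cycles    */
    }

    int main(void)
    {
        const unsigned k = 5;   /* Pentium integer pipe: PF, D1, D2, EX, WB */
        printf("4 instructions: %lu cycles (vs. %lu cycles single-issue)\n",
               cycles_superscalar(4, k), (unsigned long)(k + 4 - 1));
        return 0;
    }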
PF (Prefetch)
• Moves 16 bytes of the instruction stream into the code queue
• Not required every cycle
  – About ~5 instructions are fetched at once
  – Only useful if the code does not branch

D1 (Decode 1)
• Determines the total instruction length
  – Signals the code-queue aligner where the next instruction begins
• May require two cycles
  – When multiple operands must be decoded
  – About 6% of the instructions in a "typical" DOS program
D2 (Decode 2)
• Extracts memory displacements / immediate operands
• Computes memory addresses
  – Adds the base register and possibly a scaled index register
• May require two cycles
  – If an index register is involved, or both an address displacement and an immediate operand are present
  – Approx. 5% of executed instructions

EX (Execute)
• Reads the register operands
• Computes the ALU function
• Reads or writes memory (data cache)

WB (Write Back)
• Updates the register result / machine state
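The address arithmetic performed in the D2 stage above is a simple sum. The sketch below spells out the x86 effective-address formula (base + index * scale + displacement) with hypothetical field names, just to make the calculation concrete.

    /* Effective-address calculation done in the D2 stage (sketch). */
    #include <stdint.h>
    #include <stdio.h>

    struct mem_operand {
        uint32_t base;      /* contents of the base register, e.g. EBX  */
        uint32_t index;     /* contents of the index register, e.g. ESI */
        uint32_t scale;     /* 1, 2, 4 or 8                             */
        int32_t  disp;      /* signed displacement from the instruction */
    };

    static uint32_t effective_address(const struct mem_operand *op)
    {
        return op->base + op->index * op->scale + (uint32_t)op->disp;
    }

    int main(void)
    {
        /* mov eax, [ebx + esi*4 + 8] with EBX = 0x1000, ESI = 3  ->  EA = 0x1014 */
        struct mem_operand op = { 0x1000, 3, 4, 8 };
        printf("EA = 0x%08X\n", (unsigned)effective_address(&op));
        return 0;
    }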
- Simultaneous or sequential execution is decided in the D1 stage

The Pentium executes instructions I1 and I2 in parallel if:
• Both are "simple" instructions
  – They don't require microcode sequencing (they are hardwired)
  – Some operations require u-pipe resources
  – About 90% of executed instructions are "simple"
• There are no data dependencies between the instructions
  – I1 is not a jump
  – The destination of I1 is not a source of I2
    (but I1 setting the condition codes and I2 being a conditional jump is allowed)
  – The destination of I1 is not the destination of I2
• Neither instruction contains both a displacement (offset) and an immediate operand
• Instructions with a prefix execute only in the u-pipe (so they cannot issue in the v-pipe)

A sketch of these rules as a small predicate is shown below.
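The pairing rules can be collected into a single check. The instruction descriptor below is hypothetical and simplified (the real decoder also has to respect a few more special cases and u-pipe-only resources); it only illustrates the conditions listed above.

    /* Simplified Pentium u/v pairing check (hypothetical descriptor, sketch only). */
    #include <stdbool.h>
    #include <stdio.h>

    struct instr {
        bool simple;            /* executes without microcode sequencing        */
        bool is_jump;           /* any control transfer                         */
        bool has_prefix;        /* operand-size, segment override, etc.         */
        bool has_disp_and_imm;  /* both a displacement and an immediate operand */
        int  dest_reg;          /* -1 if none                                   */
        int  src_reg;           /* -1 if none                                   */
    };

    static bool can_pair(const struct instr *i1, const struct instr *i2)
    {
        if (!i1->simple || !i2->simple)                         return false;
        if (i1->is_jump)                                        return false; /* I1 must not be a jump */
        if (i1->has_prefix || i2->has_prefix)                   return false; /* prefixes: u-pipe only */
        if (i1->has_disp_and_imm || i2->has_disp_and_imm)       return false;
        if (i1->dest_reg >= 0 && i1->dest_reg == i2->src_reg)   return false; /* read-after-write      */
        if (i1->dest_reg >= 0 && i1->dest_reg == i2->dest_reg)  return false; /* write-after-write     */
        return true;  /* flag-setting I1 followed by a conditional jump I2 is still allowed */
    }

    int main(void)
    {
        struct instr add_ax = { true, false, false, false, /*dest*/0, /*src*/1 };
        struct instr mov_cx = { true, false, false, false, /*dest*/2, /*src*/0 };
        printf("pairable: %s\n", can_pair(&add_ax, &mov_cx) ? "yes" : "no"); /* no: RAW on reg 0 */
        return 0;
    }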
PENTIUM PIPELINE STAGES
3. FPU
- 8-stage pipeline (the first 4 stages are shared with the u pipeline)
- accepted formats: 32/64/80 bits, per IEEE 754-1985
- many FPU operations are implemented with look-up tables (RISC style):
  the results are wired into tables and the operands act as the index
- the pipelined FPU is 2 to 10 times faster than the 486 FPU

- The FDIV bug! Free replacement...
  962,306,957,033 / 11,010,046 = 87,402.6282027341 (correct answer)
  962,306,957,033 / 11,010,046 = 87,399.5805831329 (flawed Pentium)
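The flaw could be detected in software. A widely circulated check (not from the slides) divides two constants whose quotient falls into one of the defective lookup-table regions and compares against the back-multiplication; a minimal C sketch:

    /* Classic FDIV-bug check (sketch): on a flawed Pentium the residual
       computed this way is about 256 instead of (nearly) 0. */
    #include <stdio.h>

    int main(void)
    {
        volatile double x = 4195835.0, y = 3145727.0;  /* volatile: force real FPU ops */
        double r = x - (x / y) * y;                    /* should be essentially 0.0    */

        if (r > 1.0)
            printf("FDIV bug detected (residual = %g)\n", r);
        else
            printf("Division OK (residual = %g)\n", r);
        return 0;
    }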
Pentium Floating-Point Pipeline

The FP pipeline has 8 stages and shares its first 4 stages (IF, D1, D2, EX) with the u integer pipeline; the WB stage of the u-pipe is the first execution stage (X1) of the FP pipeline, followed by X2, WF (write float) and ER (error reporting).

The floating-point unit contains the register stack ST(0)-ST(7), an adder, a multiplier and a divider.

FP instructions cannot be paired with instructions in the v-pipe (except FXCH).
"Nothing is as hard to predict as the future."

4. BTB - Branch Target Buffer - Branch Prediction

• Stores information about previously executed branches
  – Indexed by the branch instruction's address
  – Specifies the branch destination and whether or not the branch was taken
• 256 entries

- The prediction is made in the D1 stage for (near) conditional jump instructions
- For every new jump, the µP stores the jump instruction's address and the jump destination address
- The µP looks up the BTB (256 entries)
- If the address is found in the BTB, the jump is assumed to go to the stored destination
- Only in the execution stage does the µP know whether the jump must actually be taken
- If the jump is taken as predicted, the prediction was correct ==> no delay
Need the Target Address at the Same Time as the Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the table to get both the prediction AND the branch target address (if taken)
  – The stored branch PC must be compared against the fetch PC, since a wrong branch address cannot be used

BTB entry: [ Branch PC | Predicted target PC | extra prediction state bits ]
  – At FETCH time, the PC of the instruction is compared (=?) against the stored branch PC
  – Yes: the instruction is a branch; use the predicted PC as the next PC
  – No: the branch is not predicted; proceed normally (next PC = PC + 4)
Example program and BTB contents:

  $    add ax, cx
  $+2  cmp ax, 0
  $+4  jc  add1        ; if C = 1, jump to add1
  $+6  sub cx, 2
  ...
  add1: xor ax, 76h

  BTB (256 entries):
  entry 0:  jump address $+4  ->  destination address of add1
  entry 1:  jump address 2    ->  destination address 2
  entry 2:  jump address 3    ->  destination address 3
  ...
  entry FF: jump address FF   ->  destination address FF
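A 256-entry BTB can be modelled as a small table indexed by the branch instruction's address. The sketch below uses a hypothetical, direct-mapped organisation (not Intel's actual one) just to capture the behaviour described above: on a hit the predicted target is fetched, and the entry is updated once the branch resolves.

    /* Simplified, direct-mapped 256-entry BTB model (sketch only). */
    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 256u

    struct btb_entry {
        bool     valid;
        uint32_t branch_addr;   /* address of the branch instruction */
        uint32_t target_addr;   /* where it jumped to last time      */
        bool     taken;         /* last outcome                      */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Lookup at fetch/D1 time: returns true and the predicted target on a hit. */
    static bool btb_predict(uint32_t pc, uint32_t *predicted_target)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (e->valid && e->branch_addr == pc && e->taken) {
            *predicted_target = e->target_addr;
            return true;
        }
        return false;   /* miss or "not taken": fall through to the next instruction */
    }

    /* Update when the branch resolves late in the pipeline. */
    static void btb_update(uint32_t pc, uint32_t target, bool taken)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        e->valid = true;
        e->branch_addr = pc;
        e->target_addr = target;
        e->taken = taken;
    }

    int main(void)
    {
        uint32_t t;
        btb_update(0x0004, 0x0040, true);        /* hypothetical: "jc add1" at $+4 was taken */
        return btb_predict(0x0004, &t) ? 0 : 1;  /* next time: predict taken, target 0x0040  */
    }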
• Branch Processing
  • Look up the instruction in the BTB
  • If found, start fetching at the predicted destination
  • The branch condition is resolved early in the WB stage
    – If the prediction was correct: no branch penalty
    – If the prediction was incorrect: ~3 cycles are lost
      » which corresponds to ==> about 3 instructions
  • Update the BTB

HW: Study
http://www.x86.org/articles/branch/branchprediction.htm#fig3
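The ~3-cycle penalty translates directly into a CPI overhead. A quick back-of-the-envelope calculation, with assumed branch frequency and prediction accuracy, just to illustrate the formula extra CPI = branch fraction x misprediction rate x penalty:

    /* Misprediction cost estimate: extra CPI = f_branch * miss_rate * penalty. */
    #include <stdio.h>

    int main(void)
    {
        double f_branch  = 0.20;  /* assumed: 1 in 5 instructions is a branch   */
        double miss_rate = 0.10;  /* assumed: the BTB predicts 90% correctly    */
        double penalty   = 3.0;   /* cycles lost per misprediction (from above) */

        double extra_cpi = f_branch * miss_rate * penalty;
        printf("Extra CPI from mispredictions: %.3f\n", extra_cpi);   /* 0.060 */
        return 0;
    }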
BRANCH PREDICTION OPTIMIZATION

- Eliminating and reducing the number of branches
  • Removes the possibility of branch mispredictions
  • Reduces the number of BTB entries required

Use replacement instructions instead of branch instructions:
  • SETcc
  • CMOVcc or FCMOVcc
  !! These combine a conditional jump (JNE, JGE, etc.) and a MOV into one instruction
BRANCH PREDICTION OPTIMIZATION

Ex. 1:  if (A >= B) EBX = C2; else EBX = C1

Original (with a branch):
    cmp  A, B          ; A - B -> flags
    jge  E0            ; if A >= B (SF = OF), jump
    mov  ebx, C1       ; EBX = C1
    jmp  E1
E0: mov  ebx, C2       ; EBX = C2
E1:

Optimized (branch-free):
    xor   ebx, ebx     ; EBX = 0
    cmp   A, B         ; A - B -> flags
    setge bl           ; if A >= B, BL = 1
    dec   ebx          ; A >= B: EBX = 0, else EBX = 0FFFFFFFFh
    and   ebx, (C1-C2) ; A >= B: EBX = 0, else EBX = C1 - C2
    add   ebx, C2      ; A >= B: EBX = C2, else EBX = C1

Ex. 2:  ECX = ECX + Val; if the addition sets the carry (C = 1), then ECX = 0

Original (with a branch):
    xor  ebx, ebx      ; EBX = 0
    add  ecx, [Val]    ; ECX = ECX + Val
    jnc  Continue      ; if C = 0, skip
    mov  ecx, ebx      ; C = 1: ECX = EBX = 0
Continue:

Optimized (branch-free):
    xor   ebx, ebx     ; EBX = 0
    add   ecx, [Val]   ; ECX = ECX + Val
    cmovc ecx, ebx     ; if C = 1: ECX = EBX = 0, else ECX unchanged
Continue:
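The same transformations exist at the C level, and compilers typically emit SETcc/CMOVcc for code written this way. A sketch of both examples with hypothetical variable names:

    /* C-level equivalents of the branch-elimination examples (sketch). */
    #include <stdint.h>

    /* Ex. 1: ebx = (a >= b) ? c2 : c1, written without a branch. */
    uint32_t select_no_branch(int32_t a, int32_t b, uint32_t c1, uint32_t c2)
    {
        uint32_t mask = (uint32_t)0 - (uint32_t)(a >= b);   /* 0x00000000 or 0xFFFFFFFF */
        return (c1 & ~mask) | (c2 & mask);                  /* a >= b ? c2 : c1         */
    }

    /* Ex. 2: clamp to 0 on unsigned overflow instead of branching on the carry. */
    uint32_t add_clamp_on_carry(uint32_t ecx, uint32_t val)
    {
        uint32_t sum = ecx + val;
        return (sum < ecx) ? 0 : sum;   /* compilers usually turn this into CMOVC */
    }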
5. Cache Memory: 16 KB (8 KB + 8 KB)
• There are two separate 8 KB caches - one for code and one for data.
• Each cache has its own address-translation TLB, which translates linear addresses to physical addresses.
• Code cache:
  – 2-way set-associative
  – a 256-line (256-bit) path between the code cache and the prefetch buffers, permitting prefetching of 32 bytes (256/8) of instructions at a time
What are L1 and L2?
• Level-1 and Level-2 caches
• The cache memories in a computer are much faster than DRAM
• L1 is built on the microprocessor chip itself
• L2 is a separate chip
• The L2 cache is much larger than the L1 cache

Separate code and data caches
• On-chip 8 KB code cache and 8 KB write-back data cache
• Two-way set-associative
• MESI cache-coherence protocol
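A 2-way set-associative lookup can be sketched in a few lines. The geometry below (8 KB, 32-byte lines, 2 ways -> 128 sets) matches the sizes quoted above; replacement policy and the MESI state machine are deliberately omitted.

    /* 2-way set-associative lookup sketch: 8 KB, 32-byte lines -> 128 sets. */
    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 32u
    #define NUM_SETS  128u          /* 8 KB / 32 B / 2 ways */
    #define NUM_WAYS  2u

    struct cache_line {
        bool     valid;             /* MESI state collapsed to valid/invalid here */
        uint32_t tag;
    };

    static struct cache_line cache[NUM_SETS][NUM_WAYS];

    static bool cache_hit(uint32_t phys_addr)
    {
        uint32_t set = (phys_addr / LINE_SIZE) % NUM_SETS;    /* address bits 5..11  */
        uint32_t tag =  phys_addr / (LINE_SIZE * NUM_SETS);   /* address bits 12 up  */

        for (unsigned way = 0; way < NUM_WAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return true;                                  /* hit in this way     */
        return false;                                         /* miss: burst line fill */
    }

    int main(void)
    {
        cache[(0x1234 / LINE_SIZE) % NUM_SETS][0] =
            (struct cache_line){ true, 0x1234 / (LINE_SIZE * NUM_SETS) };
        return cache_hit(0x1234) ? 0 : 1;    /* same line -> hit */
    }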
6. Pentium Buses

Block diagram: a 64-bit bus interface connects the processor to the external bus. The code cache and the branch prediction unit feed the prefetch buffers, which issue instructions to the u and v integer pipelines (each with its own ALU, sharing the register set) and to the pipelined floating-point unit (multiply, add, divide). Internal paths of 32, 64 and 256 bits connect the register set, the data cache and the cache fill/prefetch paths.
• The Pentium processors have a 64-bit data bus
  – The Pentium is still a 32-bit CPU, because its registers are 32 bits wide
  – A standard single-transfer cycle can read or write up to 64 bits at a time (8 bytes - the width of the data bus)
• Burst read and burst write-back cycles are supported by the Pentium processors
  – Burst-mode cycles are used for cache operations and transfer 32 bytes in 4 clocks (4 x 8 bytes = 4 x 64 bits = 256 bits)
  – 32 bytes is the size of the Pentium cache line
  – For the Pentium, all cache operations are burst cycles
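The burst numbers work out as follows; the 66 MHz front-side bus clock is an assumed figure (the first Pentiums shipped at 60/66 MHz), and the result is a peak rate that ignores wait states.

    /* Burst line-fill arithmetic: 4 transfers x 64 bits = 32 bytes = one cache line. */
    #include <stdio.h>

    int main(void)
    {
        const double bus_mhz   = 66.0;   /* assumed FSB clock            */
        const int    beats     = 4;      /* clocks per burst             */
        const int    bus_bytes = 8;      /* 64-bit data bus = 8 bytes    */

        int    line_bytes = beats * bus_bytes;                      /* 32 bytes  */
        double fill_ns    = beats / bus_mhz * 1000.0;               /* ~60.6 ns  */
        double peak_mbs   = line_bytes / (fill_ns * 1e-9) / 1e6;    /* ~528 MB/s */

        printf("Line size: %d bytes, fill time: %.1f ns, peak burst: %.0f MB/s\n",
               line_bytes, fill_ns, peak_mbs);
        return 0;
    }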
• Prefetch buffers:
  – Four prefetch buffers inside the processor work as two independent pairs
  – When instructions are prefetched from the cache, they are placed into one set of prefetch buffers
  – The other set is used when a branch operation is predicted
  – The prefetch buffers send a pair of instructions to the instruction decoder
• Instruction Decode Unit:
  – Decoding occurs in two stages - Decode 1 (D1) and Decode 2 (D2)
  – D1 checks whether instructions can be paired
  – D2 calculates the addresses of memory-resident operands
• Control Unit:
  – Interprets the instruction word and the microcode entry point fed to it by the Instruction Decode Unit
  – Handles exceptions, breakpoints and interrupts
  – Controls the integer pipelines and the floating-point sequences
• Microcode ROM:
  – Stores the microcode sequences
