
C1.

PENTIUM

Overview

http://developer.intel.com/software/products/itc/architec/ia32/pentdown.htm
http://bwrc.eecs.berkeley.edu/CIC/archive/cpu_history.html
http://bwrc.eecs.berkeley.edu/CIC/die_photos/
http://or1cedar.intel.com/media/training/intro_ht_dt_v1/tutorial/index.htm
http://bwrc.eecs.berkeley.edu/CIC/

>10^4 increase in transistor count & clock frequency over 30 years!
x86 μP Operating Modes

8086/8088: MIN and MAX hardware modes; Real Mode only (1 MB address space)
80286:     Real Mode (1 MB) and Protected Mode (16 MB)

80386 operating modes (and the processor each mode behaves like):
  Real Mode          - 16 bits - 1 MB  - as an 8086
  Protected Mode     - 16 bits - 16 MB - as an 80286
  Protected Mode     - 32 bits - 4 GB  - native 80386
  Virtual Real Mode  - 16 bits - 1 MB  - as an 8086
HOW TO IMPROVE PROCESSORS?

Means of increasing performance:
- increasing the CLK frequency
- improving the architecture
- extending the instruction set
- widening the data size (16/32/64 bits)
- increasing the cache memory capacity
The Heat Problem

Figure: power density (W/cm2, log scale from 1 to 1000) versus process feature size (1.5 µm down to 0.07 µm). Successive processors (i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4 Willamette and Prescott) climb from a few W/cm2 past the "hot plate" level toward nuclear-reactor and rocket-nozzle power densities.
• Heat sinks required on the 6xx-series Pentium 4s
Increasing Frequency
- Pentium line: from 60 MHz to 3,800 MHz in about 12 years
- Frequency scaling accounted for roughly 80% of the performance increase

Execution Optimization
• More powerful instructions
• Execution optimization: pipelining, branch prediction, superscalar execution of multiple instructions, reordering of the instruction stream, hyper-threading, multi-core, etc.

Microarchitecture Trends

Figure: performance in MIPS (log scale, 10^1 to 10^6) versus year (1980-2010):
- Pentium architecture: superscalar (era of instruction parallelism)
- Pentium Pro architecture: speculative, out-of-order execution
- Pentium 4 architecture: trace cache
- Pentium 4 and Xeon architecture with HT: multi-threaded (era of thread parallelism)
- Multi-threaded, multi-core
Adapted from Johan De Gelas, Quest for More Processing Power, AnandTech, Feb. 8, 2005.
Larger Caches
• On-chip caches ameliorate the growing disparity between processor speed and memory latency and bandwidth
• On-chip caches will continue to increase in size and help mitigate disparities in computer subsystem performance

Figure: transistors per die, 1960-2010 (log scale, 10^0 to 10^11). Memory grows from 1K to 4G bits per die; microprocessors grow from the 4004, 8080, 8086 and 80286 through the i386, i486, Pentium, Pentium II, Pentium III, Pentium 4 and Itanium. Moore's Law still holds. Source: Intel.
New Technologies for Computers

• Low-power processors
• Multi-core chips
  Architecture of dual-core chips:
  - IBM POWER5: shared 1.92 MB L2 cache
  - AMD Opteron: separate 1 MB L2 caches; CPU0 and CPU1 communicate through the SRQ (System Request Queue)
  - Intel Pentium 4 based dual-core: two processors "glued" together in one package
The Personal Computer Architecture

Block diagram: the processor (8086 ... Pentium) and its coprocessor (8087 ... 387), the timer logic (8254), the system ROM and the 640 KB DRAM all sit on the system bus (data, address and control signals). Also attached to the bus: keyboard logic (8255), DMA controller (8237), interrupt logic (8259), expansion logic and the extension slots (video card, disk controller, serial port, keyboard, ...).
x86 IA-32 Features

Microarchitecture + ISA (instruction set architecture) = COMPUTER ARCHITECTURE

Multiple data sizes and addressing methods
• Recent generations optimized for 32-bit mode
Limited number of registers
• Stack-oriented procedure calls and FP instructions
• Programs reference memory heavily (41%)
Variable-length instructions (CISC)
• The first few bytes describe the operation and operands
• The remaining bytes give immediate data and address displacements
• Average instruction length is about 2.5 bytes
x86 instruction classes:
• Data transfer
• Common ALU operations
• Control flow
• String operations
• FP, MMX and SSE instructions
• OS support
• I/O instructions
• "Exotic" instructions (DAA, XLAT, etc.)
x86 - Complex Instruction Set
CISC drawbacks:
• Most instructions are complicated enough that they have to be broken into a sequence of micro-steps
• These micro-steps are called microcode
• Microcode is stored in a ROM inside the processor core
• The microcode ROM costs access time and die area, and requires extra decode logic

RISC: "Less is More"
• RISC = Reduced Instruction Set Computer
• 20/80 rule: 20% of the instructions account for 80% of the execution time
• A sequence of simple instructions often runs faster than a single complex instruction with the same effect
RISC Background
• Reduce the instruction set to simplify decoding
  Smaller instruction set -> simpler logic -> smaller logic -> faster execution
• Eliminate microcode – hardwire all instruction execution
• Pipeline instruction decoding and execution – do more operations in parallel
• Load/store architecture – only the load and store instructions can access memory
  – All other instructions work on the processor's internal registers
  – This is necessary for single-cycle execution – the execution unit can't wait for data to be read or written
• Increase the number of internal registers (required by the load/store architecture)
• Registers are more general purpose and less tied to specific functions
• The compiler is designed along with the RISC processor
  – It has to be aware of the processor architecture to produce code that executes efficiently
PENTIUM – A RISC-Like Architecture

• The Pentium has two execution pipelines:
  - one (the v-pipe) restricted to simple instructions, and
  - one (the u-pipe) that can execute all instructions
• The simple instructions are a hardwired subset, optimized for fast execution
• Complex instructions are executed by the other engine, using microcode

http://www.tomshardware.com/reviews/intel-cpu-history,1986-6.html
Pentium Main Features
- 0.8 µm process technology
- 60 MHz clock frequency
- ~100 MIPS
- 100% compatibility with earlier x86 generations
- 32-bit registers
- 32-bit address bus
- 64-bit data bus
- superscalar architecture: two integer pipelines (u, v) with two ALUs
- executes 2 simple instructions per clock
- two 8 KB caches (code + data)
- fast, pipelined FPU (look-up tables)
- branch prediction (BTB)

Die photo (numbered blocks correspond to the features above):
http://bwrc.eecs.berkeley.edu/CIC/die_photos/pentium.gif
1. Register Set
- General-purpose registers (EAX, EBX, ... ESI, EDI)
- Segment registers (CS, DS, ES, SS, FS, GS)
- Control registers (CR0-CR4)
- Memory-management registers
- Debug registers (DR0-DR7) and test registers
- EFLAGS register
Instruction Pointer and EFLAGS (32-bit; the low 16 bits form the legacy 16-bit registers):
  EIP (IP in bits 15..0), EFLAGS (FLAG in bits 15..0)

General-Purpose Registers (32-bit, with 16-bit and 8-bit sub-registers):
  EAX (AX = AH:AL), EBX (BX = BH:BL), ECX (CX = CH:CL), EDX (DX = DH:DL),
  ESI (SI), EDI (DI), EBP (BP), ESP (SP)

Segment Registers (16-bit): CS, SS, DS, ES, FS, GS
Memory-Management Registers:
  TR   - TSS selector (16-bit), TSS base address (32-bit), TSS limit
  LDTR - LDT selector (16-bit), LDT base address (32-bit), LDT limit
  IDTR - IDT base address (32-bit), IDT limit
  GDTR - GDT base address (32-bit), GDT limit

Control Registers (32-bit): CR0-CR4
Debug Registers (32-bit):   DR0-DR7
Test Registers (32-bit):    TR6, TR7, TR12
EFLAGS bit layout (bit 31 ... bit 0):
  reserved | ID | VIP | VIF | AC | VM | RF | 0 | NT | IOPL (2 bits) | OF | DF | IF | TF | SF | ZF | 0 | AF | 0 | PF | 1 | CF
Bit positions shown as "0" or "1" are Intel reserved.

Flag legend (S = status flag, C = control flag, X = system flag):
  ID   X  ID flag (CPUID support)       DF  C  Direction flag
  VIP  X  Virtual interrupt pending     IF  X  Interrupt enable flag
  VIF  X  Virtual interrupt flag        TF  X  Trap flag
  AC   X  Alignment check               SF  S  Sign flag
  VM   X  Virtual-8086 mode             ZF  S  Zero flag
  RF   X  Resume flag                   AF  S  Auxiliary carry flag
  NT   X  Nested task                   PF  S  Parity flag
  IOPL X  I/O privilege level           CF  S  Carry flag
                                        OF  S  Overflow flag
EFLAGS
80286 and up:
• IOPL (I/O privilege level): holds the privilege level at which code must be running in order to execute any I/O-related instructions. 00 is the highest level.
• NT (Nested Task): set when one system task has invoked another through a CALL instruction in protected mode.
80386 and up:
• RF (Resume flag): used with debugging to selectively mask some exceptions.
• VM (Virtual-8086 mode): when 0, the CPU operates in protected mode, 286-emulation mode or real mode. When set, the CPU behaves as a high-speed 8086. This bit has enormous impact.
80486SX and up:
• AC (Alignment check): enables alignment checking, introduced with the 80486SX.
Pentium and up:
• VIF (Virtual interrupt flag): a virtual copy of the interrupt flag.
• VIP (Virtual interrupt pending): provides information about a pending virtual-mode interrupt.
• ID (Identification): supports the CPUID instruction, which provides version-number and manufacturer information about the microprocessor.
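The CPUID instruction advertised by the ID flag can be exercised from C. Below is a minimal sketch using the <cpuid.h> helper shipped with GCC and Clang (the toolchain is an assumption, not prescribed by the slides); leaf 0 returns the vendor string ("GenuineIntel" on Intel parts) and the highest supported leaf.

    /* Minimal CPUID sketch (assumes GCC or Clang targeting x86). */
    #include <stdio.h>
    #include <string.h>
    #include <cpuid.h>   /* __get_cpuid() */

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
        char vendor[13] = {0};

        /* Leaf 0: EAX = highest supported leaf, EBX:EDX:ECX = vendor string. */
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID not supported (the ID flag cannot be toggled).");
            return 1;
        }
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);

        printf("Vendor:       %s\n", vendor);   /* e.g. "GenuineIntel" */
        printf("Highest leaf: %u\n", eax);
        return 0;
    }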
2. Superscalar Integer Execution Units (ALU)

• Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-operation executed in a dedicated segment (stage) that operates concurrently with all the other segments
• Two almost independent integer pipelines (u and v) plus a floating-point pipeline
• Short execution times, because many instructions are hardwired
• Binary compatibility with complex i386 instructions is preserved through a microprogrammed CISC unit
Figure: a six-stage instruction datapath - fetch from memory (using the PC), load the instruction register, decode, read the operand registers, execute in the ALU, and write back the result and flags.

Timing tables (stages S1-S6 vs. clock cycles): without pipelining, I-1 occupies the datapath for cycles 1-6 and I-2 for cycles 7-12, so each instruction takes 6 cycles. With pipelining, I-2 enters S1 in cycle 2 while I-1 is in S2, and once the pipeline is full one instruction completes every cycle (I-2 finishes in cycle 7 instead of 12).
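The cycle counts in the two timing tables can be reproduced with a few lines of arithmetic. The sketch below (helper names are hypothetical, not from the slides) compares non-pipelined and pipelined execution of N instructions through k stages, assuming one cycle per stage and no stalls.

    /* Pipeline timing sketch: one cycle per stage, no stalls assumed. */
    #include <stdio.h>

    static unsigned long cycles_sequential(unsigned long n, unsigned k)
    {
        return n * k;                        /* each instruction uses the datapath alone */
    }

    static unsigned long cycles_pipelined(unsigned long n, unsigned k)
    {
        return (n == 0) ? 0 : k + (n - 1);   /* fill the pipe once, then 1 instr/cycle */
    }

    int main(void)
    {
        const unsigned k = 6;                /* six stages, as in the figure */
        for (unsigned long n = 1; n <= 4; n++)
            printf("N=%lu: sequential=%lu cycles, pipelined=%lu cycles\n",
                   n, cycles_sequential(n, k), cycles_pipelined(n, k));
        /* For N=2 and k=6: 12 cycles sequential vs. 7 cycles pipelined,
           matching the tables above. */
        return 0;
    }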
 Superscalar Architecture

Problem: with a single pipeline, an instruction that needs several cycles in the execute stage (S4) stalls all of the instructions behind it, creating bubbles.
Solution: the execute stage S4 is duplicated into the u and v pipes, so two instructions can be in execution at the same time and throughput approaches two instructions per cycle.
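In the same spirit, the benefit of the second (v) pipe can be approximated by counting issue slots. The sketch below is a deliberate simplification: it assumes every adjacent pair of instructions is pairable, whereas real code pairs less often (see the pairing rules later in this chapter).

    /* Dual-issue approximation: up to 2 instructions enter the u/v pipes per cycle. */
    #include <stdio.h>

    static unsigned long cycles_superscalar(unsigned long n, unsigned k)
    {
        unsigned long issue_cycles = (n + 1) / 2;        /* 2 instructions issued per cycle */
        return (n == 0) ? 0 : (k - 1) + issue_cycles;    /* pipeline fill + issue cycles    */
    }

    int main(void)
    {
        const unsigned k = 5;   /* Pentium integer pipe: PF, D1, D2, EX, WB */
        printf("4 instructions: %lu cycles (vs. %lu cycles single-issue)\n",
               cycles_superscalar(4, k), (unsigned long)(k + 4 - 1));
        return 0;
    }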
PF (Prefetch)
• Moves 16 bytes of the instruction stream into the code queue
• Not required every cycle
  – About ~5 instructions are fetched at once
  – Only useful if the code does not branch

D1 (Decode 1)
• Determines the total instruction length
  – Signals the code-queue aligner where the next instruction begins
• May require two cycles
  – When multiple operands must be decoded
  – About 6% of the instructions in a "typical" DOS program
D2 (Decode 2)
• Extracts memory displacements / immediate operands
• Computes memory addresses
  – Adds the base register and possibly a scaled index register
• May require two cycles
  – If an index register is involved, or both an address displacement and an immediate operand are present
  – Approx. 5% of executed instructions

EX (Execute)
• Reads the register operands
• Computes the ALU function
• Reads or writes memory (data cache)

WB (Write Back)
• Updates the register result / machine state
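The address arithmetic performed in the D2 stage above is a simple sum. The sketch below spells out the x86 effective-address formula (base + index * scale + displacement) with hypothetical field names, just to make the calculation concrete.

    /* Effective-address calculation done in the D2 stage (sketch). */
    #include <stdint.h>
    #include <stdio.h>

    struct mem_operand {
        uint32_t base;      /* contents of the base register, e.g. EBX  */
        uint32_t index;     /* contents of the index register, e.g. ESI */
        uint32_t scale;     /* 1, 2, 4 or 8                             */
        int32_t  disp;      /* signed displacement from the instruction */
    };

    static uint32_t effective_address(const struct mem_operand *op)
    {
        return op->base + op->index * op->scale + (uint32_t)op->disp;
    }

    int main(void)
    {
        /* mov eax, [ebx + esi*4 + 8] with EBX = 0x1000, ESI = 3  ->  EA = 0x1014 */
        struct mem_operand op = { 0x1000, 3, 4, 8 };
        printf("EA = 0x%08X\n", (unsigned)effective_address(&op));
        return 0;
    }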
- Simultaneous or sequential execution is decided in the D1 stage

The Pentium executes instructions I1 and I2 in parallel if:
• Both are "simple" instructions
  – They don't require microcode sequencing (they are hardwired)
  – Some operations require u-pipe resources
  – About 90% of executed instructions are "simple"
• There are no data dependencies between the instructions
  – I1 is not a jump
  – The destination of I1 is not a source of I2
    (but I1 setting the condition codes and I2 being a conditional jump is allowed)
  – The destination of I1 is not the destination of I2
• Neither instruction contains both a displacement (offset) and an immediate operand
• Instructions with a prefix execute only in the u-pipe (so they cannot issue in the v-pipe)

A sketch of these rules as a small predicate is shown below.
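The pairing rules can be collected into a single check. The instruction descriptor below is hypothetical and simplified (the real decoder also has to respect a few more special cases and u-pipe-only resources); it only illustrates the conditions listed above.

    /* Simplified Pentium u/v pairing check (hypothetical descriptor, sketch only). */
    #include <stdbool.h>
    #include <stdio.h>

    struct instr {
        bool simple;            /* executes without microcode sequencing        */
        bool is_jump;           /* any control transfer                         */
        bool has_prefix;        /* operand-size, segment override, etc.         */
        bool has_disp_and_imm;  /* both a displacement and an immediate operand */
        int  dest_reg;          /* -1 if none                                   */
        int  src_reg;           /* -1 if none                                   */
    };

    static bool can_pair(const struct instr *i1, const struct instr *i2)
    {
        if (!i1->simple || !i2->simple)                         return false;
        if (i1->is_jump)                                        return false; /* I1 must not be a jump */
        if (i1->has_prefix || i2->has_prefix)                   return false; /* prefixes: u-pipe only */
        if (i1->has_disp_and_imm || i2->has_disp_and_imm)       return false;
        if (i1->dest_reg >= 0 && i1->dest_reg == i2->src_reg)   return false; /* read-after-write      */
        if (i1->dest_reg >= 0 && i1->dest_reg == i2->dest_reg)  return false; /* write-after-write     */
        return true;  /* flag-setting I1 followed by a conditional jump I2 is still allowed */
    }

    int main(void)
    {
        struct instr add_ax = { true, false, false, false, /*dest*/0, /*src*/1 };
        struct instr mov_cx = { true, false, false, false, /*dest*/2, /*src*/0 };
        printf("pairable: %s\n", can_pair(&add_ax, &mov_cx) ? "yes" : "no"); /* no: RAW on reg 0 */
        return 0;
    }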
PENTIUM PIPELINE STAGES
3. FPU
- 8-stage pipeline (the first 4 stages are shared with the u pipeline)
- accepted formats: 32/64/80 bits, per IEEE 754-1985
- many FPU operations are implemented with look-up tables (RISC style):
  the results are wired into tables and the operands act as the index
- the pipelined FPU is 2 to 10 times faster than the 486 FPU

- The FDIV bug! Free replacement...
  962,306,957,033 / 11,010,046 = 87,402.6282027341 (correct answer)
  962,306,957,033 / 11,010,046 = 87,399.5805831329 (flawed Pentium)
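The flaw could be detected in software. A widely circulated check (not from the slides) divides two constants whose quotient falls into one of the defective lookup-table regions and compares against the back-multiplication; a minimal C sketch:

    /* Classic FDIV-bug check (sketch): on a flawed Pentium the residual
       computed this way is about 256 instead of (nearly) 0. */
    #include <stdio.h>

    int main(void)
    {
        volatile double x = 4195835.0, y = 3145727.0;  /* volatile: force real FPU ops */
        double r = x - (x / y) * y;                    /* should be essentially 0.0    */

        if (r > 1.0)
            printf("FDIV bug detected (residual = %g)\n", r);
        else
            printf("Division OK (residual = %g)\n", r);
        return 0;
    }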
Pentium Floating-Point Pipeline

The FP pipeline has 8 stages and shares its first 4 stages (IF, D1, D2, EX) with the u integer pipeline; the WB stage of the u-pipe is the first execution stage (X1) of the FP pipeline, followed by X2, WF (write float) and ER (error reporting).

The floating-point unit contains the register stack ST(0)-ST(7), an adder, a multiplier and a divider.

FP instructions cannot be paired with instructions in the v-pipe (except FXCH).
"Nothing is as hard to predict as the future."

4. BTB - Branch Target Buffer - Branch Prediction

• Stores information about previously executed branches
  – Indexed by the branch instruction's address
  – Specifies the branch destination and whether or not the branch was taken
• 256 entries

- The prediction is made in the D1 stage for (near) conditional jump instructions
- For every new jump, the µP stores the jump instruction's address and the jump destination address
- The µP looks up the BTB (256 entries)
- If the address is found in the BTB, the jump is assumed to go to the stored destination
- Only in the execution stage does the µP know whether the jump must actually be taken
- If the jump is taken as predicted, the prediction was correct ==> no delay
Need the Target Address at the Same Time as the Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the table to get both the prediction AND the branch target address (if taken)
  – The stored branch PC must be compared against the fetch PC, since a wrong branch address cannot be used

BTB entry: [ Branch PC | Predicted target PC | extra prediction state bits ]
  – At FETCH time, the PC of the instruction is compared (=?) against the stored branch PC
  – Yes: the instruction is a branch; use the predicted PC as the next PC
  – No: the branch is not predicted; proceed normally (next PC = PC + 4)
Example program and BTB contents:

  $    add ax, cx
  $+2  cmp ax, 0
  $+4  jc  add1        ; if C = 1, jump to add1
  $+6  sub cx, 2
  ...
  add1: xor ax, 76h

  BTB (256 entries):
  entry 0:  jump address $+4  ->  destination address of add1
  entry 1:  jump address 2    ->  destination address 2
  entry 2:  jump address 3    ->  destination address 3
  ...
  entry FF: jump address FF   ->  destination address FF
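A 256-entry BTB can be modelled as a small table indexed by the branch instruction's address. The sketch below uses a hypothetical, direct-mapped organisation (not Intel's actual one) just to capture the behaviour described above: on a hit the predicted target is fetched, and the entry is updated once the branch resolves.

    /* Simplified, direct-mapped 256-entry BTB model (sketch only). */
    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 256u

    struct btb_entry {
        bool     valid;
        uint32_t branch_addr;   /* address of the branch instruction */
        uint32_t target_addr;   /* where it jumped to last time      */
        bool     taken;         /* last outcome                      */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Lookup at fetch/D1 time: returns true and the predicted target on a hit. */
    static bool btb_predict(uint32_t pc, uint32_t *predicted_target)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (e->valid && e->branch_addr == pc && e->taken) {
            *predicted_target = e->target_addr;
            return true;
        }
        return false;   /* miss or "not taken": fall through to the next instruction */
    }

    /* Update when the branch resolves late in the pipeline. */
    static void btb_update(uint32_t pc, uint32_t target, bool taken)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        e->valid = true;
        e->branch_addr = pc;
        e->target_addr = target;
        e->taken = taken;
    }

    int main(void)
    {
        uint32_t t;
        btb_update(0x0004, 0x0040, true);        /* hypothetical: "jc add1" at $+4 was taken */
        return btb_predict(0x0004, &t) ? 0 : 1;  /* next time: predict taken, target 0x0040  */
    }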
• Branch Processing
  • Look up the instruction in the BTB
  • If found, start fetching at the predicted destination
  • The branch condition is resolved early in the WB stage
    – If the prediction was correct: no branch penalty
    – If the prediction was incorrect: ~3 cycles are lost
      » which corresponds to ==> about 3 instructions
  • Update the BTB

HW: Study
http://www.x86.org/articles/branch/branchprediction.htm#fig3
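The ~3-cycle penalty translates directly into a CPI overhead. A quick back-of-the-envelope calculation, with assumed branch frequency and prediction accuracy, just to illustrate the formula extra CPI = branch fraction x misprediction rate x penalty:

    /* Misprediction cost estimate: extra CPI = f_branch * miss_rate * penalty. */
    #include <stdio.h>

    int main(void)
    {
        double f_branch  = 0.20;  /* assumed: 1 in 5 instructions is a branch   */
        double miss_rate = 0.10;  /* assumed: the BTB predicts 90% correctly    */
        double penalty   = 3.0;   /* cycles lost per misprediction (from above) */

        double extra_cpi = f_branch * miss_rate * penalty;
        printf("Extra CPI from mispredictions: %.3f\n", extra_cpi);   /* 0.060 */
        return 0;
    }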
BRANCH PREDICTION OPTIMIZATION

- Eliminating and reducing the number of branches
  • Removes the possibility of branch mispredictions
  • Reduces the number of BTB entries required

Use replacement instructions instead of branch instructions:
  • SETcc
  • CMOVcc or FCMOVcc
  !! These combine a conditional jump (JNE, JGE, etc.) and a MOV into one instruction
BRANCH PREDICTION OPTIMIZATION

Ex. 1:  if (A >= B) EBX = C2; else EBX = C1

Original (with a branch):
    cmp  A, B          ; A - B -> flags
    jge  E0            ; if A >= B (SF = OF), jump
    mov  ebx, C1       ; EBX = C1
    jmp  E1
E0: mov  ebx, C2       ; EBX = C2
E1:

Optimized (branch-free):
    xor   ebx, ebx     ; EBX = 0
    cmp   A, B         ; A - B -> flags
    setge bl           ; if A >= B, BL = 1
    dec   ebx          ; A >= B: EBX = 0, else EBX = 0FFFFFFFFh
    and   ebx, (C1-C2) ; A >= B: EBX = 0, else EBX = C1 - C2
    add   ebx, C2      ; A >= B: EBX = C2, else EBX = C1

Ex. 2:  ECX = ECX + Val; if the addition sets the carry (C = 1), then ECX = 0

Original (with a branch):
    xor  ebx, ebx      ; EBX = 0
    add  ecx, [Val]    ; ECX = ECX + Val
    jnc  Continue      ; if C = 0, skip
    mov  ecx, ebx      ; C = 1: ECX = EBX = 0
Continue:

Optimized (branch-free):
    xor   ebx, ebx     ; EBX = 0
    add   ecx, [Val]   ; ECX = ECX + Val
    cmovc ecx, ebx     ; if C = 1: ECX = EBX = 0, else ECX unchanged
Continue:
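The same transformations exist at the C level, and compilers typically emit SETcc/CMOVcc for code written this way. A sketch of both examples with hypothetical variable names:

    /* C-level equivalents of the branch-elimination examples (sketch). */
    #include <stdint.h>

    /* Ex. 1: ebx = (a >= b) ? c2 : c1, written without a branch. */
    uint32_t select_no_branch(int32_t a, int32_t b, uint32_t c1, uint32_t c2)
    {
        uint32_t mask = (uint32_t)0 - (uint32_t)(a >= b);   /* 0x00000000 or 0xFFFFFFFF */
        return (c1 & ~mask) | (c2 & mask);                  /* a >= b ? c2 : c1         */
    }

    /* Ex. 2: clamp to 0 on unsigned overflow instead of branching on the carry. */
    uint32_t add_clamp_on_carry(uint32_t ecx, uint32_t val)
    {
        uint32_t sum = ecx + val;
        return (sum < ecx) ? 0 : sum;   /* compilers usually turn this into CMOVC */
    }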
5. Cache Memory: 16 KB (8 KB + 8 KB)
• There are two separate 8 KB caches - one for code and one for data.
• Each cache has its own address-translation TLB, which translates linear addresses to physical addresses.
• Code cache:
  – 2-way set-associative
  – a 256-line (256-bit) path between the code cache and the prefetch buffers, permitting prefetching of 32 bytes (256/8) of instructions at a time
What are L1 and L2?
• Level-1 and Level-2 caches
• The cache memories in a computer are much faster than DRAM
• L1 is built on the microprocessor chip itself
• L2 is a separate chip
• The L2 cache is much larger than the L1 cache

Separate code and data caches
• On-chip 8 KB code cache and 8 KB write-back data cache
• Two-way set-associative
• MESI cache-coherence protocol
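A 2-way set-associative lookup can be sketched in a few lines. The geometry below (8 KB, 32-byte lines, 2 ways -> 128 sets) matches the sizes quoted above; replacement policy and the MESI state machine are deliberately omitted.

    /* 2-way set-associative lookup sketch: 8 KB, 32-byte lines -> 128 sets. */
    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 32u
    #define NUM_SETS  128u          /* 8 KB / 32 B / 2 ways */
    #define NUM_WAYS  2u

    struct cache_line {
        bool     valid;             /* MESI state collapsed to valid/invalid here */
        uint32_t tag;
    };

    static struct cache_line cache[NUM_SETS][NUM_WAYS];

    static bool cache_hit(uint32_t phys_addr)
    {
        uint32_t set = (phys_addr / LINE_SIZE) % NUM_SETS;    /* address bits 5..11  */
        uint32_t tag =  phys_addr / (LINE_SIZE * NUM_SETS);   /* address bits 12 up  */

        for (unsigned way = 0; way < NUM_WAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return true;                                  /* hit in this way     */
        return false;                                         /* miss: burst line fill */
    }

    int main(void)
    {
        cache[(0x1234 / LINE_SIZE) % NUM_SETS][0] =
            (struct cache_line){ true, 0x1234 / (LINE_SIZE * NUM_SETS) };
        return cache_hit(0x1234) ? 0 : 1;    /* same line -> hit */
    }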
6. Pentium Buses

Block diagram: a 64-bit bus interface connects the processor to the external bus. The code cache and the branch prediction unit feed the prefetch buffers, which issue instructions to the u and v integer pipelines (each with its own ALU, sharing the register set) and to the pipelined floating-point unit (multiply, add, divide). Internal paths of 32, 64 and 256 bits connect the register set, the data cache and the cache fill/prefetch paths.
• The Pentium processors have a 64-bit data bus
  – The Pentium is still a 32-bit CPU, because its registers are 32 bits wide
  – A standard single-transfer cycle can read or write up to 64 bits at a time (8 bytes - the width of the data bus)
• Burst read and burst write-back cycles are supported by the Pentium processors
  – Burst-mode cycles are used for cache operations and transfer 32 bytes in 4 clocks (4 x 8 bytes = 4 x 64 bits = 256 bits)
  – 32 bytes is the size of the Pentium cache line
  – For the Pentium, all cache operations are burst cycles
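The burst numbers work out as follows; the 66 MHz front-side bus clock is an assumed figure (the first Pentiums shipped at 60/66 MHz), and the result is a peak rate that ignores wait states.

    /* Burst line-fill arithmetic: 4 transfers x 64 bits = 32 bytes = one cache line. */
    #include <stdio.h>

    int main(void)
    {
        const double bus_mhz   = 66.0;   /* assumed FSB clock            */
        const int    beats     = 4;      /* clocks per burst             */
        const int    bus_bytes = 8;      /* 64-bit data bus = 8 bytes    */

        int    line_bytes = beats * bus_bytes;                      /* 32 bytes  */
        double fill_ns    = beats / bus_mhz * 1000.0;               /* ~60.6 ns  */
        double peak_mbs   = line_bytes / (fill_ns * 1e-9) / 1e6;    /* ~528 MB/s */

        printf("Line size: %d bytes, fill time: %.1f ns, peak burst: %.0f MB/s\n",
               line_bytes, fill_ns, peak_mbs);
        return 0;
    }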
• Prefetch buffers:
  – Four prefetch buffers inside the processor work as two independent pairs
  – When instructions are prefetched from the cache, they are placed into one set of prefetch buffers
  – The other set is used when a branch operation is predicted
  – The prefetch buffers send a pair of instructions to the instruction decoder
• Instruction Decode Unit:
  – Decoding occurs in two stages - Decode 1 (D1) and Decode 2 (D2)
  – D1 checks whether instructions can be paired
  – D2 calculates the addresses of memory-resident operands
• Control Unit:
  – Interprets the instruction word and the microcode entry point fed to it by the Instruction Decode Unit
  – Handles exceptions, breakpoints and interrupts
  – Controls the integer pipelines and the floating-point sequences
• Microcode ROM:
  – Stores the microcode sequences
