Itanium™ Processor Microarchitecture Overview

Intel® Itanium™ Processor Microarchitecture Overview

Harsh Sharangpani
Principal Engineer and IA-64 Microarchitecture Manager
Intel Corporation

Microprocessor Forum October 5-6, 1999


Unveiling the Intel® Itanium™ Processor Design

• Leading-edge implementation of IA-64 architecture for world-class performance
• New capabilities for systems that fuel the Internet Economy
• Strong progress on initial silicon


Itanium™ Processor Goals


• World-class performance on high-end applications
– High performance for commercial servers
– Supercomputer-level floating point for technical workstations
• Large memory management with 64-bit addressing
• Robust support for mission critical environments
– Enhanced error correction, detection & containment
• Full IA-32 instruction set compatibility in hardware
• Deliver across a broad range of industry requirements
– Flexible for a variety of OEM designs and operating systems

Deliver world-class performance and features for servers & workstations and emerging internet applications


EPIC Design Philosophy


• Maximize performance via EPIC hardware & software synergy
• Advanced features enhance instruction level parallelism
– Predication, Speculation, ...
• Massive hardware resources for parallel execution
• High performance EPIC building block

[Figure: Performance versus Time chart; labels: CISC, RISC, OOO / SuperScalar, VLIW]

Achieving performance at the most fundamental level


Itanium™ EPIC Design Maximizes SW-HW Synergy


Architecture Features programmed by compiler:
• Branch Hints
• Explicit Parallelism
• Register Stack & Rotation
• Predication
• Data & Control Speculation
• Memory Hints

Micro-architecture Features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Control: Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT
• Memory Subsystem: Three levels of cache: L1, L2, L3
• Speculation Deferral Management



Breakthrough Levels of Parallelism


6 instructions (M F I, M F I) provide 12 parallel ops/clock (SP: 20 parallel ops/clock) for digital content creation & scientific computing:
• Load 4 DP (8 SP) ops via 2 ld-pair
• 2 ALU ops (post incr)
• 4 DP FLOPS (8 SP FLOPS)
• 2 ALU ops

6 instructions (M I B, M I B) provide 8 parallel ops/clock for enterprise & Internet applications:
• 2 Loads + 2 ALU ops (post incr)
• 2 ALU ops
• 1 Branch Hint + 1 Branch instr

Itanium™ delivers greater instruction level parallelism than any contemporary processor
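
The per-clock totals quoted on this slide can be checked by adding up the individual operations. The short tally below is only an arithmetic sketch: the dictionary names and the labels are illustrative, and the counts are copied from the slide rather than measured.

```python
# Per-clock operation tally for the two six-instruction issue groups on the slide.
# Each M F I / M F I entry is (double-precision count, single-precision count).
MFI_MFI = {
    "values loaded via 2 ld-pair": (4, 8),
    "post-increment ALU ops on the loads": (2, 2),
    "FLOPs from the 2 F-slot instructions": (4, 8),
    "ALU ops from the 2 I-slot instructions": (2, 2),
}

# The M I B / M I B group has no floating-point work, so one count per entry suffices.
MIB_MIB = {
    "loads": 2,
    "post-increment ALU ops on the loads": 2,
    "ALU ops from the 2 I-slot instructions": 2,
    "branch hint": 1,
    "branch instruction": 1,
}

dp_ops = sum(dp for dp, _ in MFI_MFI.values())
sp_ops = sum(sp for _, sp in MFI_MFI.values())
server_ops = sum(MIB_MIB.values())

print(dp_ops, sp_ops, server_ops)  # 12 20 8
```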

Highlights of the Itanium™ Pipeline


• 6-Wide EPIC hardware under precise compiler control
– Parallel hardware and control for predication & speculation
– Efficient mechanism for enabling register stacking & rotation
– Software-enhanced branch prediction
• 10-stage in-order pipeline with cycle time designed for:
– Single cycle ALU (4 ALUs globally bypassed)
– Low latency from data cache
• Dynamic support for run-time optimization
– Decoupled front end with prefetch to hide fetch latency
– Aggressive branch prediction to reduce branch penalty
– Non-blocking caches and register scoreboard to hide load latency

Parallel, deep, and dynamic pipeline designed for maximum throughput

10 Stage In-Order Core Pipeline


Pipeline stages:
IPG (Instruction Pointer Generation), FET (Fetch), ROT (Rotate), EXP (Expand), REN (Rename), WLD (Word-Line Decode), REG (Register Read), EXE (Execute), DET (Exception Detect), WRB (Write-Back)

Front End (IPG, FET, ROT)
• Pre-fetch/Fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer

Instruction Delivery (EXP, REN)
• Dispersal of up to 6 instructions on 9 ports
• Reg. remapping
• Reg. stack engine

Operand Delivery (WLD, REG)
• Reg read + Bypasses
• Register scoreboard
• Predicated dependencies

Execution (EXE, DET, WRB)
• 4 single cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• NaT/Exception/Retirement


Front End: Prefetch & Fetch (IPG, FET, ROT)

• SW-triggered prefetch loads target code early using br hints
• Streaming prefetch of large blocks via hint on branch
• Early prefetch of small blocks via BRP instruction
• I-Fetch of 32 Bytes/clock feeds an 8-bundle decoupling buffer
• Buffer allows front-end to fetch even when back-end is stalled
• Hides instruction cache misses and branch bubbles

[Figure: IP MUX feeds I-Cache & ITLB fetch, which fills the 8 bundle buffer that feeds the back end; branch predictor structures & resteers loop back to the IP MUX; stages IPG, FET, ROT, EXP]

Aggressive instruction fetch hardware to feed a highly parallel, high performance machine
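
How a decoupling buffer hides fetch bubbles can be illustrated with a toy cycle-by-cycle model. This is a sketch under stated assumptions, not the Itanium front end: the function name `simulate` and its two boolean input lists are hypothetical, a bundle is taken as 16 bytes (so 32 bytes/clock is two bundles), and the back end is assumed to drain up to two bundles (six instructions) per clock with no bypass around the buffer.

```python
from collections import deque

BUFFER_BUNDLES = 8      # depth of the decoupling buffer, from the slide
BUNDLES_PER_FETCH = 2   # 32 bytes/clock at 16 bytes per bundle
BUNDLES_PER_ISSUE = 2   # 6 instructions = 2 bundles of 3 (assumed drain rate)

def simulate(fetch_ok, backend_ready):
    """Return how many bundles the back end consumes in each cycle.

    fetch_ok[c] is False when the front end has a bubble in cycle c
    (for example an I-cache miss); backend_ready[c] is False when the
    back end is stalled in cycle c.
    """
    buffer = deque()
    issued = []
    next_bundle = 0
    for can_fetch, can_issue in zip(fetch_ok, backend_ready):
        # The back end drains bundles that were buffered in earlier cycles.
        taken = 0
        if can_issue:
            while buffer and taken < BUNDLES_PER_ISSUE:
                buffer.popleft()
                taken += 1
        issued.append(taken)
        # The front end keeps fetching as long as there is room in the buffer.
        if can_fetch:
            room = BUFFER_BUNDLES - len(buffer)
            for _ in range(min(BUNDLES_PER_FETCH, room)):
                buffer.append(next_bundle)
                next_bundle += 1
    return issued

# The back end stalls for two cycles, then the front end has a two-cycle bubble.
trace = simulate(
    fetch_ok=[True, True, True, False, False, True],
    backend_ready=[False, False, True, True, True, True],
)
print(trace)  # [0, 0, 2, 2, 2, 0]
```

In this trace the bundles fetched during the back-end stall keep the back end fully fed through the fetch bubble in cycles 3 and 4; the buffer only runs dry in the last cycle.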

Front End: Branch Prediction (IPG, FET, ROT)

• Branch hints combine with predictor hierarchy to improve branch prediction: four progressive resteers
• 4 TARs programmed by “importance” hints
• 512-entry 2-level predictor provides dynamic direction prediction
• 64-entry BTAC contains footprint of upcoming branch targets (programmed by branch hints, and allocated dynamically)

[Figure: IP MUX feeds I-Cache & ITLB and the 8 bundle buffer across IPG, FET, ROT, EXP; resteer sources: Target Address Registers, Adaptive 2-Level Predictor, Br Target Address Cache, Return Stack Buffer, Loop Exit Corrector, Branch Address Calc 1, Branch Address Calc 2]

Intelligent branch prediction improves performance across all workloads
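
The slide does not describe how the 512-entry 2-level predictor is organized. As a generic illustration of two-level adaptive direction prediction only, the sketch below keeps a short outcome history per entry and uses it to select a 2-bit saturating counter. The class name, the direct-mapped indexing, the 4-bit history length, and the counter initialization are all assumptions made for the example, not details from the presentation.

```python
class TwoLevelPredictor:
    """Generic two-level adaptive predictor (illustrative organization)."""

    def __init__(self, entries=512, history_bits=4):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        # Level 1: recent taken/not-taken outcomes for each entry.
        self.histories = [0] * entries
        # Level 2: one 2-bit counter per history pattern, starting weakly not-taken.
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def _index(self, address):
        # Direct-mapped indexing by address; a real design may use tags or sets.
        return address % self.entries

    def predict(self, address):
        i = self._index(address)
        return self.counters[i][self.histories[i]] >= 2

    def update(self, address, taken):
        i = self._index(address)
        h = self.histories[i]
        c = self.counters[i][h]
        self.counters[i][h] = min(3, c + 1) if taken else max(0, c - 1)
        self.histories[i] = ((h << 1) | int(taken)) & self.history_mask


# A branch that alternates taken / not taken defeats a single 2-bit counter,
# but the history pattern makes it predictable after a short warm-up.
predictor = TwoLevelPredictor()
branch = 0x4000
outcomes = [i % 2 == 0 for i in range(24)]
correct_after_warmup = 0
for step, taken in enumerate(outcomes):
    if step >= 8 and predictor.predict(branch) == taken:
        correct_after_warmup += 1
    predictor.update(branch, taken)
print(correct_after_warmup, "of", len(outcomes) - 8)
```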

Instruction Delivery: Dispersal (EXP)

• Stop bits eliminate dependency checking
• Templates simplify routing
• 1st available dispersal from 6 syllables to 9 issue ports
– Keep issuing until stop bit, resource oversubscription, or asymmetry

[Figure: syllables S0 to S5 pass through the Dispersal Network (ROT, EXP) to issue ports M0, M1, I0, I1, F0, F1, B0, B1, B2]

Achieves highly parallel execution with simple hardware
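
The "first available" dispersal rule can be sketched as a simple loop over the syllables in program order. This is a simplified model under assumptions: the function name `disperse` and the `(unit_type, stop_after)` tuple encoding are hypothetical, each syllable is assumed to need exactly one port of its own type, and the port asymmetry mentioned on the slide is not modeled.

```python
# The nine issue ports named on the slide, grouped by unit type.
PORTS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}

def disperse(syllables):
    """Assign up to six syllables to issue ports for one clock.

    syllables: list of (unit_type, stop_after) tuples in program order,
    where stop_after marks a stop bit following that syllable.
    Returns the (syllable_index, port) pairs issued this clock.
    """
    free = {unit: list(ports) for unit, ports in PORTS.items()}
    issued = []
    for index, (unit, stop_after) in enumerate(syllables[:6]):
        if not free[unit]:
            break                      # resource oversubscription ends the group
        issued.append((index, free[unit].pop(0)))  # first available port
        if stop_after:
            break                      # stop bit ends the group
    return issued

# M F I / M F I with no stop bits fills six ports in one clock.
full = disperse([("M", False), ("F", False), ("I", False)] * 2)
# M I I / M I I runs out of I ports at the fifth syllable.
oversubscribed = disperse([("M", False), ("I", False), ("I", False)] * 2)
print([port for _, port in full])
print(len(oversubscribed))
```

Because the compiler marks independent groups with stop bits, a loop like this never has to compare register names between syllables; it only counts ports.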



Instruction Delivery: Stacking (REN)

• Massive 128 register file accommodates multiple variable sized procedures via stacking
• Eliminates most register spill / fill at procedure interfaces
• Achieved transparently to the compiler
• Using register remapping via parallel adders
• Stack engine performs the few required spill/fills

[Figure: Integer, FP, & Predicate Renamers on ports M0, M1, I0, I1, F0, F1 between EXP and WLD; the Stack Engine injects spill/fill operations and can stall]

Unique register model enables faster execution of object-oriented code
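
Register remapping "via adders" can be pictured as adding a frame base to each stacked register number. The sketch below assumes the IA-64 split of 32 static and 96 stacked general registers and a simple wrap-around addition; the function name `remap` and the `frame_base` parameter are illustrative, and register rotation and the stack engine's spill/fill traffic are left out.

```python
NUM_STATIC = 32    # r0-r31 are not renamed
NUM_STACKED = 96   # r32-r127 are renamed relative to the current frame

def remap(virtual_reg, frame_base):
    """Map a register number in the instruction to a physical register."""
    if virtual_reg < NUM_STATIC:
        return virtual_reg
    offset = (frame_base + virtual_reg - NUM_STATIC) % NUM_STACKED
    return NUM_STATIC + offset

# A caller whose frame starts at base 0 keeps 8 locals in r32-r39 and puts
# its outgoing arguments in r40 onwards. On the call, the callee's frame
# base advances past the caller's locals, so the callee sees the first
# argument as its own r32 without any copy or memory traffic.
caller_base = 0
callee_base = caller_base + 8
print(remap(40, caller_base), remap(32, callee_base))  # 40 40
```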

Operand Delivery (WLD, REG)

• Multiported register file + mux hierarchy delivers operands in REG
• Unique “Delayed Stall” mechanism used for register dependencies
• Avoids pipeline flush or replay on unavailable data
• Stall computed in REG, but core pipeline stalls in EXE
• Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read

[Figure: 128 Entry Integer Register File (8R / 6W) feeding Bypass Muxes and ALUs across WLD, REG, EXE; Dependency Control with Src/Dst Scoreboard Comparators, Preds, OLM comparators, and Delayed Stall]

Avoids pipeline flush to enable a more effective, higher throughput pipeline

Predicate Delivery (EXE)

• All instructions read operands and execute
• Canceled at retirement if predicates off
• Predicates generated in EXE (by cmps), delivered in DET, & feed into: retirement, branch execution and dependency detection
• Smart control network cancels false stalls on predicated dependencies
• Dependency detection for cancelled producer/consumer (REG)

[Figure: Predicate Register File Read and Bypass Muxes across REG, EXE, DET; I-Cmps and F-Cmps generate predicates; outputs To Dependency Detect (x6), To Branch Execution (x3), To Retirement (x6)]

Higher performance through removal of branch penalties in server and workstation applications
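
The "execute always, cancel at retirement" behavior can be shown with a small software model. This is a sketch of the idea, not of the hardware: the function name `execute_then_retire`, the tuple encoding of instructions, and the register and predicate names are hypothetical.

```python
def execute_then_retire(instructions, regs, preds):
    """Run predicated instructions in two phases.

    instructions: list of (qualifying_predicate, dest_reg, function_of_regs).
    """
    # Execute: every instruction reads its operands and computes a result,
    # regardless of what its qualifying predicate turns out to be.
    computed = [(qp, dest, fn(regs)) for qp, dest, fn in instructions]
    # Retire: results whose predicate is off are cancelled instead of written.
    for qp, dest, value in computed:
        if preds[qp]:
            regs[dest] = value
    return regs

# If-conversion of "r3 = min(r1, r2)": a compare sets two complementary
# predicates, and two predicated moves replace the branch.
regs = {"r1": 3, "r2": 7, "r3": 0}
preds = {"p1": regs["r1"] < regs["r2"]}
preds["p2"] = not preds["p1"]
program = [
    ("p1", "r3", lambda r: r["r1"]),
    ("p2", "r3", lambda r: r["r2"]),
]
execute_then_retire(program, regs, preds)
print(regs["r3"])  # 3
```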

Parallel Branch Execution (DET, WRB)

• Speculation + predication result in clusters of branches
• Execution of 3 branches/clock optimizes for clustered branches
• Branch execution in DET allows cmps --> branches in same issue group

[Figure: across REG, EXE, DET, WRB: BR read of branch bundle and IP-relative address, Pred. Delivery to 3 branches, Address Validation and Direction Validation against the most recent branch prediction info, Resteer IP, Retirement]

Parallel branch hardware extends performance benefits of EPIC technology

Speculation Hardware (DET, WRB)

• Control Speculation support requires minimal hardware
• Computed memory exception delivered with data as tokens (NaTs)
• NaTs propagate through subsequent executions like source data
• Data Speculation enabled efficiently via ALAT structure
• 32 outstanding advanced loads
• Indexed by reg-ids, keeps partial physical address tag
• 0 clk checks: dependent use can be issued in parallel with check

[Figure: across EXE, DET, WRB: Address into TLB & Memory Subsystem; Physical Address into 32-entry ALAT giving Adv Ld Status; Spec. Ld. Status (NaT) and Exception Logic; Check Instruction drives ALAT Check & Exception]

Efficient elimination of memory bottlenecks
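
Both speculation mechanisms can be mimicked in a few lines of software. The sketch below is a conceptual model under assumptions: `NAT`, `speculative_load`, `add`, and the `ALAT` class with its methods are hypothetical names; the model stores full addresses where the slide says the hardware keeps a partial physical address tag; and evicting the oldest entry when the table is full is an assumed policy, not one stated in the presentation.

```python
NAT = object()  # stands in for the "Not a Thing" token

def speculative_load(memory, address):
    """Control speculation: a faulting load returns a NaT instead of trapping."""
    return memory.get(address, NAT)

def add(a, b):
    """NaTs propagate through later operations like ordinary source data."""
    return NAT if a is NAT or b is NAT else a + b


class ALAT:
    """Data speculation: tracks advanced loads until they are checked."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}  # register id -> load address, in insertion order

    def advanced_load(self, reg_id, address):
        self.entries.pop(reg_id, None)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))   # assumed replacement policy
            del self.entries[oldest]
        self.entries[reg_id] = address

    def store(self, address):
        # A store to the same address invalidates matching advanced loads.
        for reg_id in [r for r, a in self.entries.items() if a == address]:
            del self.entries[reg_id]

    def check(self, reg_id):
        # True means the advanced load is still valid; False means recover.
        return reg_id in self.entries


memory = {0x10: 5}
good = add(speculative_load(memory, 0x10), 1)
bad = add(speculative_load(memory, 0x99), 1)

alat = ALAT()
alat.advanced_load(reg_id=4, address=0x10)
alat.store(0x20)                 # unrelated store: entry survives
still_valid = alat.check(4)
alat.store(0x10)                 # conflicting store: entry is removed
valid_after_conflict = alat.check(4)
print(good, bad is NAT, still_valid, valid_after_conflict)  # 6 True True False
```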

Floating Point Features


• Native 82-bit hardware provides support for multiple numeric models
• 2 Extended precision pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Performance for security and 3-D graphics
• 2 Additional single-precision FMACs for 8 SP FLOPs/cycle (SIMD)
• Efficient use of hardware: Integer multiply-add and s/w divide
• Balanced with plenty of operand bandwidth from registers / memory

[Figure: 4 Mbyte L3 Cache, L2 Cache, and 128 entry 82-bit RF (even / odd banks); labels: 2 DP Ops/clk, 4 DP Ops/clk (2 x Fld-pair), 2 stores/clk, 6 x 82-bit operands, 2 x 82-bit results]

Itanium™ processor delivers industry-leading floating point performance
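
The peak rates and register bandwidth quoted on this slide follow from counting a fused multiply-add as two floating-point operations with three source operands. The tally below is an arithmetic sketch; treating the single-precision figure as four FMAC lanes (the two main units plus the two additional ones) is an assumption about how the 8 SP FLOPs/cycle arises, and the variable names are illustrative.

```python
FLOPS_PER_FMAC = 2      # one multiply plus one add
OPERANDS_PER_FMAC = 3   # a * b + c reads three sources

main_fmacs = 2          # extended / double precision units
extra_sp_fmacs = 2      # additional single-precision units (SIMD)

dp_flops_per_cycle = main_fmacs * FLOPS_PER_FMAC
sp_flops_per_cycle = (main_fmacs + extra_sp_fmacs) * FLOPS_PER_FMAC
operands_per_cycle = main_fmacs * OPERANDS_PER_FMAC
results_per_cycle = main_fmacs

print(dp_flops_per_cycle, sp_flops_per_cycle)   # 4 8
print(operands_per_cycle, results_per_cycle)    # 6 2
```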

Reliability & Availability Features


• Extensive Parity/ECC coverage on processor and bus
– L3 MESI state bits sparsely encoded to protect the M-state
– Frontside bus uses special ECC encoding for consecutive 4-bit errors

[Figure: protection map covering ITLB, DTLB, L1I Data, L1D Data, L2 Data, L2 Tag, L3 Data, L3 Tag, Back Side Bus Data, Back Side Bus Command/Address, Front Side Bus Data, Front Side Bus Command/Address; legend: 1x ECC correction, 2x ECC detection; Parity coverage w/ enhanced MCA]

Comprehensive integrity for high-end applications


Enhanced Machine Check Architecture

Error Type                                                        Signaling  Example                    Benefit
CONTINUE: Corrected by CPU; current process continues             CMCI       1xECC L2 data              Enhanced Reliability & Availability
CONTINUE: Corrected by firmware; current process continues        CMCI       I-cache parity             Enhanced Reliability & Availability
RECOVER: Affected process terminated by f/w to OS; OS is stable   LMCA       Poisoned data              Enhanced Availability
CONTAIN: Error is contained, affected node is taken off-line      GMCA       System Bus Address parity  Enhanced Reliability

Itanium™ processor delivers the mission-critical reliability required by E-business

IA-32 Compatibility
• Itanium™ directly executes IA-32 binary code
– Shared caches & execution core increases area efficiency
– Dynamic scheduler optimizes performance on legacy binaries
• Seamless Architecture allows full Itanium performance on IA-32 system functions

[Figure: Shared I-Cache feeds IA-32 Fetch & Decode, then the IA-32 Dynamic Scheduler, then the Shared Execution Core, then IA-32 Retirement & Exceptions; the IA-32 blocks are labeled Compatibility]

Full, efficient IA-32 instruction compatibility in hardware

Intel® Itanium™ Processor Block Diagram


[Figure: block diagram. L1 Instruction Cache and Fetch/Pre-fetch Engine with ITLB (ECC); Branch Prediction; IA-32 Decode and Control; Instruction Queue (8 bundles); 9 Issue Ports (B B B M M I I F F); Register Stack Engine / Re-Mapping; Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers; Scoreboard, Predicate, NaTs, Exceptions; Branch Units; Integer and MM Units; Dual-Port L1 Data Cache and DTLB (ECC); ALAT; Floating Point Units with SIMD FMACs; L2 Cache (ECC); L3 Cache (ECC); Bus Controller (ECC)]


Itanium™ Processor Status


• Solid progress in weeks following Itanium™ first silicon
– More than 4 operating systems running today
– Demonstrated 64-bit Windows 2000 and Linux running apps
– Initial engineering samples shipped to OEMs
• Comprehensive functional validation underway
– Thorough pre-silicon functional testing included OS kernel on Itanium logic model
– Testing includes 7 OSs & many key enterprise and scientific apps
– Multiple Intel and OEM test platform configurations (from 2 to 64 processors)
• Planned steps to production in mid 2000
– Completion of functional testing phase through end of 1999
– Performance testing/tuning accelerates in 1H’00
– Broad prototype system deployment by Intel and OEMs early 2000


Intel® Itanium™ Processor Summary


• High performance leading-edge design
– EPIC technology provides a breakthrough in hardware/software synergy
– Predication, speculation, register stacking, & large L3 for high-end servers
– Supercomputer-level GFLOPs performance for technical workstations
– 64-bit memory addressability for large data sets
• Mission-critical reliability and availability
– Machine check implementation maximizes error containment and correction
– Comprehensive data integrity for e-Business, Internet and enterprise servers
• Full IA-32 instruction level compatibility in hardware
• Strong Itanium™ silicon progress and industry support
