Multi-Core Processor Technology: Maximizing CPU Performance in a Power-Constrained World

Paul Teich, Business Strategy, CPG Server/Workstation, AMD
paul.teich@amd.com

The Issues
Silicon designers can choose a variety of methods to increase processor performance
Commercial end-customers are demanding
More capable systems with more capable processors
That new systems stay within their existing power/thermal infrastructure

Processor frequency and power consumption seem to be scaling in lockstep
How can the industry-standard PC and Server industries stay on our historic performance curve without burning a hole in our motherboards?
This session is not about process technology

Session Outline
Definition: What is a processor?
Core Design
System Architecture
Manufacturing, Power, and Thermals
Multi-Core Processor Architecture
Performance Impacts

What is a Processor?
A single chip package that fits in a socket
≥1 core (not much point in <1 core)
Cores can have functional units, cache, etc. associated with them, just as today
Cores can be fast or slow, just as today

Shared resources
More cache
Other integration: Northbridge, memory controllers, high-speed serial links, etc.

One system interface no matter how many cores
Number of signal pins doesn't scale with number of cores

A Representative Multi-Core Processor


Dual-core AMD Opteron processor is 199mm² in 90nm
Single-core AMD Opteron processor is 193mm² in 130nm

Multi-Core Processor Architecture

Core Design
Frequency
Is only as good as the rest of the core architecture
[Figure: AMD Opteron processor core architecture. Blocks shown: Fetch, Branch Prediction, L1 Icache (64KB), Scan/Align/Decode, Fastpath, Microcode Engine, Instruction Control Unit (72 entries), Int Decode & Rename, three reservation stations with AGU and ALU units plus MULT, FP Decode & Rename, 36-entry FP scheduler, FADD, FMUL, FMISC, 44-entry Load/Store Queue, L1 Dcache (64KB)]

Core Design
Functional units
Superscalar is known territory
Diminishing returns for adding more functional blocks
Alternatives like VLIW have been considered and rejected by the market
Single-threaded architectural performance is pegged

Data paths
Increasing bandwidth between functional units in a core makes a difference
Such as comprehensive 64-bit design, but then where to?

Core Design
Pipeline
Deeper pipeline buys frequency at the expense of increased cache miss penalty and lower instructions per clock
Shallow pipeline gives better instructions per clock at the expense of frequency scaling
Max frequency per core requires deeper pipelines
Industry is converging on middle ground: 9 to 11 stages
Successful RISC CPUs are in the same range

Cache
Cache size buys performance at the expense of die size; it's a direct hit to manufacturing cost
Deep pipeline cache miss penalties are reduced by larger caches
Not always the best match for shallow pipeline cores, as cache miss penalties are not as steep
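
The pipeline and cache trade-offs above can be sketched with a simple timing model. This is an illustration, not something taken from the slides: it assumes memory latency is fixed in nanoseconds, so the miss penalty in cycles grows with clock frequency. The function name and every number (miss rates, the 100ns latency, the clock speeds) are hypothetical.

```python
def time_per_instruction_ns(base_cpi, freq_ghz, misses_per_instr, mem_latency_ns):
    """Average time per instruction under a simple stall model (illustrative only)."""
    # Memory latency is fixed in ns, so a faster clock waits more cycles per miss.
    miss_penalty_cycles = mem_latency_ns * freq_ghz
    effective_cpi = base_cpi + misses_per_instr * miss_penalty_cycles
    return effective_cpi / freq_ghz


# Hypothetical cores, not measurements of any real processor.
baseline = time_per_instruction_ns(1.0, 2.0, 0.01, 100.0)
faster_clock = time_per_instruction_ns(1.0, 4.0, 0.01, 100.0)
bigger_cache = time_per_instruction_ns(1.0, 2.0, 0.005, 100.0)

print(f"baseline:         {baseline:.2f} ns/instruction")
print(f"2x clock:         {faster_clock:.2f} ns/instruction")
print(f"half the misses:  {bigger_cache:.2f} ns/instruction")
```

In this toy setup, doubling the clock improves time per instruction by only about 1.2x, while halving the miss rate at the original clock gives about 1.5x. The higher the clock, the more each miss costs in cycles, which is why deep pipelines lean on larger caches.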

Manufacturing
Moore's Law isn't dead, more transistors for everyone!
But it doesn't really mention scaling transistor power

Chemistry and physics at nano-scale
Stretching materials science
Voltage doesn't scale yet
Transistor leakage current is increasing

As manufacturing economies and frequency increase, power consumption is increasing disproportionately
There are no process or architectural quick-fixes

Transistors Are Not Free
The number of transistors in a core determines basic power consumption
Architectural efficiency matters a lot when designing new cores
More functional units mean more transistors
Deeper pipelines mean more transistors
Larger caches mean more transistors

Static Current vs. Frequency
Non-linear as processors approach max frequency
[Figure: static current (y axis) versus frequency (x axis). Regions labeled: Embedded Parts; Fast, Low Power; Fast, High Power; Very High Leakage and Power]

Power vs. Frequency
In AMD's process, for 200MHz frequency steps, two steps back on frequency cuts power consumption by ~40% from maximum frequency
[Figure: relative power consumption (y axis) versus frequency steps n-5 through n (x axis)]
(Gross relative numbers summarized from a mountain of real data)
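
To put the ~40% figure in perspective, the sketch below compares one core at maximum frequency with the same core two steps back. It assumes n = 2.4GHz, so n-2 = 2.0GHz (the single-core and dual-core frequencies quoted in the benchmark slide). As a simplification the deck does not make, it also assumes single-threaded performance scales linearly with frequency. The function name is hypothetical.

```python
def perf_per_watt_gain(max_ghz, step_ghz, steps_back, power_cut):
    """Relative perf, power, and perf-per-watt of a core clocked below its maximum."""
    lower_ghz = max_ghz - steps_back * step_ghz
    relative_perf = lower_ghz / max_ghz   # assumption: perf scales with frequency
    relative_power = 1.0 - power_cut      # a ~40% cut leaves ~0.6 of max power
    return relative_perf, relative_power, relative_perf / relative_power


perf, power, gain = perf_per_watt_gain(
    max_ghz=2.4, step_ghz=0.2, steps_back=2, power_cut=0.40
)
print(f"relative performance: {perf:.2f}")
print(f"relative power:       {power:.2f}")
print(f"perf-per-watt gain:   {gain:.2f}x")
```

Under these assumptions the core gives up about 17% of its frequency for about 40% less power, roughly a 1.39x gain in performance per watt for that core.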

Thermal Density Decreases
Hot spots
Twice as many as in single-core
Farther apart than in single-core
With freq delta, cooler than in single-core

CA same for single-core at n and dual-core at n-2
Larger die spreads heat more evenly in package
Use identical heat sink, slightly better cooling with dual-core
Works for this processor generation and next, CA changes over major generations

Thermal diode accuracy becomes an issue with dual-core

Total Effect on Dual-Core Frequencies
Substantially lower power with lower frequency
Thermals easier to handle at any frequency
Result is dual-core running at n-2 in same thermal envelope as single-core running at top speed

Multi-Core Processor Architecture
Why integrate?
Most functions are really small compared to the cores and cache
All integrated logic runs at core frequency regardless of I/O speeds

What to integrate?
Northbridge crossbar switch is key
Look for innovation and differentiation in how cores are connected on-chip
Must integrate Northbridge to integrate anything else
Memory controller to reduce memory latency and further reduce the need for cache
High-speed serial links for system I/O

What not to integrate?
Most Southbridge functions
Graphics

AMD Opteron Processor Integrated Northbridge
[Figure: block diagram. CPU 0 and CPU 1, each with Data, Probes, Requests, and Int paths; System Request Interface (SRI); Advanced Programmable Interrupt Controller (APIC); Crossbar (XBAR); Memory Controller (MCT); DRAM Controller (DCT) with RAS/CAS/Cntl and DRAM Data; HyperTransport Link 0, Link 1, and Link 2. Legend: 64-bit Data; 64-bit Command/Address; 16-bit Data/Command/Address]

Multi-Core: Where Processor and System Collide
Scales performance
Dedicated resources for two simultaneous threads
Multiple cores will contend for memory and I/O bandwidth
Northbridge is the bottleneck
Integrating Northbridge eliminates much of the bottleneck
Northbridge architecture has significant impact on performance

Cores, cache and Northbridge must be balanced for optimal performance

More aggregate performance for:
Multi-threaded apps
Transactions: many instances of same app
Multi-tasking

Thread scheduling handled by OS
BIOS notifies Windows of thread execution resources
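
Because the OS owns thread scheduling, an application mostly needs to ask how many logical processors the OS exposes and size its worker pool to match. A minimal sketch in Python follows; the deck itself shows no code, and `pick_worker_count`, its `reserve` parameter, and `run_parallel` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(reserve=0):
    """Number of workers to start, based on the logical processors the OS reports."""
    logical = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, logical - reserve)


def run_parallel(func, items):
    """Apply func to each item using one worker per reported logical processor."""
    # The OS decides which core runs each worker thread.
    with ThreadPoolExecutor(max_workers=pick_worker_count()) as pool:
        return list(pool.map(func, items))


print(f"logical processors reported by the OS: {os.cpu_count()}")
print(run_parallel(lambda x: x * x, [1, 2, 3]))
```

In CPython, threads suit I/O-bound work; CPU-bound work usually needs separate processes to occupy multiple cores.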

Early Benchmark Estimates

Decoder
2P/2C: 2 processors, single-core
4P/4C: 4 processors, single-core
2P/4C: 2 processors, dual-core
4P/8C: 4 processors, dual-core

SPECint_rate2000 Peak, Win SP1 (% comparison to base 2P/2C)
Platform   Score
4P/8C      308
4P/4C      184
2P/4C      163
2P/2C      100

OLTP Workload (% comparison to base 2P/2C)
Platform   Score
4P/8C      244
4P/4C      165
2P/4C      148
2P/2C      100

Frequencies
Single-core = 2.4GHz
Dual-core = 2.0GHz

Identical system configs
Memory, disks, network, etc.
Early dual-core validation system used, different motherboards

SPEC and the benchmark name SPECint are registered trademarks of the Standard Performance Evaluation Corporation. SPEC scores for AMD Opteron Model 270 and 870 based systems are estimated.
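
One way to read these estimates is to compare each score with an ideal in which throughput scales with cores times frequency. That ideal is an assumption introduced here for illustration, not a claim from the deck, and the names below are hypothetical. The scores and frequencies are the ones listed above.

```python
SINGLE_GHZ, DUAL_GHZ = 2.4, 2.0

# platform -> (total cores, core frequency in GHz)
PLATFORMS = {
    "2P/2C": (2, SINGLE_GHZ),
    "4P/4C": (4, SINGLE_GHZ),
    "2P/4C": (4, DUAL_GHZ),
    "4P/8C": (8, DUAL_GHZ),
}

# Estimated scores from the slide, as % of the 2P/2C base.
SCORES = {
    "SPECint_rate2000": {"2P/2C": 100, "2P/4C": 163, "4P/4C": 184, "4P/8C": 308},
    "OLTP": {"2P/2C": 100, "2P/4C": 148, "4P/4C": 165, "4P/8C": 244},
}


def scaling_efficiency(workload, platform, base="2P/2C"):
    """Measured score divided by the ideal cores-times-frequency score."""
    cores, ghz = PLATFORMS[platform]
    base_cores, base_ghz = PLATFORMS[base]
    ideal = 100.0 * (cores * ghz) / (base_cores * base_ghz)
    return SCORES[workload][platform] / ideal


for workload in SCORES:
    for platform in ("2P/4C", "4P/4C", "4P/8C"):
        eff = scaling_efficiency(workload, platform)
        print(f"{workload:18s} {platform}: {eff:.0%} of ideal")
```

Against this ideal, SPECint_rate2000 on 2P/4C comes out near 98% and OLTP on 4P/8C near 73%, which fits the deck's point that cores contend for memory and I/O bandwidth as their number grows.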

Call to Action
Most application software doesn't need to do anything to benefit from dual-core
Be aware that, for a processor within a given power envelope
Fewer cores will clock faster than more cores
Single-threaded performance-sensitive applications

More cores will out-perform fewer cores for
Multi-threaded applications
Multi-tasking response times
Transaction processing
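
The trade-off between fewer fast cores and more slow cores can be sketched with Amdahl's law. The deck does not use this model; it is swapped in here as a standard first approximation. The function names are hypothetical, and the 2.4GHz and 2.0GHz values are the single-core and dual-core frequencies from the benchmark slide.

```python
def relative_speed(freq_ghz, cores, parallel_fraction):
    """Amdahl's-law speed of one workload, in single-core GHz equivalents."""
    serial_fraction = 1.0 - parallel_fraction
    return freq_ghz / (serial_fraction + parallel_fraction / cores)


def break_even_parallel_fraction(fast_ghz, slow_ghz, cores):
    """Parallel fraction at which `cores` slower cores match one faster core."""
    # Solve slow / (1 - p + p / cores) == fast for p.
    return (1.0 - slow_ghz / fast_ghz) / (1.0 - 1.0 / cores)


p = break_even_parallel_fraction(fast_ghz=2.4, slow_ghz=2.0, cores=2)
print(f"break-even parallel fraction: {p:.2f}")
print(f"single-core, serial code:     {relative_speed(2.4, 1, 0.0):.2f}")
print(f"dual-core, serial code:       {relative_speed(2.0, 2, 0.0):.2f}")
print(f"dual-core, fully parallel:    {relative_speed(2.0, 2, 1.0):.2f}")
```

With these frequencies the model puts the break-even point at about one third of the work being parallelizable. It ignores memory and I/O contention, which the deck flags as a real constraint.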

Processor architecture impacts multi-core performance
Process technology is only the ante
Integration enables a balanced high-performance architecture

Community Resources
Windows Hardware & Driver Central (WHDC)
www.microsoft.com/whdc/default.mspx

Technical Communities
www.microsoft.com/communities/products/default.mspx

Non-Microsoft Community Sites
www.microsoft.com/communities/related/default.mspx

Microsoft Public Newsgroups
www.microsoft.com/communities/newsgroups

Technical Chats and Webcasts
www.microsoft.com/communities/chats/default.mspx
www.microsoft.com/webcasts

Microsoft Blogs
www.microsoft.com/communities/blogs

Additional Resources
Email: paul.teich@amd.com

WinHEC Presentations
x86 Everywhere, Chris Herring, AMD
Maximizing Desktop Application Performance on Dual-Core PC Platforms, Rich Brunner, AMD

Web Resources
AMD: http://www.amd.com/
AMD Multi-Core: http://www.amd.com/multicore/
AMD Opteron Processor: http://www.amd.com/opteron/
AMD Multi-Core White Paper: http://enterprise.amd.com/downloadables/33211A_Multi-Core_WP.pdf
HyperTransport Consortium: http://www.hypertransport.org/
