
Technische Universität München

Multicore Architectures
CLGrid5 Workshop, Valparaiso, Chile
September 29th, 2008

Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Institut für Informatik, Technische Universität München

LRR-TUM, September 9th, 2008

How did it all evolve?


Mechanical devices


Abacus, ca. 3000 BC (?)
1642: Blaise Pascal (addition & subtraction)
1822: Charles Babbage

Electromechanical Machines Based on Relays


Konrad Zuse (1910-1995)


Zuse Z3 & Z4


Z1 (1938), Z3 (1941): the first freely programmable machines in the world
The Z3 and its successor Z4 can be seen at the Deutsches Museum!

Electronic Computers First Generation


No mechanical components any more: vacuum tubes


Principle
Basic element: the triode, a diode whose current flow is made controllable by a grid (on / off)

1946: ENIAC machine


Electronic Numerical Integrator And Computer

ENIAC (1946)


Organization Question:


How to structure / organize computational machines?
How to control and steer execution?

Original work (1946)


Burks, Goldstine, von Neumann: Preliminary discussion of the logical design of an electronic computing instrument.

Result:
von Neumann Architecture: the most dominant architecture even today!

The IAS machine


Developed in 1952 by John von Neumann

First machine based on his design principle
Institute for Advanced Study (IAS) computer

Technology Development
Vacuum tubes replaced by transistors: smaller, more power efficient
DEC PDP-1, IBM 7094
Still large machines


Next step: Integrated Circuits


Many transistors packed on one die
High density & reliability, low power
IBM 360 family & first Intel chips

Many subsequent improvements

1971: 1st Microprocessor: Intel 4004
~2300 transistors, 108 kHz, 10,000 nm

Intel 4004 First Microprocessor



Pentium 4 (55 Million Transistors)



Intel Montecito


1.7 Billion Transistors, Intel's 1st Dual-Core Itanium, 90nm


Core 2 (Woodcrest)


290 Million Transistors, 2.4-3 GHz, 65nm


Core i7 (Nehalem)


731 Million Transistors, 45nm


AMD Shanghai


705 Million Transistors, 45nm


Intel Larrabee


... Transistors, 45nm


And the Future ... ?


Large, scalar cores for high single-thread performance
Scalar plus many core for highly threaded workloads
Multi-core array: CMP with ~10 cores; dual core; symmetric multithreading
Many-core array: CMP with 10s-100s of low-power scalar cores; capable of TFLOPS+; full system-on-chip; servers, workstations, embedded

Evolution

What happens inside the Core?


Simplified instruction cycle: FI (Fetch Instruction), DI (Decode Instruction), FD (Fetch Data), PD (Process Data), WD (Write Data)

[Diagrams: the FI-DI-FD-PD-WD cycle over time under standard (sequential) execution, pipelined execution, superscalar execution, and VLIW execution]

From Single- to Multi-Core


Netburst: >30 pipeline stages: no longer feasible...
2005: move to dual core (and fewer pipeline stages)
2, 4, 6, 8, ... cores
But: the free lunch is over!
The good news: this is good for parallel programmers.

Impact


What does multi-core mean in particular?
Is it just an SMP system, i.e. programmable with OpenMP, Pthreads, etc.?
Or does it differ from SMP systems?
How do multi-core systems fit into clusters?

Just an SMP system?


Partly, but those issues will be covered by my colleagues...

Is Multi-Core different?


Yes, with regard to memory hierarchies and interconnect!

The Memory Wall


[Plot: CPU performance and memory access speed over time; the CPU curve rises far more steeply]

Processor speed is increasing much faster than memory speed:
Microprocessors: 50-100% per year (Moore's law)
DRAMs: 7-15% per year
The gap is widening

Caches


Main memory: problems with bandwidth & latency
Memory bus located off-chip / on board: physical boundaries
Result: memory is too far away

Cache: memory closer to the CPU which holds a subset of the main memory
+ Lower latency, higher bandwidth, on-chip
- Which subset should be present?
- Can we manage this transparently?

Memory Hierarchy
Compromise between price and performance
Memory component hierarchy with different access speeds and capacities:
Registers → Cache → Main Memory → Hard Disk → Archive
(speed decreases, capacity increases down the hierarchy)

General Principle of Caches


[Diagram: without caches, the CPU issues every transaction to the slow memory, repeating the redundant (red) blocks; with caches, a fast cache between CPU and slow memory serves those repeated accesses]

Terminology


Accesses to memory can be a:
Cache hit: data is in the cache
Cache miss: data has to be retrieved from memory. Cache misses are expensive!
Cache size: total size of the cache
Cache line size/length: caches do not store individual bytes/words (management overhead too high); the unit of storage is the cache line, a consecutive number of bytes in memory

Terminology


Replacement policy: which cache line to evict if new space is needed?
Optimal: data not used in the near future; make a prediction from the past
Often used: Least Recently Used (LRU)

How are writes treated?
Write-back caching: writes are stored in the cache; data is written back to memory on line eviction
Write-through caching: data is written directly to main memory
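To make the LRU idea concrete, a toy sketch in C; the line structure, timestamp field, and cache size are illustrative assumptions, not taken from the slides:

    /* Toy LRU eviction: choose the cache line whose last access
       lies furthest in the past (illustrative sketch). */
    #include <stddef.h>

    #define NUM_LINES 1024

    typedef struct {
        unsigned long tag;
        unsigned long last_access;  /* logical "time" of the most recent access */
        int valid;
    } cache_line_t;

    size_t lru_victim(const cache_line_t lines[NUM_LINES])
    {
        size_t victim = 0;
        for (size_t i = 1; i < NUM_LINES; i++) {
            if (!lines[i].valid)
                return i;           /* invalid line: free slot, use it directly */
            if (lines[i].last_access < lines[victim].last_access)
                victim = i;         /* older access, better eviction candidate */
        }
        return victim;
    }

A real cache implements this in hardware and usually only approximates LRU within each set.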

Cache Associativity


Caches are a collection of cache lines:
Equally sized & much smaller than memory

Question of the mapping between CLs and memory:
Where to look for a cache hit?
Where to put a newly loaded cache line?

Free mapping is very costly:
Difficult lookup function for cache accesses
Therefore, the target CLs for a particular access are restricted: only a certain number of CLs is possible
This is the associativity of the cache

Cache Structures
Where is a block stored in the cache?
Main memory: blocks B_j, j = 0, 1, ..., (n-1); capacity n·b = 2^(s+w) words
Cache: blocks Z_i (cache lines), i = 0, 1, ..., (m-1); capacity m·b = 2^(r+w) words
Mapping from {B_j} to {Z_i}, with n >> m, n = 2^s, m = 2^r
Each block contains b words, with b = 2^w

Cache Structures


Direct-Mapped Cache
Direct mapping of n/m = 2^(s-r) memory blocks into one cache line:
Mapping: B_j → Z_i, where i = j mod m
[Diagram: main memory blocks B0-B15 mapped onto cache lines Z0-Z3, each line storing a tag to identify its current block]
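The fixed mapping translates directly into code; a minimal sketch (the function names are made up for illustration):

    /* Direct-mapped placement: block j of main memory always lands
       in cache line j mod m; the stored tag j / m identifies which
       of the n/m candidate blocks currently occupies the line. */
    unsigned direct_mapped_line(unsigned j, unsigned m)
    {
        return j % m;   /* i = j mod m */
    }

    unsigned direct_mapped_tag(unsigned j, unsigned m)
    {
        return j / m;   /* distinguishes the 2^(s-r) blocks sharing line i */
    }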

Cache Structures
Direct-Mapped Cache
Low hardware complexity
The fixed mapping block → line yields a fixed replacement strategy

Cache Structures


Fully Associative Cache
Any block in main memory can be mapped to any cache line (flexibility)
A replacement strategy tells which line is to be overwritten when loading the cache (e.g. Least Recently Used)
High hardware complexity

Cache Structures


Set Associative Cache
Compromise between direct-mapped and fully associative cache
k-way set associative cache: k lines form one set; the m cache lines are divided into v = m/k sets with k lines each
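A lookup then only has to search the k lines of one set; a minimal sketch in C, with k = 4 and v = 256 chosen arbitrarily for illustration:

    #include <stdbool.h>

    #define K 4     /* associativity: lines per set */
    #define V 256   /* number of sets, v = m/k */

    typedef struct { unsigned long tag; bool valid; } line_t;
    static line_t cache[V][K];

    /* Is block j present? Its set is fixed (j mod v); within the set,
       any of the k ways may hold it. */
    bool lookup(unsigned long j)
    {
        unsigned set = j % V;
        unsigned long tag = j / V;
        for (int w = 0; w < K; w++)
            if (cache[set][w].valid && cache[set][w].tag == tag)
                return true;   /* hit */
        return false;          /* miss: load block, evict a line of this set */
    }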

Programmability


Caches have no impact (from a logical point of view):
Designed to be transparent

BUT: large performance impact:
Need to use the caches efficiently
E.g. try to reuse data in the caches

HPC applications need to be tailored to caches:
Adapt to cache sizes, cache line sizes, and hierarchies
Good understanding of the architecture required
Significant performance gains possible!

Example


Parameters (taken from a typical L1 cache):
32 KB size
Cache line size: 32 bytes
Cache has 1024 cache lines

Address format: rest (tag) | 10-bit CL select | 5-bit CL offset
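For these parameters the address decomposition can be sketched in C (direct-mapped case; the helper names are made up for illustration):

    #include <stdint.h>

    #define OFFSET_BITS 5    /* 32-byte cache lines */
    #define SELECT_BITS 10   /* 1024 cache lines */

    static inline uint32_t cl_offset(uint32_t addr) {
        return addr & ((1u << OFFSET_BITS) - 1);         /* byte within line */
    }
    static inline uint32_t cl_select(uint32_t addr) {
        return (addr >> OFFSET_BITS) & ((1u << SELECT_BITS) - 1); /* which line */
    }
    static inline uint32_t cl_tag(uint32_t addr) {
        return addr >> (OFFSET_BITS + SELECT_BITS);      /* rest: stored as tag */
    }

The following slides vary only SELECT_BITS: 9 bits for 2-way (512 sets), 8 bits for 4-way (256 sets).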

Example (cont.)


Full associativity (m-way associativity):
A new cache line can be stored in any of the 1024 CLs
Selection e.g. by Least Recently Used (LRU)

Direct mapped (1-way associativity):
A new cache line can only be placed in one defined line

2-way associativity:
Only 9 bits are used for CL selection, i.e. 512 sets
Selection within a set can again be done e.g. using LRU

Example (cont.)


4-way associativity:
Only 8 bits are used for CL selection, i.e. 256 sets
Selection can again be done e.g. using LRU

Summary: with increasing associativity:
The possible number of target CLs increases
More flexibility, less chance of unwanted evictions
BUT: implementation becomes more complex

Cache Hierarchies


Caches are layered:
Several levels of caches
Each level works independently
Transparency is still maintained
Currently up to 3 levels

CPU → L1 Cache → L2 Cache → L3 Cache → Main Memory
Higher levels: slower, but larger
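The levels can be made visible from software: chasing pointers through working sets of growing size shows a latency jump whenever a set no longer fits into a cache level. A rough sketch (sizes and iteration count are arbitrary; the sequential ring understates the effect because hardware prefetchers hide latency, so a random permutation works even better):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        for (size_t ws = 1 << 12; ws <= (size_t)1 << 26; ws <<= 1) {
            size_t n = ws / sizeof(void *);
            void **buf = malloc(n * sizeof(void *));
            for (size_t i = 0; i < n; i++)
                buf[i] = &buf[(i + 1) % n];   /* ring of dependent pointers */

            void **p = buf;
            clock_t t0 = clock();
            for (long i = 0; i < 10000000L; i++)
                p = *p;                       /* each load depends on the last */
            double s = (double)(clock() - t0) / CLOCKS_PER_SEC;

            printf("%7zu KB: %.2f ns/load (%p)\n",
                   ws / 1024, s * 1e9 / 1e7,
                   (void *)p);                /* print p so the loop is kept */
            free(buf);
        }
        return 0;
    }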

Instruction vs. Data Caches


L1 caches are often split:
I-Cache for instructions
D-Cache for data
Reduces conflicts: instruction and data accesses have significantly different access patterns

Allows additional optimizations:
In processor layout (CPU design)
Make use of the special access patterns
Example: trace caches as I-Cache, storing longer instruction sequences/traces

Cache Optimization


Why does cache architecture have an impact on performance?
Data should be reused as much as possible!
Locality of reference:
Temporal locality: recently accessed data is likely to be accessed again in the near future
Spatial locality: data located closely together is likely to be accessed closely together in time

Cache Optimization


How can this be optimized?
Code transformations: change the order in which loop iterations are executed
Must not change numerical results! Data dependencies must be maintained!

Cache Optimization


Loop interchange: turns a strided traversal into a sequential one, as sketched below

[Diagram: the same loop nest accessing the array with stride = 8 before the interchange and stride = 1 after]
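In C, with its row-major array layout, the interchange looks as follows; a minimal sketch, with the array size chosen arbitrarily:

    #define N 1024
    static double a[N][N];

    /* Before: the inner loop walks a column, so consecutive accesses are
       N elements apart and each touches a different cache line. */
    void column_wise(void)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;
    }

    /* After interchange: the inner loop walks a row with stride 1, so every
       element of a fetched cache line is used before the line is evicted. */
    void row_wise(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;
    }

The numerical result is identical (the loop body carries no dependencies between iterations), which is exactly the condition from the previous slide.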

Other Techniques


Prefetching
Try to preload data that will potentially be used
Pro: data can be requested ahead of time
Con: may waste bandwidth on loads that are never used

Controlled by hardware:
Speculative loads

Controlled by programmer / compiler:
Insert prefetching statements into the code
Traditionally this disturbed the pipeline!
Can be used well with multi-core processors with a shared cache!
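As an illustration of programmer-controlled prefetching, GCC offers the __builtin_prefetch intrinsic; a sketch, where the prefetch distance of 64 iterations is an untuned guess:

    /* Request x[i + 64] while working on x[i], so the data arrives
       in cache before it is needed. */
    void scale(double *x, long n, double s)
    {
        for (long i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&x[i + 64], 1, 1);  /* rw = 1 (write),
                                                          low temporal locality */
            x[i] *= s;
        }
    }

Whether this helps depends on the hardware prefetcher; on a simple streaming loop like this one, the hardware usually makes the hint redundant.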

Early CMPs
Intel Montecito
Intel Pentium-D
AMD Dual-Core Opteron
IBM Cell

Intel Montecito


Intel Pentium-D


Early CMPs
IBM / Sony / Toshiba Cell Processor:


1 Power Processor Element (PPE)
8 Synergistic Processing Elements (SPE)
Element Interconnect Bus (EIB), 384 GB/s
25.6 GB/s memory bandwidth
50-80 watts power consumption

Cell Processor


SUN UltraSparc T1


Eight cores, connected via a crossbar (134 GB/s)
Each core can process 4 threads
25.6 GB/s memory bandwidth
70 watts power consumption => 2 watts/thread

Trends through Multi-Core


Computers move into the chip!
New memory hierarchies ==> caches!
New interconnect topologies
Three levels of parallelism: on-chip, on-board, cluster



Contemporary Multicore Chips


Intel Clovertown/Penryn:
4 cores
Split L1 cache
Partly shared L2 cache!
FSB


Contemporary Multicore Chips


AMD Barcelona:
4 cores
Split L1/L2 caches
Shared L3 cache!
On-chip crossbar


Contemporary Multicore Chips


SUN Niagara 2:
8 cores, 4 threads/core => 32 threads
On-chip crossbar
IBM Power 5 / Power 6


Upcoming Archs: Dunnington


Core i7 (Nehalem)


731 Million Transistors, 45nm


Nehalem: Intel's Next Generation



AMD Shanghai


705 Million Transistors, 45nm



Larrabee: Intel's Many-Core Architecture



Larrabee: Intel's Many-Core Architecture

Plenty of x86 in-order cores plus standard 64-bit extensions
16-wide SIMD unit per core
Fully coherent L1 (32 KB) / L2 (256 KB) caches
Bidirectional ring bus
Short in-order pipeline
4-way SMT


Larrabee vs. Core



Larrabee: Intel's Many-Core Architecture

Shared Memory Programming Model:
Pthreads
OpenMP (see the example below)
Promises to be standard conformant
C / FORTRAN compilers

Key advantage: x86 binary compatibility!
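To illustrate the programming model named on the slide, a minimal OpenMP example in C (nothing Larrabee-specific; it compiles with any OpenMP-capable compiler):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;

        /* Iterations are distributed across all cores; the reduction
           clause combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 10000000; i++)
            sum += 1.0 / i;

        printf("partial harmonic sum: %f (max threads: %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }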



autopin:
A Tool for automatic Optimization of Pinning Processes in Multicore Architectures


Motivation: Many possibilities


can lead to non-deterministic runtimes...

Technische Universitt Mnchen

... but not necessarily!


The autopin Approach


User-level tool
Start a multi-threaded application under autopin control
User can specify pinnings of interest
Pin threads to cores (see the affinity sketch below)
Assess the performance of the chosen pinning using hardware performance counters
Try alternative pinnings until the optimal pinning is found
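The pinning step itself can be expressed with the Linux affinity API; a minimal sketch of the underlying mechanism (this is not autopin's actual source):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single core; returns 0 on success. */
    int pin_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }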


Performance Counters
Multiple event sensors:
ALU utilization
Branch prediction
Cache events (L1/L2/TLB)
Bus utilization

Two uses:
Read: get a precise count of events in code regions => counting
Interrupt on overflow => statistical sampling

Well-known tools:
OProfile
Perfctr
Intel VTune
Perfmon2


perfmon2
Kernel patch + library (libpfm)
Generic interface for PMU access
Portable: implementations for IA32, x64, IA64, MIPS, Power
Allows per-thread and system-wide monitoring
Support for counting and sampling

pfmon:
Attach to running threads
Fork new processes and attach to them
Fully exploit the performance counters
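As a hedged sketch of the same counting use with today's kernel interface (perf_event, which was eventually merged into mainline Linux instead of the perfmon2 patches), counting cache misses around a code region:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* Monitor the calling thread (pid = 0) on any CPU (cpu = -1). */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... code region of interest ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }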

autopin Strategy


numOfPinnings = 3;
pinning = {"1984", "182B", "58BE"};   /* candidate pinnings to test */
for (i = 0; i < numOfPinnings; i++) {
    pinThreads(pinning[i]);           /* apply the i-th pinning */
    runThreads(warmupTime);           /* warm-up phase: let caches fill */
    p1 = readPerformanceCounter();
    runThreads(sampleTime);           /* measurement phase */
    p2 = readPerformanceCounter();
    performanceRate[i] = (p2 - p1) / sampleTime;
}
pinThreads(bestPinning);              /* keep the best-performing pinning */


Experimental Setup
Caneland:
Intel Tigerton: quad-core, 2x4 MB L2 per socket, 2.93 GHz clock rate
4-way, 4x1066 MHz FSB, 64 MB snoop filter, UMA

Clovertown:
Intel Clovertown: quad-core, 2x4 MB L2, 2.66 GHz clock rate
2-way, 1x1333 MHz FSB, UMA

Barcelona:
AMD K10: quad-core, 4x512 kB L2, 1x2 MB L3, 1.9 GHz clock rate
2-way, 1000 MHz HyperTransport, NUMA

Linux kernel 2.6.23 with perfmon2 patches
SPEC OMP benchmarks
Intel Compiler Suite


Caneland


Barcelona


Results
[Table: results for the SPEC OMP benchmarks 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, and 332.ammp with 2, 4, and 8 threads on Caneland and 2 and 4 threads on Clovertown and Barcelona]


Conclusions and Outlook


Single-core, non-parallel systems will disappear
Profound knowledge of parallel programming will be required to fully exploit multi- and manycore systems
In addition, knowledge about
caches / cache hierarchies,
on-chip interconnect, and
memory hierarchies
is essential!


Conclusions and Outlook 2


Pinning is essential!

We will see more and more GPU features in main processors!


Gracias! Thank you!
