Multicore Architectures
CLGrid5 Workshop Valparaiso, Chile
September 29th, 2008
Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Institut für Informatik, Technische Universität München
Zuse Z3 & Z4
Z1 (1938) and Z3 (1941): the first freely programmable machines in the world. The Z3 and its successor Z4 can be seen at the Deutsches Museum!
Principle
Basic element: the triode. The current flow through the tube is controlled by a grid: on / off.
ENIAC (1946)
Organization Question:
How to structure / organize computational machines? How to control and steer execution?
Result:
von Neumann architecture: the most dominant architecture even today!
Technology Development
Vacuum tubes replaced by transistors: smaller, more power efficient.
DEC PDP-1, IBM 7094: still large machines.
Intel Montecito
28.09.08
Core i7 (Nehalem)
AMD Shanghai
Intel Larrabee
Multi-core array: CMP with ~10 cores; dual core; symmetric multithreading.
Many-core array: CMP with 10s-100s of low-power scalar cores; capable of TFLOPS+; full system-on-chip; servers, workstations, embedded.
Evolution
[Diagram: sequential execution: each instruction passes through the stages FI, DI, FD, PD, WD one after the other]
Pipelined Execution
[Diagram: pipelined execution: the stages FI, DI, FD, PD, WD of consecutive instructions overlap; once the pipeline is full, one instruction completes per cycle]
FI = Fetch Instruction, DI = Decode Instruction, FD = Fetch Data, PD = Process Data, WD = Write Data
Superscalar Execution
[Diagram: superscalar execution: several instructions enter each pipeline stage in parallel]
VLIW Execution
[Diagram: VLIW execution: one long instruction word issues several operations in parallel]
Netburst: >30 pipeline stages. No longer feasible...
2005: move to dual core (and fewer pipeline stages).
2, 4, 6, 8, ... cores.
But: the free lunch is over!
The good news is: this is good for parallel programmers.
Impact
What does multi-core mean in particular? Is it just an SMP system, i.e. programmable with OpenMP, Pthreads, etc.? Or does it differ from SMP systems? How do multi-core systems fit into clusters?
Is Multi-Core different?
Performance
[Figure: performance over time: CPU performance grows far faster than memory access speed, so the memory gap keeps widening]
Caches
Cache: memory closer to the CPU which holds a subset of main memory.
+ Lower latency, higher bandwidth, on-chip
- Which subset should be present?
- Can we manage this transparently?
Memory Hierarchy
[Figure: the memory hierarchy as a compromise between speed and capacity]
With Caches
[Figure: without a cache, the CPU accesses slow main memory directly; with a cache, the fast CPU hits a small cache that redundantly holds part of the slow main memory]
Terminology
Accesses to memory can be a
- Cache hit: the data is in the cache
- Cache miss: the data has to be retrieved from memory. Cache misses are expensive!
Cache size: total size of the cache.
Cache line size/length: caches do not store individual bytes/words (management overhead would be too high). The unit of storage is the cache line: a consecutive range of bytes in memory.
Terminology
Replacement policy: which cache line to evict if new space is needed?
- Optimal: the line not used in the near future
- In practice: make a prediction from the past; often used: least recently used (LRU)
How are writes treated? Write-back caching:
- Writes are stored in the cache
- Data is written back to memory when the line is evicted
Cache Associativity
Cache Structures
Where is a block stored in the cache?
Main memory consists of blocks B_j, j = 0, 1, ..., (n-1).
The cache consists of cache lines Z_i, i = 0, 1, ..., (m-1).
A mapping assigns memory blocks {B_j} to cache lines {Z_i}.
Capacity: m * b = 2^(r+w) words.
Cache Structures
Direct-Mapped Cache
Direct mapping of n/m = 2^(s-r) memory blocks onto one cache line:
Mapping: B_j --> Z_i, where i = j mod m
[Figure: each cache line Z0 ... Z3 stores a tag identifying which memory block it currently holds]
Cache Structures
Direct-Mapped Cache
Low hardware complexity. The fixed block-to-line mapping implies a fixed replacement strategy.
Cache Structures
Fully Associative Cache
Any block in main memory can be mapped to any cache line (flexibility).
A replacement strategy decides which line is overwritten when loading the cache (e.g. Least Recently Used).
High hardware complexity.
Cache Structures
Set Associative Cache
Compromise between direct-mapped and fully associative caches.
k-way set associative cache: k lines form one set; the m cache lines are divided into v = m/k sets of k lines each.
Programmability
Example
Address format:
| rest (tag) | 10-bit CL select | 5-bit CL offset |
i.e. 32-byte cache lines and 1024 cache lines in total.
Example (cont.)
2-way associativity
Only 9 bits are used for CL (set) selection, i.e. 512 sets. Selecting the line within a set can again be done e.g. using LRU.
Example (cont.)
4-way associativity
Only 8 bits are used for CL (set) selection, i.e. 256 sets. Selection within a set can again be done e.g. using LRU.
Cache Hierarchies
CPU --> L1 Cache --> L2 Cache --> L3 Cache --> Main Memory
Higher levels: slower, but larger.
The additional levels reduce conflicts; the levels see significantly different access patterns.
Cache Optimization
Loop interchange: swap the nested loops so that the innermost loop walks memory with stride 1 instead of stride 8, keeping consecutive accesses within the same cache line.
Other Techniques
Prefetching
Try to preload data that will potentially be used.
Pro: data can be requested ahead of time.
Con: may waste bandwidth on loads that are never used.
Controlled by Hardware
Speculative loads
Early CMPs
Intel Montecito Intel Pentium-D AMD Dual Core Opteron IBM Cell
Intel Montecito
Intel Pentium-D
Early CMPs
IBM / Sony / Toshiba Cell Processor:
1 Power Processor Element (PPE)
8 Synergistic Processing Elements (SPEs)
Element Interconnect Bus (EIB), 384 GB/s
25.6 GB/s memory bandwidth
50-80 W power consumption
Cell Processor
SUN UltraSparc T1
Eight cores, connected via a 134 GB/s crossbar
Each core can process 4 threads
25.6 GB/s memory bandwidth
70 W power consumption => ~2 W per thread
Computers move into chip! New memory hierarchies ==> Caches! New interconnect topologies. Three levels of parallelism: On-chip On-board Cluster
AMD Barcelona: 4 cores
Split (per-core) L1/L2 caches
Shared L3 cache!
On-chip crossbar
Core i7 (Nehalem)
AMD Shanghai
Plenty of x86 in-order cores plus standard 64-bit extensions
16-wide SIMD unit per core
Fully coherent L1 (32 KB) / L2 (256 KB) caches
Bidirectional ring bus
Short in-order pipeline
4-way SMT
autopin:
A Tool for Automatic Optimization of Process Pinning in Multicore Architectures
Performance Counters
Multiple Event Sensors
ALU Utilization Branch Prediction Cache Events (L1/L2/TLB) Bus Utilization
Two Uses:
Read: get a precise count of events in code regions => counting
Interrupt on overflow => statistical sampling
Well-known tools:
Oprofile Perfctr Intel Vtune Perfmon2
perfmon2
Kernel patch + library (libpfm)
Generic interface for PMU access
Portable: implementations for IA32, x64, IA64, MIPS, Power
Allows for per-thread and system-wide monitoring
Support for counting and sampling
pfmon can:
- attach to running threads
- fork new processes and attach to them
- fully exploit the performance counters
autopin Strategy
numOfPinnings = 3;
pinning = {"1984", "182B", "58BE"};
for (i = 0; i < numOfPinnings; i++) {
    pinThreads(pinning[i]);
    runThreads(warmupTime);
    p1 = readPerformanceCounter();
    runThreads(sampleTime);
    p2 = readPerformanceCounter();
    performanceRate[i] = (p2 - p1) / sampleTime;
}
pinThreads(bestPinning);
Experimental Setup
Caneland:
Intel Tigerton: quad-core, 2x4 MB L2 per socket, 2.93 GHz clock rate
4-way, 4x1066 MHz FSB, 64 MB snoop filter, UMA
Clovertown:
Intel Clovertown: Quad-Core, 2x4MB L2, 2.66GHz clock rate 2-way, 1x1333MHz FSB, UMA
Barcelona:
AMD K10: quad-core, 4x512 kB L2, 1x2 MB L3, 1.9 GHz clock rate
2-way, 1000 MHz HyperTransport, NUMA
Linux Kernel 2.6.23 with perfmon2 patches SPEC OMP Benchmark Intel Compiler Suite
Caneland
Barcelona
Results
[Table: SPEC OMP results for 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, and 332.ammp with 2/4/8 threads on Caneland and 2/4 threads on Clovertown and Barcelona]
=> Pinning is essential!