Multicore Architectures
CLGrid5 Workshop Valparaiso, Chile
September 29th, 2008
Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Institut für Informatik, Technische Universität München
Zuse Z3 & Z4
Z1 (1938) and Z3 (1941): the first freely programmable machines in the world. The Z3 and its successor Z4 can be seen at the Deutsches Museum!
Principle
Basic element: the triode. The current flow through the tube is controlled by a grid: on / off.
ENIAC (1946)
Organization Question:
How to structure / organize computational machines? How to control and steer execution?
Result:
von Neumann architecture: the most dominant architecture even today!
Technology Development
Vacuum tubes replaced by transistors: smaller, more power efficient.
DEC PDP-1, IBM 7094: still large machines.
Intel Montecito
28.09.08
Core i7 (Nehalem)
AMD Shanghai
Intel Larrabee
Multi-core array: CMP with ~10 cores; dual core; symmetric multithreading.
Many-core array: CMP with 10s-100s of low-power scalar cores; capable of TFLOPS+; full system-on-chip; servers, workstations, embedded.
Evolution
[Diagram: sequential execution: each instruction passes through the stages FI, DI, FD, PD, WD one after the other]
Pipelined Execution
[Diagram: pipelined execution: the stages FI, DI, FD, PD, WD of consecutive instructions overlap; once the pipeline is full, one instruction completes per cycle]
FI = Fetch Instruction, DI = Decode Instruction, FD = Fetch Data, PD = Process Data, WD = Write Data
Superscalar Execution
[Diagram: superscalar execution: several instructions enter each pipeline stage in parallel]
VLIW Execution
[Diagram: VLIW execution: one long instruction word issues several operations in parallel]
Netburst: >30 pipeline stages. No longer feasible...
2005: move to dual core (and fewer pipeline stages).
2, 4, 6, 8, ... cores.
But: the free lunch is over!
The good news is: this is good for parallel programmers.
Impact
What does multi-core mean in particular? Is it just an SMP system, i.e. programmable with OpenMP, Pthreads, etc.? Or does it differ from SMP systems? How do multi-core systems fit into clusters?
Is Multi-Core different?
Performance
[Figure: performance over time: CPU performance grows far faster than memory access speed, so the memory gap keeps widening]
Caches
Cache: memory closer to the CPU which holds a subset of main memory.
+ Lower latency, higher bandwidth, on-chip
- Which subset should be present?
- Can we manage this transparently?
Memory Hierarchy
[Figure: the memory hierarchy as a compromise between speed and capacity]
With Caches
[Figure: without a cache, the CPU accesses slow main memory directly; with a cache, the fast CPU hits a small cache that redundantly holds part of the slow main memory]
Terminology
Accesses to memory can be a
- Cache hit: the data is in the cache
- Cache miss: the data has to be retrieved from memory. Cache misses are expensive!
Cache size: total size of the cache.
Cache line size/length: caches do not store individual bytes/words (management overhead would be too high). The unit of storage is the cache line: a consecutive range of bytes in memory.
Terminology
Replacement policy: which cache line to evict if new space is needed?
- Optimal: the line not used in the near future
- In practice: make a prediction from the past; often used: least recently used (LRU)
How are writes treated? Write-back caching:
- Writes are stored in the cache
- Data is written back to memory when the line is evicted
Cache Associativity
Cache Structures
Where is a block stored in the cache?
Main memory consists of blocks B_j, j = 0, 1, ..., (n-1).
The cache consists of cache lines Z_i, i = 0, 1, ..., (m-1).
A mapping assigns memory blocks {B_j} to cache lines {Z_i}.
Capacity: m * b = 2^(r+w) words.
Cache Structures
Direct-Mapped Cache
Direct mapping of n/m = 2^(s-r) memory blocks onto one cache line:
Mapping: B_j --> Z_i, where i = j mod m
[Figure: each cache line Z0 ... Z3 stores a tag identifying which memory block it currently holds]
Cache Structures
Direct-Mapped Cache
Low hardware complexity. The fixed block-to-line mapping implies a fixed replacement strategy.
Cache Structures
Fully Associative Cache
Any block in main memory can be mapped to any cache line (flexibility).
A replacement strategy decides which line is overwritten when loading the cache (e.g. Least Recently Used).
High hardware complexity.
Cache Structures
Set Associative Cache
Compromise between direct-mapped and fully associative caches.
k-way set associative cache: k lines form one set; the m cache lines are divided into v = m/k sets of k lines each.
Programmability
Example
Address format:
| rest (tag) | 10-bit CL select | 5-bit CL offset |
i.e. 32-byte cache lines and 1024 cache lines in total.
Example (cont.)
2-way associativity
Only 9 bits are used for CL (set) selection, i.e. 512 sets. Selecting the line within a set can again be done e.g. using LRU.
Example (cont.)
4-way associativity
Only 8 bits are used for CL (set) selection, i.e. 256 sets. Selection within a set can again be done e.g. using LRU.
Cache Hierarchies
CPU --> L1 Cache --> L2 Cache --> L3 Cache --> Main Memory
Higher levels: slower, but larger.
The additional levels reduce conflicts; the levels see significantly different access patterns.
Cache Optimization
Loop interchange: swap the nested loops so that the innermost loop walks memory with stride 1 instead of stride 8, keeping consecutive accesses within the same cache line.
Other Techniques
Prefetching
Try to preload data that will potentially be used.
Pro: data can be requested ahead of time.
Con: may waste bandwidth on loads that are never used.
Controlled by Hardware
Speculative loads
Early CMPs
Intel Montecito Intel Pentium-D AMD Dual Core Opteron IBM Cell
Intel Montecito
Intel Pentium-D
Early CMPs
IBM / Sony / Toshiba Cell Processor:
1 Power Processor Element (PPE)
8 Synergistic Processing Elements (SPEs)
Element Interconnect Bus (EIB), 384 GB/s
25.6 GB/s memory bandwidth
50-80 W power consumption
Cell Processor
SUN UltraSparc T1
Eight cores, connected via a 134 GB/s crossbar
Each core can process 4 threads
25.6 GB/s memory bandwidth
70 W power consumption => ~2 W per thread
Computers move into chip! New memory hierarchies ==> Caches! New interconnect topologies. Three levels of parallelism: On-chip On-board Cluster
AMD Barcelona: 4 cores
Split (per-core) L1/L2 caches
Shared L3 cache!
On-chip crossbar
Core i7 (Nehalem)
AMD Shanghai
Plenty of x86 in-order cores plus standard 64-bit extensions
16-wide SIMD unit per core
Fully coherent L1 (32 KB) / L2 (256 KB) caches
Bidirectional ring bus
Short in-order pipeline
4-way SMT
autopin:
A Tool for Automatic Optimization of Process Pinning in Multicore Architectures
Performance Counters
Multiple Event Sensors
ALU Utilization Branch Prediction Cache Events (L1/L2/TLB) Bus Utilization
Two Uses:
Read: get a precise count of events in code regions => counting
Interrupt on overflow => statistical sampling
Well-known tools:
Oprofile Perfctr Intel Vtune Perfmon2
perfmon2
Kernel patch + library (libpfm)
Generic interface for PMU access
Portable: implementations for IA32, x64, IA64, MIPS, Power
Allows for per-thread and system-wide monitoring
Support for counting and sampling
pfmon can:
- attach to running threads
- fork new processes and attach to them
- fully exploit the performance counters
autopin Strategy
numOfPinnings = 3;
pinning = {"1984", "182B", "58BE"};
for (i = 0; i < numOfPinnings; i++) {
    pinThreads(pinning[i]);
    runThreads(warmupTime);
    p1 = readPerformanceCounter();
    runThreads(sampleTime);
    p2 = readPerformanceCounter();
    performanceRate[i] = (p2 - p1) / sampleTime;
}
pinThreads(bestPinning);
Experimental Setup
Caneland:
Intel Tigerton: quad-core, 2x4 MB L2 per socket, 2.93 GHz clock rate
4-way, 4x1066 MHz FSB, 64 MB snoop filter, UMA
Clovertown:
Intel Clovertown: Quad-Core, 2x4MB L2, 2.66GHz clock rate 2-way, 1x1333MHz FSB, UMA
Barcelona:
AMD K10: quad-core, 4x512 kB L2, 1x2 MB L3, 1.9 GHz clock rate
2-way, 1000 MHz HyperTransport, NUMA
Linux Kernel 2.6.23 with perfmon2 patches SPEC OMP Benchmark Intel Compiler Suite
Caneland
Barcelona
Results
[Table: SPEC OMP results for 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, and 332.ammp with 2/4/8 threads on Caneland and 2/4 threads on Clovertown and Barcelona]
=> Pinning is essential!