
Morgan Kaufmann Publishers

COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
5th Edition

Chapter 6
Parallel Processors from Client to Cloud

6.1 Introduction

- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Task-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)


Hardware and Software

- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
- Challenge: making effective use of parallel hardware

What We've Already Covered

- 2.11: Parallelism and Instructions
  - Synchronization
- 3.6: Parallelism and Computer Arithmetic
  - Subword Parallelism
- 4.10: Parallelism and Advanced Instruction-Level Parallelism
- 5.10: Parallelism and Memory Hierarchies
  - Cache Coherence


6.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming

- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law

- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - Tnew = Tparallelizable/100 + Tsequential
  - Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  - Solving: Fparallelizable = 0.999
- Need sequential part to be 0.1% of original time


Scaling Example

- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × tadd
- 10 processors
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

Scaling Example (cont)

- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × tadd
- 10 processors
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced


Strong vs Weak Scaling

- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × tadd
  - 100 processors, 32 × 32 matrix
    - Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  - Constant performance in this example

6.3 SISD, MIMD, SIMD, SPMD, and Vector

Instruction and Data Streams

- An alternate classification:

                              Data Streams
                              Single                  Multiple
  Instruction   Single        SISD:                   SIMD:
  Streams                     Intel Pentium 4         SSE instructions of x86
                Multiple      MISD:                   MIMD:
                              No examples today       Intel Xeon e5345

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors


Example: DAXPY (Y = a × X + Y)

- Conventional MIPS code:

          l.d    $f0,a($sp)       ;load scalar a
          addiu  r4,$s0,#512      ;upper bound of what to load
    loop: l.d    $f2,0($s0)       ;load x(i)
          mul.d  $f2,$f2,$f0      ;a × x(i)
          l.d    $f4,0($s1)       ;load y(i)
          add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
          s.d    $f4,0($s1)       ;store into y(i)
          addiu  $s0,$s0,#8       ;increment index to x
          addiu  $s1,$s1,#8       ;increment index to y
          subu   $t0,r4,$s0       ;compute bound
          bne    $t0,$zero,loop   ;check if done

- Vector MIPS code:

          l.d     $f0,a($sp)      ;load scalar a
          lv      $v1,0($s0)      ;load vector x
          mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
          lv      $v3,0($s1)      ;load vector y
          addv.d  $v4,$v2,$v3     ;add y to product
          sv      $v4,0($s1)      ;store the result
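
For reference, both instruction sequences implement the DAXPY loop below, shown in C for the 64 double-precision elements implied by the 512-byte bound in the scalar code (this C version is an aid, not from the slide):

    /* DAXPY: Y = a*X + Y over 64 doubles (64 * 8 bytes = 512 bytes) */
    void daxpy(double a, const double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }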

Vector Processors

- Highly pipelined function units
- Stream data from/to vector registers to units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: Vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of vector of double
- Significantly reduces instruction-fetch bandwidth


Vector vs. Scalar

- Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of absence of loop-carried dependences
    - Reduced checking in hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
- More general than ad-hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology

SIMD

- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
    - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
  - Each with different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications
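
To make the 128-bit registers concrete, a minimal C sketch using SSE intrinsics (the function name is illustrative; n is assumed to be a multiple of 4):

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two float arrays elementwise, four lanes per instruction. */
    void add_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);           /* load 4 floats */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* one SIMD add */
        }
    }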


Vector vs. Multimedia Extensions

- Vector instructions have a variable vector width, multimedia extensions have a fixed width
- Vector instructions support strided access, multimedia extensions do not
- Vector units can be a combination of pipelined and arrayed functional units

6.4 Hardware Multithreading

Multithreading

- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)


Simultaneous Multithreading

- In multiple-issue dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Multithreading Example



Future of Multithreading

- Will it survive? In what form?
- Power considerations → simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
  - Thread switch may be most effective
- Multiple simple cores might share resources more effectively

6.5 Multicore and Other Shared Memory Multiprocessors

Shared Memory

- SMP: shared memory multiprocessor
  - Hardware provides single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time
    - UMA (uniform) vs. NUMA (nonuniform)


Example: Sum Reduction

- Sum 100,000 numbers on 100 processor UMA
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition 1000 numbers per processor
  - Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, ...
  - Need to synchronize between reduction steps

Example: Sum Reduction

    half = 100;
    repeat
        synch();
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               Processor0 gets missing element */
        half = half/2; /* dividing line on who sums */
        if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
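
The code above runs on every processor, with synch() acting as a barrier. As a sanity check, a single-threaded C sketch can simulate the same schedule (an illustration, not the book's code: the barrier becomes implicit, and each processor's "if (Pn < half)" becomes a loop over Pn):

    #include <stdio.h>

    #define P 100

    /* Simulate the barrier-synchronized reduction: at each step every
       processor Pn below `half` adds its partner's partial sum. */
    double reduce(double sum[P]) {
        int half = P;
        do {
            /* synch(): all processors reach this point before continuing */
            if (half % 2 != 0)            /* half is odd: processor 0   */
                sum[0] += sum[half - 1];  /* picks up the extra element */
            half = half / 2;              /* dividing line on who sums  */
            for (int Pn = 0; Pn < half; Pn++)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
        return sum[0];
    }

    int main(void) {
        double sum[P];
        for (int Pn = 0; Pn < P; Pn++) sum[Pn] = 1.0; /* stand-in partial sums */
        printf("total = %.0f\n", reduce(sum));        /* prints 100 */
        return 0;
    }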


6.6 Introduction to Graphics Processing Units

History of GPUs

- Early video cards
  - Frame buffer memory with address generation for video output
- 3D graphics processing
  - Originally high-end computers (e.g., SGI)
  - Moore's Law → lower cost, higher density
  - 3D graphics cards for PCs and game consoles
- Graphics Processing Units
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization

Graphics in the System



GPU Architectures

- Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)

Example: NVIDIA Tesla

(figure: a streaming multiprocessor containing 8 streaming processors)


Example: NVIDIA Tesla

- Streaming Processors
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: group of 32 threads
  - Executed in parallel, SIMD style
    - 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, ...

Putting GPUs into Perspective

Feature                                                      Multicore with SIMD   GPU
SIMD processors                                              4 to 8                8 to 16
SIMD lanes/processor                                         2 to 4                8 to 16
Multithreading hardware support for SIMD threads             2 to 4                16 to 32
Typical ratio of single- to double-precision performance     2:1                   2:1
Largest cache size                                           8 MB                  0.75 MB
Size of memory address                                       64-bit                64-bit
Size of main memory                                          8 GB to 256 GB        4 GB to 6 GB
Memory protection at level of page                           Yes                   Yes
Demand paging                                                Yes                   No
Integrated scalar processor/SIMD processor                   Yes                   No
Cache coherent                                               Yes                   No


Guide to GPU Terms


6.7 Clusters, WSC, and Other Message-Passing MPs

Message Passing

- Each processor has private physical address space
- Hardware sends/receives messages between processors


Loosely Coupled Clusters

- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - c.f. processor/memory bandwidth on an SMP

Sum Reduction (Again)

- Sum 100,000 on 100 processors
- First distribute 1000 numbers to each
  - Then do partial sums:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];

- Reduction
  - Half the processors send, other half receive and add
  - Then a quarter send, a quarter receive and add, ...


Sum Reduction (Again)

- Given send() and receive() operations:

    limit = 100; half = 100; /* 100 processors */
    repeat
        half = (half+1)/2;   /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);
        if (Pn < (limit/2))
            sum = sum + receive();
        limit = half;        /* upper limit of senders */
    until (half == 1);       /* exit with final sum */

- Send/receive also provide synchronization
- Assumes send/receive take similar time to addition
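
A runnable translation of this schedule, mapping the slide's abstract send()/receive() onto MPI point-to-point calls (the MPI mapping and the stand-in partial sums are assumptions, not part of the slide):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int Pn, limit;
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
        MPI_Comm_size(MPI_COMM_WORLD, &limit);

        double sum = Pn + 1.0;      /* stand-in for this rank's partial sum */
        int half = limit;
        do {
            half = (half + 1) / 2;  /* send vs. receive dividing line */
            if (Pn >= half && Pn < limit)
                MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
            if (Pn < limit / 2) {
                double incoming;
                MPI_Recv(&incoming, 1, MPI_DOUBLE, Pn + half, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += incoming;
            }
            limit = half;           /* upper limit of senders */
        } while (half > 1);

        if (Pn == 0) printf("final sum = %f\n", sum); /* rank 0 holds it */
        MPI_Finalize();
        return 0;
    }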

Grid Computing

- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid


6.8 Introduction to Multiprocessor Network Topologies

Interconnection Networks

- Network topologies
  - Arrangements of processors, switches, and links

(figures: Bus, Ring, 2D Mesh, N-cube (N = 3), Fully connected)

Multistage Networks



Network Characteristics

- Performance
  - Latency per message (unloaded network)
  - Throughput
    - Link bandwidth
    - Total network bandwidth
    - Bisection bandwidth
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon
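
For two of the topologies shown earlier, the usual textbook definitions (total bandwidth = number of links; bisection bandwidth = links crossing a cut that splits the machine in half) can be tabulated; a small C sketch, in units of one link's bandwidth (these formulas are standard definitions, not shown on this slide):

    #include <stdio.h>

    /* Bandwidths in multiples of a single link's bandwidth. */
    void ring(int p) { printf("ring: total %d, bisection %d\n", p, 2); }
    void full(int p) { printf("full: total %d, bisection %d\n",
                              p * (p - 1) / 2, (p / 2) * (p / 2)); }

    int main(void) {
        int p = 64;  /* example processor count */
        ring(p);     /* 64 links; cutting the ring severs 2 */
        full(p);     /* 2016 links; bisection crosses 32*32 = 1024 */
        return 0;
    }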

Code or Applications?

- Traditional benchmarks
  - Fixed code and data sets
- Parallel programming is evolving
  - Should algorithms, programming languages, and tools be part of the system?
  - Compare systems, provided they implement a given application
  - E.g., Linpack, Berkeley Design Patterns
- Would foster innovation in approaches to parallelism


Optimizing Performance

- Optimize FP performance
  - Balance adds & multiplies
  - Improve superscalar ILP and use of SIMD instructions
- Optimize memory usage
  - Software prefetch
    - Avoid load stalls
  - Memory affinity
    - Avoid non-local data accesses

6.12 Going Faster: Multiple Processors and Matrix Multiply

Multi-threading DGEMM

- Use OpenMP:

    void dgemm (int n, double* A, double* B, double* C)
    {
        #pragma omp parallel for
        for ( int sj = 0; sj < n; sj += BLOCKSIZE )
            for ( int si = 0; si < n; si += BLOCKSIZE )
                for ( int sk = 0; sk < n; sk += BLOCKSIZE )
                    do_block(n, si, sj, sk, A, B, C);
    }
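
The slide reuses do_block from the cache-blocked DGEMM developed in Chapter 5; a sketch consistent with that version (column-major indexing; BLOCKSIZE, set here to 32, is assumed to divide n):

    #define BLOCKSIZE 32

    /* One BLOCKSIZE × BLOCKSIZE sub-block of C += A * B
       (column-major layout, as in the Chapter 5 blocked DGEMM). */
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C)
    {
        for (int i = si; i < si + BLOCKSIZE; ++i)
            for (int j = sj; j < sj + BLOCKSIZE; ++j) {
                double cij = C[i + j * n];
                for (int k = sk; k < sk + BLOCKSIZE; ++k)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }

Compiled with OpenMP enabled (e.g., gcc -fopenmp), the pragma distributes the sj iterations across threads; each thread updates disjoint column blocks of C, so no locking is needed.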


Multithreaded DGEMM

(figure slides)


6.13 Fallacies and Pitfalls

Fallacies

- Amdahl's Law doesn't apply to parallel computers
  - Since we can achieve linear speedup
  - But only on applications with weak scaling
- Peak performance tracks observed performance
  - Marketers like this approach!
  - But compare Xeon with others in example
  - Need to be aware of bottlenecks

Pitfalls

- Not developing the software to take account of a multiprocessor architecture
- Example: using a single lock for a shared composite resource
  - Serializes accesses, even if they could be done in parallel
  - Use finer-granularity locking (a sketch follows below)
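
As an illustration, a minimal pthreads sketch (the hash-table shape and all names are illustrative, not from the slide): one mutex per bucket lets operations on different buckets proceed in parallel, where a single table-wide lock would serialize them.

    #include <pthread.h>

    #define NBUCKETS 64

    /* Finer granularity: one lock per bucket instead of a single
       lock serializing every access to the whole table. */
    struct bucket {
        pthread_mutex_t lock;
        long count;              /* stand-in for per-bucket data */
    } table[NBUCKETS];

    void table_init(void) {
        for (int i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&table[i].lock, NULL);
    }

    void increment(int key) {
        struct bucket *b = &table[key % NBUCKETS];
        pthread_mutex_lock(&b->lock);   /* contends only with same-bucket ops */
        b->count++;
        pthread_mutex_unlock(&b->lock);
    }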


6.14 Concluding Remarks

Concluding Remarks

- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- SaaS importance is growing and clusters are a good match
- Performance per dollar and performance per Joule drive both mobile and WSC

Concluding Remarks (cont)

- SIMD and vector operations match multimedia applications and are easy to program

