
Introduction to Metrics, Applications and Architectures

Parallel and Distributed Computing, Department of Computer Science and Engineering (DEI), Instituto Superior Técnico

September 14, 2011

José Monteiro (DEI / IST), Parallel and Distributed Computing, 2011-11-14

Outline

- Simple Example: Opportunities for Parallelism
- Speedup and Overheads
- Application Areas
- Parallel Systems

Simple Example: Opportunities for Parallelism

    x = initX(A, B);
    y = initY(A, B);
    z = initZ(A, B);

    for (i = 0; i < N_ENTRIES; i++)
        x[i] = compX(y[i], z[i]);

    for (i = 1; i < N_ENTRIES; i++) {
        x[i] = solveX(x[i-1]);
        z[i] = x[i] + y[i];
    }

    finalize1(&x, &y, &z);
    finalize2(&x, &y, &z);
    finalize3(&x, &y, &z);

- Functional Parallelism
- Data Parallelism
- Pipelining

No good?



How Much Faster?

Speedup:

    S = t_serial / t_parallel

Ideal speedup with p processors? S = p (when t_parallel = t_serial / p)

Expected speedup? S < p

Can't we get superlinear speedup, S > p? Yes!
- increased efficiency in memory access
- some specific problems (for example, search)


Limitations for Ideal Speedup

Overheads that limit parallel speedup?

- data transfers (or, more generally, communication among tasks)
- task startup / finalization
- load balancing
- inherently sequential portions of the computation


Effect of Sequential Fraction

If a fraction f of the computation is inherently sequential:

    t_parallel = f * t_serial + (1 - f) * t_serial / p

    S(p, f) = 1 / (f + (1 - f) / p)

    lim (p -> infinity) S(p, f) = 1 / f

[Figure: speedup S(p, f) vs. number of processors, for f = 0%, 5%, 10%, 20%]


Difficulties of Parallel Programming

Algorithm development is harder:
- define and coordinate concurrent tasks

Software programming is more complex:
- low-level parallel directives
- debugging is significantly more difficult
- lack of programming models and environments

Rapid pace of change in computer system architecture:
- a parallel algorithm may not be efficient for the next generation of parallel computers


Application Areas

Why bother with parallel computation? Continued demand for greater computational power from many different domains!

Two major classes of problems in parallel computation:

Grand Challenge problems
- Problems that cannot be solved in a reasonable amount of time with today's computers.

Embarrassingly Parallel problems
- Problems whose workload can be easily divided into (almost) independent tasks.

Grand Challenge problems

- Global Environmental/Ecosystem Modeling
- Biomechanics and biomedical imaging
- Fluid dynamics
- Molecular nanotechnology
- Nuclear power and weapons simulations

Embarrassingly Parallel problems

- Numerical weather forecasting
- Computer graphics / animation
- Basic Local Alignment Search Tool (BLAST) in bioinformatics
- Monte Carlo methods
- Genetic algorithms

Example: Weather Forecasting

- The atmosphere is modeled by dividing it into three-dimensional cells (e.g., 1 km^3).
- Time is discretized into intervals (1 second, 1 minute, 1 hour).
- Atmospheric conditions (temperature, pressure, humidity, etc.) for each cell are computed as a function of the neighboring cells' conditions in this and previous time intervals.


Example: Weather Forecasting

- For the forecast of continental Portugal, take an area of 1000 km × 500 km = 5 × 10^5 km^2.
- Assuming an atmosphere height of 50 km, there are 25 × 10^6 cells.
- If each cell takes 200 floating-point operations, we require a total of 5 × 10^9 operations for each time interval.
- If the time interval is one second and we want to compute the forecast for tomorrow (almost 10^5 seconds in a day), that is a total of 5 × 10^14 operations.
- An Intel Pentium IV at 3.2 GHz performs at about 3 GFLOPS, hence taking about 40 hours...

Example: n-Body Problem

- Each body has a position, velocity, and acceleration that need to be computed for every time interval.
- Each body attracts (and/or repels) every other body: for n bodies, there are a total of n^2 interactions that need to be accounted for.
- Example: a galaxy has more than 10^11 stars, leading to more than 10^22 floating-point operations for each time interval!

Processor Evolution


Supercomputer Evolution


Top 10 Supercomputers (June 2011)


Projected Performance Development

First PetaFLOPS system available in 2009!

Estimate of the human brain's computational power: 10^14 neural connections at 200 calculations per second => 20 PFLOPS.


Types of Supercomputers

Processor Arrays (SIMD)
- Name associated with vector processing; very popular in early supercomputers.

Multicore (SMP)
- Set of processors sharing a common main memory.

Massively Parallel Processors (MPP)
- Processors with individual main memory and tightly coupled interconnections.

Clusters
- Processors with individual main memory, linked together using InfiniBand, Quadrics, Myrinet, or Gigabit Ethernet connections.
- COW / NOW: Cluster / Network Of Workstations
- Beowulf: cluster made of PCs running Linux using TCP/IP (COTS: Commodity Off-The-Shelf)

Constellation
- MPP / cluster where each node is a multicore.

Evolution of Types of Supercomputers


Evolution of Dominant Companies


Hierarchy of Computational Power


Warehouse-size Computers



Multicores

A sample of today's multicore processors:

AMD
- Opteron: dual-, quad-, hex-, 8-, and 12-core
- Phenom: dual-, quad-, and hex-core

Intel
- Core i7: six hyperthreaded cores
- Dunnington (Xeon): six cores

Sun
- Niagara: 8 cores; 8-way fine-grain multithreading per core

IBM
- POWER7: dual-, quad-, hex-, and 8-core
- Cell: 1 PPC core; 8 SPEs with SIMD parallelism

Next Class

- technologies for parallel programming
- models for computer architecture
- review of computer architecture
- levels of parallelism