
# Introduction to Metrics, Applications and Architectures

Parallel and Distributed Computing, Department of Computer Science and Engineering (DEI), Instituto Superior Técnico

## Parallel and Distributed Computing 2

2011-11-14


## Outline

- Simple Example: Opportunities for Parallelism
- Speedup and Overheads
- Application Areas
- Parallel Systems


## Simple Example: Opportunities for Parallelism

```c
x = initX(A, B);
y = initY(A, B);
z = initZ(A, B);

for (i = 0; i < N_ENTRIES; i++)
    x[i] = compX(y[i], z[i]);

for (i = 1; i < N_ENTRIES; i++) {
    x[i] = solveX(x[i-1]);
    z[i] = x[i] + y[i];
}

finalize1(&x, &y, &z);
finalize2(&x, &y, &z);
finalize3(&x, &y, &z);
```

No good?



## How Much Faster?

Speedup:

$$S = \frac{t_{serial}}{t_{parallel}}$$

In the ideal case, $t_{parallel} = \frac{t_{serial}}{p} \Rightarrow S = p$.

Yes!


## Limitations for Ideal Speedup

- data transfers (or, more generally, communication among tasks)
- task startup / finalization
- load balancing
- inherent sequential portions of the computation


## Effect of Sequential Fraction

With a fraction $f$ of the computation inherently sequential:

$$t_{parallel} = f \, t_{serial} + (1 - f) \, \frac{t_{serial}}{p}$$

$$S(p, f) = \frac{1}{f + \frac{1 - f}{p}}$$

$$\lim_{p \to \infty} S(p, f) = \frac{1}{f}$$

(Figure: speedup vs. number of processors for f = 0%, 5%, 10%, and 20%.)


## Software programming is more complex

- low-level parallel directives
- debugging is significantly more difficult
- lack of programming models and environments


## Rapid pace of change in computer system architecture

a parallel algorithm may not be efficient for the next generation of parallel computers


## Application Areas

Why bother with parallel computation?


Continued demand for greater computational power from many different domains!

Two major classes of problems in parallel computation:

## Grand Challenge problems

Problems that cannot be solved in a reasonable amount of time with today's computers.

## Embarrassingly Parallel problems

Problems that split into many independent subtasks, requiring little or no communication among them.

José Monteiro (DEI / IST), Parallel and Distributed Computing 2, 2011-11-14

## Grand Challenge problems

- Global Environmental/Ecosystem Modeling
- Biomechanics and biomedical imaging
- Fluid dynamics
- Molecular nanotechnology
- Nuclear power and weapons simulations


## Embarrassingly Parallel problems

- Numerical weather forecasting
- Computer graphics / animation
- Basic Local Alignment Search Tool (BLAST) in bioinformatics
- Monte-Carlo methods
- Genetic algorithms


## Example: Weather Forecasting

Time is discretized into intervals (1 second, 1 minute, 1 hour). Atmospheric conditions (temperature, pressure, humidity, etc.) for each cell are computed as a function of the neighboring cells' conditions in the current and previous time intervals.


## Example: Weather Forecasting

For the forecast of continental Portugal, take an area of 1000 km × 500 km = 5×10^5 km².

Assuming an atmosphere height of 50 km, there are 25×10^6 cells. If each cell takes 200 floating-point operations, we require a total of 5×10^9 operations for each time interval.

If the time interval is one second, and we want to compute the forecast for tomorrow (almost 10^5 seconds in a day), that is a total of 5×10^14 operations.

An Intel Pentium IV at 3.2 GHz performs at 3 GFLOPS, hence taking about 40 hours...


## Example: n-Body Problem

Each body has a given position, velocity, and acceleration that need to be computed for every time interval.

Each body attracts (and/or repels) every other body. For n bodies, there are a total of n² interactions that need to be accounted for.

Example: a galaxy has more than 10^11 stars, leading to more than 10^22 floating-point operations for each time interval!


## Processor Evolution


## Supercomputer Evolution


## Projected Performance Development

First PetaFLOPS system available in 2009!

Estimate of the human brain's computational power: 10^14 neural connections at 200 calculations per second ⇒ 20 PFLOPS.

## Types of Supercomputers

### Processor Arrays (SIMD)

Name associated with vector processing; very popular in early supercomputers.

### Multicore (SMP)

Set of processors sharing a common main memory.

### Massively Parallel Processors (MPP)

Processors with individual main memory and tightly coupled interconnections.

### Clusters

Processors with individual main memory, linked together using InfiniBand, Quadrics, Myrinet, or Gigabit Ethernet connections.

- COW / NOW: Cluster / Network Of Workstations
- Beowulf: cluster made of PCs running Linux using TCP/IP (COTS: Commodity Off-The-Shelf)

### Constellation

MPP / cluster where each node is a multicore.


## Warehouse-size Computers


## Multicores

Sample of today's multicore processors:

- AMD
- Intel
  - Core i7: six hyperthreaded cores
  - Dunnington (Xeon): six cores
- Sun
  - Niagara: 8 cores; 8-way fine-grain multithreading per core
- IBM
  - Power 7: dual-, quad-, hex-, and 8-core versions
  - Cell: 1 PPC core; 8 SPEs with SIMD parallelism

## Next Class

- review of computer architecture
- levels of parallelism
