
Introduction to Parallel Processing

Shantanu Dutt, University of Illinois at Chicago

Acknowledgements

- Ashish Agrawal, IIT Kanpur, Fundamentals of Parallel Processing (slides), w/ some modifications and augmentations by Shantanu Dutt
- John Urbanic, Parallel Computing: Overview (slides), w/ some modifications and augmentations by Shantanu Dutt
- John Mellor-Crummey, COMP 422 Parallel Computing: An Introduction, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt

Outline

- Moore's Law and its limits
- Different uni-processor performance enhancement techniques and their limits
- Classification of parallel computations
- Classification of parallel architectures: distributed and shared memory
- Simple examples of parallel processing
- Example applications
- Future advances
- Summary

Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Moore's Law & Need for Parallel Processing


- Chip performance doubles every 18-24 months
- Power consumption is proportional to frequency
- Limits of serial computing:
  - Heating issues
  - Limit to transmission speeds
  - Leakage currents
  - Limit to miniaturization
- Multi-core processors are already commonplace
- Most high-performance servers are already parallel

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

Quest for Performance

- Pipelining
- Superscalar architecture
- Out-of-order execution
- Caches
- Instruction set design advancements
- Parallelism: multi-core processors, clusters, grids

This is the future

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

Pipelining

- Illustration of a pipeline using the fetch, load, execute, store stages
- At the start of execution: wind-up; at the end of execution: wind-down
- Pipeline stalls due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction hurt performance and speedup
- Pipeline depth = number of stages, i.e., the number of instructions in execution simultaneously; the Intel Pentium 4 has 35 stages

Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

Pipelining

Tpipe(n), the pipelined time to process n instructions, = fill time + n * max{ti}, where ti = execution time of the ith stage
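To see what this formula implies, the short C program below (a hypothetical illustration, not from the slides; the five stage times, the pipeline depth and the instruction count n = 1000 are made-up numbers) computes the serial time, the pipelined time with fill time = (depth-1)*max{ti}, and the resulting speedup:

  #include <stdio.h>

  int main(void) {
      double t[] = {1.0, 1.5, 1.0, 2.0, 1.0};  /* hypothetical stage times (ns) */
      int k = 5;                               /* pipeline depth                */
      long n = 1000;                           /* number of instructions        */

      double tmax = 0.0, tsum = 0.0;
      for (int i = 0; i < k; i++) {
          if (t[i] > tmax) tmax = t[i];
          tsum += t[i];
      }

      /* No pipelining: every instruction passes through all stages back-to-back. */
      double t_serial = n * tsum;
      /* Pipelined: fill time of (k-1) cycles, then one instruction per cycle,
         with the clock period set by the slowest stage, max{ti}.                 */
      double t_pipe = (k - 1) * tmax + n * tmax;

      printf("serial = %.1f ns, pipelined = %.1f ns, speedup = %.2f\n",
             t_serial, t_pipe, t_serial / t_pipe);
      return 0;
  }

With these numbers the speedup approaches tsum/tmax (here 6.5/2.0 = 3.25) as n grows, which is why a deep pipeline pays off only when the stages are balanced and stalls are rare.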

Cache

- Desire for fast, cheap and non-volatile memory
- Memory speed grows at ~7% per annum while processor speed grows at ~50% p.a.
- Cache: fast, small memory; L1 and L2 caches
- Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles
- Cache hit and miss
- Prefetching is used to avoid cache misses at the start of the execution of the program
- Cache lines (a whole line is fetched on a miss) are used to hide latency in case of a cache miss
- Order of search: L1 cache -> L2 cache -> RAM -> disk
- Cache coherency: correctness of data; important for distributed parallel computing
- Limit to cache improvement: improving cache performance will at most improve memory efficiency to match processor efficiency

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur
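The cache-line and miss-latency points above can be made concrete with a small C sketch (hypothetical, not from the slides): the matrix is stored row-major, so the first loop nest walks memory sequentially and reuses every fetched cache line, while the second strides across rows, touches a different line on almost every access, and typically runs noticeably slower.

  #include <stdio.h>

  #define N 1024
  static double a[N][N];               /* stored row-major in C */

  int main(void) {
      double sum = 0.0;

      /* Cache-friendly: consecutive iterations touch consecutive addresses,
         so each cache line brought in on a miss is fully used.              */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];

      /* Cache-unfriendly: stride of N*sizeof(double) bytes between accesses,
         so each access touches a different cache line and gets little reuse. */
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];

      printf("%f\n", sum);
      return 0;
  }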

[Figure slides on uni-processor parallelism techniques; legible annotations:]
- Instruction-level parallelism: degree is generally low and dependent on how the sequential code has been written, so not very effective
- SIMD (single-instruction, multiple data)
- (exs. of limited data parallelism)
- (exs. of limited & low-level functional parallelism)



Thus we need the development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit that parallelism with minimum interaction/communication between the parallel parts.


[Figure slides on hardware multithreading; annotations: simultaneous multithreading (SMT), multi-threading]


Applications of Parallel Processing

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur


Example problems & solutions

Easy parallel situation: each data part is independent; no communication is required between the execution units solving two different parts.

Heat equation: the initial temperature is zero on the boundaries and high in the middle; the boundary temperature is held at zero. The calculation of an element is dependent upon its neighbor elements.

[Figure: independent data partitions data1, data2, ..., data N]

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

Serial code (the finite-difference update):

  do iy = 2, ny-1
    do ix = 2, nx-1
      u2(ix,iy) = u1(ix,iy) + cx*(u1(ix+1,iy) + u1(ix-1,iy) - 2*u1(ix,iy))
                            + cy*(u1(ix,iy+1) + u1(ix,iy-1) - 2*u1(ix,iy))
    enddo
  enddo

Parallel (SPMD) pseudocode; the Master can be one of the workers:

  find out if I am MASTER or WORKER

  if I am MASTER                                  (Master)
    initialize array
    send each WORKER starting info and subarray
    do until all WORKERS converge
      gather from all WORKERS convergence data
      broadcast to all WORKERS convergence signal
    end do
    receive results from each WORKER

  else if I am WORKER                             (Workers)
    receive from MASTER starting info and subarray
    do until solution converged {
      update time
      non-blocking send neighbors my border info
      non-blocking receive neighbors border info
      update interior of my portion of solution array
      wait for non-blocking communication to complete
      update border of my portion of solution array
      determine if my solution has converged
      if so { send MASTER convergence signal
              recv. from MASTER convergence signal }
    } end do
    send MASTER results
  endif

[Figure: problem grid]

Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
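For concreteness, here is a minimal C/MPI sketch of how the WORKER part of the pseudocode above might look, assuming a 1-D decomposition into horizontal strips with one ghost row above and below each strip; the sizes NX and ROWS, the message tag, the convergence threshold and the function name worker_loop are illustrative assumptions, not the course's actual code.

  /* Hypothetical worker loop for the heat equation, 1-D strip decomposition.
     up/down are the neighbor ranks (MPI_PROC_NULL for the top/bottom strip);
     u1 and u2 each hold ROWS+2 rows of NX values (rows 0 and ROWS+1 are ghosts)
     and are assumed to be initialized identically, boundary values included.  */
  #include <mpi.h>

  #define NX   256
  #define ROWS 64
  #define IDX(i,j) ((i)*NX + (j))

  void worker_loop(double *u1, double *u2, double cx, double cy,
                   int up, int down, MPI_Comm comm)
  {
      int converged = 0;
      while (!converged) {
          /* Non-blocking exchange of border rows with the two neighbors. */
          MPI_Request req[4];
          MPI_Irecv(&u1[IDX(0,0)],      NX, MPI_DOUBLE, up,   0, comm, &req[0]);
          MPI_Irecv(&u1[IDX(ROWS+1,0)], NX, MPI_DOUBLE, down, 0, comm, &req[1]);
          MPI_Isend(&u1[IDX(1,0)],      NX, MPI_DOUBLE, up,   0, comm, &req[2]);
          MPI_Isend(&u1[IDX(ROWS,0)],   NX, MPI_DOUBLE, down, 0, comm, &req[3]);

          /* Update the interior rows (2..ROWS-1), which need no ghost data. */
          for (int i = 2; i <= ROWS - 1; i++)
              for (int j = 1; j <= NX - 2; j++)
                  u2[IDX(i,j)] = u1[IDX(i,j)]
                    + cx*(u1[IDX(i+1,j)] + u1[IDX(i-1,j)] - 2*u1[IDX(i,j)])
                    + cy*(u1[IDX(i,j+1)] + u1[IDX(i,j-1)] - 2*u1[IDX(i,j)]);

          MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

          /* Update the two border rows (1 and ROWS), which need the ghosts. */
          int border[2] = {1, ROWS};
          for (int b = 0; b < 2; b++) {
              int i = border[b];
              for (int j = 1; j <= NX - 2; j++)
                  u2[IDX(i,j)] = u1[IDX(i,j)]
                    + cx*(u1[IDX(i+1,j)] + u1[IDX(i-1,j)] - 2*u1[IDX(i,j)])
                    + cy*(u1[IDX(i,j+1)] + u1[IDX(i,j-1)] - 2*u1[IDX(i,j)]);
          }

          /* Global convergence test (plays the role of the MASTER's gather
             and broadcast of convergence information).                      */
          double delta = 0.0, global;
          for (int i = 1; i <= ROWS; i++)
              for (int j = 1; j <= NX - 2; j++) {
                  double d = u2[IDX(i,j)] - u1[IDX(i,j)];
                  if (d < 0) d = -d;
                  if (d > delta) delta = d;
              }
          MPI_Allreduce(&delta, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
          converged = (global < 1e-6);

          double *tmp = u1; u1 = u2; u2 = tmp;   /* advance to the next time step */
      }
  }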


How to interconnect the multiple cores/processors is a major consideration in a parallel architecture



Parallelism - A simplistic understanding


Multiple tasks at once: distribute work onto multiple execution units. A classification of parallelism:

- Data parallelism: divide the data set and solve each sector similarly on a separate execution unit.
- Functional (or control) parallelism: divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right? (A code sketch of both styles follows below.)

[Figures: sequential execution vs. data parallelism; data parallelism vs. functional parallelism]

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur
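As a small illustration of the two styles (a hypothetical sketch, not from the slides; the array, its contents and the two tasks are made up), the C/OpenMP fragment below first applies the same operation to different sections of one data set (data parallelism) and then runs two different tasks at the same time (functional parallelism):

  #include <stdio.h>
  #include <math.h>

  #define N 1000000
  static double data[N];

  int main(void) {
      double sum = 0.0, sumsq = 0.0;
      for (int i = 0; i < N; i++) data[i] = i * 0.001;

      /* Data parallelism: the loop iterations are split across threads,
         each thread handling its own section of the data set.            */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++)
          sum += data[i];

      /* Functional (control) parallelism: two different tasks execute
         concurrently on different threads.                               */
      #pragma omp parallel sections
      {
          #pragma omp section
          { for (int i = 0; i < N; i++) sumsq += data[i] * data[i]; }
          #pragma omp section
          { printf("sum  = %f\n", sum); }
      }
      printf("norm = %f\n", sqrt(sumsq));
      return 0;
  }

(Compile with an OpenMP-capable compiler, e.g. with the -fopenmp flag; without it the pragmas are simply ignored and the code runs serially.)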

Flynn's Classification

Flynn's Classical Taxonomy:
- Single Instruction, Single Data (SISD): your single-core uniprocessor PC
- Single Instruction, Multiple Data (SIMD): special-purpose, low-granularity multi-processor machine with a single control unit relaying the same instruction to all processors (with different data) every clock cycle
- Multiple Instruction, Single Data (MISD): pipelining is a major example
- Multiple Instruction, Multiple Data (MIMD): the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is very different from SIMD. Why? (See the sketch below.)
Note that data vs. control parallelism is another classification, independent of the above.

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur
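One way to see why SPMD is very different from SIMD: in SPMD each process runs its own copy of the same program and may follow a different control path at its own pace, whereas SIMD processing elements execute the same instruction in lockstep under one control unit. A minimal hypothetical C/MPI illustration (the printed messages are made up):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Single program, multiple data: every process executes this same code,
         but each one can take a different branch -- something the lockstep
         processing elements of a SIMD machine cannot do.                     */
      if (rank == 0)
          printf("rank 0 of %d: acting as coordinator\n", size);
      else if (rank % 2 == 0)
          printf("rank %d: working on an even-numbered subdomain\n", rank);
      else
          printf("rank %d: working on an odd-numbered subdomain\n", rank);

      MPI_Finalize();
      return 0;
  }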



Flynn's Classification (contd.)

- Data parallelism: SIMD and SPMD fall into this category
- Functional parallelism: MISD falls into this category


Parallel Arch. Classification

Multi-processor architectures:
- Distributed memory: the most prevalent architecture model for # processors > 8
  - Indirect interconnection networks
  - Direct interconnection networks
- Shared memory:
  - Uniform Memory Access (UMA)
  - Non-Uniform Memory Access (NUMA): distributed shared memory

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur


Distributed Memory: Message-Passing Architectures

- Each processor P (with its own local cache C) is connected to exclusive local memory, i.e., no other CPU has direct access to it.
- Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
- On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
- Non-blocking vs. blocking communication (see the sketch below).
- Direct vs. indirect communication/interconnection networks. Example: a 2x4 mesh network (a direct connection network).

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur
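To make the blocking vs. non-blocking distinction concrete, here is a small hypothetical C/MPI fragment (not from the slides; the function name, tag and buffer sizes are made up). A blocking MPI_Send/MPI_Recv returns only when its buffer is safe to reuse, while the non-blocking MPI_Isend/MPI_Irecv return immediately so the process can compute while the message is in transit, completing the exchange later with MPI_Waitall:

  #include <mpi.h>

  /* Exchange one double with a neighboring rank, shown two ways. */
  void exchange(double *mine, double *theirs, int neighbor, MPI_Comm comm)
  {
      /* Blocking: each call returns only when its buffer may be reused.
         (If both partners call MPI_Send first, this can deadlock for large
         messages; MPI_Sendrecv or the non-blocking form below avoids that.) */
      MPI_Send(mine,   1, MPI_DOUBLE, neighbor, 0, comm);
      MPI_Recv(theirs, 1, MPI_DOUBLE, neighbor, 0, comm, MPI_STATUS_IGNORE);

      /* Non-blocking: post both operations, overlap them with useful
         computation, and only then wait for completion.                     */
      MPI_Request req[2];
      MPI_Irecv(theirs, 1, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
      MPI_Isend(mine,   1, MPI_DOUBLE, neighbor, 0, comm, &req[1]);
      /* ... do other work here ... */
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  }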

The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)

Has 56 compute nodes/computers and a master node


Master here has a different meaning: it is generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes. This is different from the master node in a parallel algorithm (e.g., the one we saw for the finite-difference heat distribution problem), which would actually be one of the compute nodes; that master generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may additionally perform a part of the computation.

Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level network. Each node (compute or master) has 2 processors; the processors are single-core on some nodes and dual-core on others. See http://accc.uic.edu/service/arg/nodes


System Computational Actions in a Message-Passing Program


(a) Two basic parallel processes X and Y, and their data dependency (X needs the value of b produced by Y):

  Proc. X: a := b+c;            Proc. Y: b := x*y;

Message-passing mapping:

  Proc. X: recv(P2, b);         Proc. Y: b := x*y;
           a := b+c;                     send(P1, b);

(b) Their mapping to a message-passing multicomputer: the processor/core P(X) containing X and the processor/core P(Y) containing Y are connected by a link (direct or indirect) between the two processors, over which data item b is message-passed from P(Y) to P(X).
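A concrete (hypothetical) C/MPI rendering of this mapping, with rank 1 playing process X (P1) and rank 2 playing process Y (P2) to mirror the slide's recv(P2, b) / send(P1, b); the operand values are made up:

  #include <stdio.h>
  #include <mpi.h>

  /* Run with at least 3 processes (e.g. mpirun -np 3 ./a.out); rank 0 is idle. */
  int main(int argc, char **argv) {
      int rank;
      double a, b, c = 2.0, x = 3.0, y = 4.0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 2) {                 /* process Y:  b := x*y;  send(P1, b) */
          b = x * y;
          MPI_Send(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {          /* process X:  recv(P2, b);  a := b+c */
          MPI_Recv(&b, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          a = b + c;
          printf("a = %f\n", a);
      }
      MPI_Finalize();
      return 0;
  }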



Shared Memory Arch.: UMA


- Flat memory model
- Memory bandwidth and latency are the same for all processors and all memory locations
- Simplest example: a dual-core processor
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Cache-coherent UMA: consistent cache values of the same data item in different processor/core caches

[Figures: dual-core and quad-core processors, showing L1 and L2 caches]

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

System Computational Actions in a Shared-Memory Program


(a) Two basic parallel processes X and Y, and their data dependency (X needs the value of b produced by Y):

  Proc. X: a := b+c;            Proc. Y: b := x*y;

Shared-memory mapping:

  Proc. Y: b := x*y;
  Possible actions by the O.S.:
  (i) Since b is a shared data item (e.g., designated by the compiler or programmer), check b's location to see if it can be written to (all previous reads done: read_cntr for b = 0).
  (ii) If so, write b to its location, mark its status bit as written by Y, and initialize read_cntr for b to a pre-determined value.

  Proc. X: a := b+c;
  Possible actions by the O.S.:
  (i) Since b is a shared data item (e.g., designated by the compiler or programmer), check b's location to see if it has been written to by Y (or by any process, if we don't care about the writing process).
  (ii) If so {read b & decrement read_cntr for b}, else go to (i) and busy-wait (check periodically).

(b) Their mapping to a shared-memory multiprocessor: P(X) and P(Y) both access data item b in the shared memory.
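A minimal shared-memory analogue of the same dependency, sketched with POSIX threads and a C11 atomic flag standing in for the read/write bookkeeping described above (the flag, the thread functions and the operand values are illustrative assumptions; real programs would normally use a mutex, condition variable or higher-level construct rather than a busy-wait):

  #include <stdio.h>
  #include <pthread.h>
  #include <stdatomic.h>

  static double a, b, c = 2.0, x = 3.0, y = 4.0;  /* variables in shared memory */
  static atomic_int b_written = 0;                /* "b has been written by Y"  */

  static void *proc_Y(void *arg) {        /* b := x*y */
      b = x * y;
      atomic_store(&b_written, 1);        /* publish b: mark it as written      */
      return NULL;
  }

  static void *proc_X(void *arg) {        /* a := b+c */
      while (atomic_load(&b_written) == 0)
          ;                               /* busy-wait until Y has written b    */
      a = b + c;
      return NULL;
  }

  int main(void) {
      pthread_t tx, ty;
      pthread_create(&ty, NULL, proc_Y, NULL);
      pthread_create(&tx, NULL, proc_X, NULL);
      pthread_join(ty, NULL);
      pthread_join(tx, NULL);
      printf("a = %f\n", a);              /* prints a = 14.000000 */
      return 0;
  }

Note that here b is simply read and written in the shared address space; no explicit send/receive is needed, only synchronization on when b is ready.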



Distributed Shared Memory Arch.: NUMA


- Memory is physically distributed but logically shared; the physical layout is similar to the distributed-memory (message-passing) case.
- The aggregated memory of the whole system appears as one single address space.
- Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (local vs. remote access).
- Two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing architectures, only here the links are used by the O.S. to transmit read/write non-local data to/from processor/non-local memory).
- Advantage: scalability (compared to UMA).
- Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among processors than to think of all of it as occupying a large monolithic address space that each processor can access).

2x2 mesh connection

Most text from Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur


An example of an SPMD message-passing parallel program


SPMD message-passing parallel program (contd.)

[Program figure; the legible fragment reads "node xor D"]
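The program itself is not reproduced on these extracted slides, but the legible fragment "node xor D" suggests the standard hypercube (recursive-doubling) pattern, in which a node's communication partner along dimension d is node XOR 2^d. Below is a hypothetical C/MPI sketch of that pattern computing a global sum; it assumes the number of processes is a power of two, and the variable names and the use of a sum are illustrative, not taken from the original program:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int node, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &node);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* assumed to be a power of 2 */

      double my = (double)node, other;          /* each node's local value    */
      for (int dim = 1; dim < nprocs; dim <<= 1) {
          int partner = node ^ dim;             /* "node xor D"               */
          MPI_Sendrecv(&my,    1, MPI_DOUBLE, partner, 0,
                       &other, 1, MPI_DOUBLE, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          my += other;                          /* combine the partial sums   */
      }
      printf("node %d: global sum = %f\n", node, my);

      MPI_Finalize();
      return 0;
  }

After log2(P) exchange steps every node holds the global sum, which is the same communication structure a hypercube-connected SPMD machine exploits directly.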


Summary

- Serial computers / microprocessors will probably not get much faster; parallelization is unavoidable
- Pipelining, caches and other optimization strategies for serial computers are reaching a plateau
- Data and functional parallelism
- Flynn's taxonomy: SIMD, MISD, MIMD/SPMD
- Intro to parallel architectures:
  - Distributed memory
  - Shared memory
    - Uniform Memory Access (UMA)
    - Non-Uniform Memory Access (NUMA)
- Application examples
- Parallel program/algorithm examples

Most text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur


Additional References

- Computer Organization and Design, Patterson & Hennessy
- Modern Operating Systems, Tanenbaum
- Concepts of High Performance Computing, Georg Hager & Gerhard Wellein
- "Cramming more components onto Integrated Circuits", Gordon Moore, 1965
- Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp
- "The Landscape of Parallel Computing Research: A View from Berkeley", 2006

Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

