
Parallel Processors

Session 4
Program Partitioning
and
Computational Granularity
Grain Size
• A grain is a program segment
• Grain size is a measure of the amount of computation
– A simple measure is the number of instructions in the grain
– Determines the basic program segments for parallel
processing
Program Partitioning
• Grain Sizes:
– Fine
– Medium
– Coarse
Granularity & Parallelism Level
Granularity depends on the level of parallel processing
• Instruction Level
– Less than 20 instructions
– Two to thousands of parallel segments
• Loop Level
– Less than 500 instructions
• Procedure Level
– Less than 2000 instructions
• Subprogram Level
– Thousands of instructions
• Job (Program) Level
– Tens of thousands of instructions
Levels of Parallelism

[Figure: the five levels of parallelism, from job level down to instruction level; the coarser levels communicate by message passing, the finer levels by shared variables]
Communication Latency

• A time measure of the communication overhead between machine subsystems:
– Memory latency:
• Time needed to access memory
– Synchronization latency:
• Time needed to synchronize two processors
– In both cases a limiting factor in terms of scalability
• Memory size
• Number of processors
• Latency depends on:
– Machine Architecture
– Implementing Technology
– Communication Pattern
Granularity & Latency

• Computational granularity and communication latency are closely related
• For good performance, granularity and latency must be balanced
• Techniques for
– Hiding latencies
– Minimizing communication latencies
– Optimizing grain size
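One illustrative way to see the balance (a sketch, not a formula from the slides): when the communication cost per grain is roughly fixed, the fraction of a grain's lifetime spent computing grows with grain size.

```python
def grain_efficiency(computation_cycles: int, communication_cycles: int) -> float:
    """Fraction of a grain's total time spent on useful computation."""
    return computation_cycles / (computation_cycles + communication_cycles)

# With a fixed per-grain communication cost, a coarser grain is more
# efficient -- but it also exposes less parallelism, hence the need
# for balance. (Cycle counts below are invented for illustration.)
fine = grain_efficiency(20, 200)     # small grain: 20 / 220
coarse = grain_efficiency(400, 200)  # large grain: 400 / 600
```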
Optimal Grain Size

• How can we partition a program for the shortest possible execution time?
• What is the optimal grain size?
– Number of grains
– Size of grains
– Grain scheduling
• Computation time
• Communication time
Grain Packing and Scheduling

• A method for program partitioning and scheduling
• The objective is to produce a short
schedule for fast execution of
subdivided program modules
• A program graph is used for grain
packing
Program Graph

• Shows the structure of the program
• Similar to a dependence graph
• Each node is a computational unit
• Grain size is measured by number of
basic machine cycles (processor and
memory)
Program Graph

• Nodes (n,s)
– n: node name or id
– s: the grain size of the node
• Edges (v,d)
– v: the output value from the source node which is
the input value to the destination node
– d: communication delay between the source node
and the destination node
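The (n, s) / (v, d) notation above can be sketched as a small data structure (the class and method names here are illustrative, not from the lecture):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str    # n: node name or id
    size: int    # s: grain size, in basic machine cycles

@dataclass
class Edge:
    src: str     # source node
    dst: str     # destination node
    value: str   # v: output of src that is an input of dst
    delay: int   # d: communication delay, in cycles

@dataclass
class ProgramGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name: str, size: int) -> None:
        self.nodes[name] = Node(name, size)

    def add_edge(self, src: str, dst: str, value: str, delay: int) -> None:
        self.edges.append(Edge(src, dst, value, delay))

g = ProgramGraph()
g.add_node("1", 4)              # e.g. node 1 with a grain size of 4 cycles
g.add_node("7", 6)
g.add_edge("1", "7", "a", 212)  # value a travels with a 212-cycle delay
```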
Example
Var a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q
Begin
1. a = 1
2. b = 2
3. c = 3
4. d = 4
5. e = 5
6. f = 6
7. g = a x b
8. h = c x d
9. i = d x e
10. j = e x f
11. k = d x f
12. l = j x k
13. m = 4 x l
14. n = 3 x m
15. o = n x i
16. p = o x h
17. q = p x q
End
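The fine-grain program graph for this example can be derived mechanically: each statement is a node, and an edge runs from the node that defines a variable to every node that reads it. A sketch (the self-reference of q in statement 17, as written on the slide, is skipped):

```python
# statement number -> (variable defined, variables read)
stmts = {
    1: ("a", []), 2: ("b", []), 3: ("c", []),
    4: ("d", []), 5: ("e", []), 6: ("f", []),
    7:  ("g", ["a", "b"]),  8:  ("h", ["c", "d"]),
    9:  ("i", ["d", "e"]),  10: ("j", ["e", "f"]),
    11: ("k", ["d", "f"]),  12: ("l", ["j", "k"]),
    13: ("m", ["l"]),       14: ("n", ["m"]),
    15: ("o", ["n", "i"]),  16: ("p", ["o", "h"]),
    17: ("q", ["p", "q"]),
}
defs = {var: nid for nid, (var, _) in stmts.items()}
edges = [
    (defs[r], nid)
    for nid, (_, reads) in stmts.items()
    for r in reads
    if defs[r] != nid      # skip the q self-reference in statement 17
]
```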
Grain Packing
• Apply fine grain first
– Higher degree of parallelism
• Combine multiple fine-grain nodes into a coarse-grain node if it:
– Reduces communication latency
– Reduces scheduling overhead
• Fine-grain operations within a coarse-grain node
are assigned to the same processor
– Reduces interprocessor communication
– Reduces parallelism
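A sketch of the packing step itself, under the rules above: the sizes of packed nodes add, and edges internal to the pack lose their interprocessor delay (all names and numbers here are illustrative):

```python
def pack(nodes: dict, edges: list, group: set, new_name: str):
    """Merge the nodes in `group` into one coarse grain.

    nodes: {name: grain size in cycles}; edges: [(src, dst, delay)].
    Internal edges vanish: fine grains inside one coarse grain run on
    the same processor, so they pay no interprocessor delay.
    """
    merged = dict(nodes)
    merged[new_name] = sum(merged.pop(n) for n in group)
    kept = []
    for src, dst, delay in edges:
        s = new_name if src in group else src
        d = new_name if dst in group else dst
        if s != d:                  # drop edges internal to the pack
            kept.append((s, d, delay))
    return merged, kept

nodes, edges = pack(
    {"A": 4, "B": 4, "C": 6},
    [("A", "B", 212), ("B", "C", 212)],
    {"A", "B"},
    "AB",
)
# nodes == {"C": 6, "AB": 8}; only the AB -> C edge (delay 212) remains
```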
Grain Scheduling

[Figure: grain scheduling timeline, showing processor idle time and communication delay between grains]

Optimal grain size → shortest schedule


Summary
Four steps:
1. Construct a fine-grain program graph
2. Schedule the fine-grain computation
3. Perform grain packing to produce the
coarse grains
4. Generate a parallel schedule based on
the packed graph
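Steps 2 and 4 both need a scheduler. A minimal greedy earliest-start list scheduler (an illustrative sketch, not the algorithm from the lecture) might look like:

```python
from collections import defaultdict

def list_schedule(nodes, edges, num_procs):
    """Greedy earliest-start schedule.

    nodes: {name: size in cycles}; edges: [(src, dst, delay)].
    An edge's delay is paid only when src and dst run on different
    processors. Returns (finish_times, placement).
    """
    preds, succs = defaultdict(list), defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for u, v, d in edges:
        preds[v].append((u, d))
        succs[u].append(v)
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    proc_free = [0] * num_procs
    finish, place = {}, {}
    while ready:
        n = ready.pop(0)
        best_start, best_proc = None, None
        for p in range(num_procs):
            # data from a predecessor on another processor arrives late
            arrival = max(
                (finish[u] + (0 if place[u] == p else d)
                 for u, d in preds[n]),
                default=0,
            )
            start = max(proc_free[p], arrival)
            if best_start is None or start < best_start:
                best_start, best_proc = start, p
        finish[n] = best_start + nodes[n]
        place[n] = best_proc
        proc_free[best_proc] = finish[n]
        for v in succs[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return finish, place
```

When the communication delay on a chain exceeds the wait for the producer's processor, the scheduler keeps both nodes on one processor; independent nodes spread across processors.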
Static Scheduling
• Grain packing may not always produce a shorter schedule
• Static scheduling mechanisms can be used to shorten the schedule
• Node duplication is a method that can reduce interprocessor communication
A Packed Schedule

• Contains idle times and long interprocessor communication delays
Node Duplication

• Extra processing during otherwise idle processor time eliminates long communication delays
Comparison

• The schedule is almost 50% shorter
• Grain packing and node duplication can be used together to shorten the schedule
Example
Grain Size Calculation
Fine Grain Program Graph
Calculation of Communication
Delay

d = T1 + T2 + T3 + T4 + T5 + T6
d = 20 + 20 + 32 + 20 + 20 + 100
d = 212 cycles

T3: 32-bit transfer over a 20 Mbps link
T6: software protocol overhead (assume 5 move instructions)
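The arithmetic on this slide, reproduced directly:

```python
# T1..T6 as given on the slide; T3 is the 32-bit transfer on a
# 20 Mbps link, T6 the software protocol overhead (5 move instructions).
T = [20, 20, 32, 20, 20, 100]
d = sum(T)  # 212 cycles
```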
Sequential Schedule
Parallel Schedule
Speedup

Speedup Factor = 864 / 741 = 1.16


Grain Packing
Schedule for Packed Program

Speedup Factor = 864 / 446 = 1.94
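Both speedup figures check out against the slide's cycle counts:

```python
sequential = 864         # cycles for the sequential schedule
parallel_fine = 741      # parallel schedule, fine-grain graph
parallel_packed = 446    # parallel schedule after grain packing

speedup_fine = sequential / parallel_fine      # ~1.16
speedup_packed = sequential / parallel_packed  # ~1.94
```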


Data Flow Computers
Program Flow Mechanisms
• Control flow mechanism
– Conventional computers
• Program counter sequences the execution of the instructions
– Sequential in nature
• Data flow mechanism
– Data driven
• The execution of instructions is driven by data availability
• Reduction mechanism
– Demand driven
• An instruction is executed based on the demand for its
results
Data Flow Computers
• The instructions are not ordered
– No program counter or control sequencer
• An instruction is ready to execute as soon as its operands (data) are ready
– Natural capability for parallelism
– Fine-grain parallelism at the instruction level
• Data is not stored in shared memory
• Requires mechanisms to detect data
availability and to match it with instructions
A Dataflow Architecture
• Several interconnected processing
elements
• A tagged token architecture
– Each datum (token) is tagged with a reference to the instruction that needs it
– Instructions are stored in program memory
– Token-matching mechanism dispatches
instructions whose data is ready
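A toy token-matching loop conveys the mechanism (purely illustrative; the tags, instructions, and values below are invented):

```python
from collections import defaultdict

# Each instruction fires as soon as all of its tagged operand
# tokens have arrived in the matching store.
program = {
    # tag: (operation, arity, destination tags for the result)
    "mul1": (lambda x, y: x * y, 2, ["add1"]),
    "add1": (lambda x, y: x + y, 2, []),
}

def run(initial_tokens):
    waiting = defaultdict(list)    # matching store: tag -> operands so far
    tokens = list(initial_tokens)  # (tag, value) pairs in flight
    results = {}
    while tokens:
        tag, value = tokens.pop(0)
        waiting[tag].append(value)
        op, arity, dests = program[tag]
        if len(waiting[tag]) == arity:   # all operands matched: fire
            out = op(*waiting.pop(tag))
            results[tag] = out
            tokens += [(d, out) for d in dests]
    return results

# run([("mul1", 3), ("mul1", 4), ("add1", 5)]) computes 3*4, then 12+5
```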
The Global Architecture

Tagged data tokens are passed to PEs through the routing network
Inside a Processing Element
Example

• A sample program and its dataflow graph
• A dataflow graph is similar to a dependence graph or program graph
• Shows data tokens passed along the edges of the graph
Sequential Execution

Assume:
add: 1 cycle
multiply: 2 cycles
divide: 3 cycles
Data-driven Execution
Parallel Execution in Shared
Memory Multiprocessor

s & t: intermediate values passed through shared memory; not needed in the dataflow case
Demand-Driven Mechanisms
• Computation is triggered by the demand for an
operation’s result
• Known as “reduction machine”
• Example:
a = ((b + 1) x c – (d / e))
– In the data-driven approach, the operations start from the innermost operations (bottom-up):
• First + and /
• Then x
• And finally –
– In the demand-driven approach, the operations start from the outermost operations (top-down):
• First –
• Then x and /
• And finally +
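The two evaluation orders can be demonstrated on this expression (a sketch; the variable values in `env` are invented for illustration):

```python
# Expression tree for a = ((b + 1) x c - (d / e)) as nested tuples:
expr = ("-", ("*", ("+", "b", 1), "c"), ("/", "d", "e"))
env = {"b": 3, "c": 2, "d": 8, "e": 4}
ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def eval_demand(node, order):
    """Demand-driven (top-down): the demand for an operator's result is
    recorded before its operands are evaluated."""
    if not isinstance(node, tuple):
        return env.get(node, node)
    op, l, r = node
    order.append(op)                 # demanded first
    return ops[op](eval_demand(l, order), eval_demand(r, order))

def eval_data(node, order):
    """Data-driven (bottom-up): an operator fires only after its operands
    are available. (This traversal is sequential; on a dataflow machine
    + and / would fire in parallel.)"""
    if not isinstance(node, tuple):
        return env.get(node, node)
    op, l, r = node
    a, b = eval_data(l, order), eval_data(r, order)
    order.append(op)                 # fires last
    return ops[op](a, b)

demanded, fired = [], []
eval_demand(expr, demanded)   # outermost "-" is demanded first
eval_data(expr, fired)        # innermost "+" fires first, "-" last
```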
Reduction Machine Models
• String reduction:
– Each demander gets a separate copy of the expression for its
own evaluation
– A long string expression is reduced to a single value in a
recursive fashion
– Each reduction step has an operator followed by an embedded
reference to demand the corresponding input operands
– The operator is suspended while its input arguments are being
evaluated
– An expression is said to be fully reduced when all the arguments
are replaced by equivalent values
• Graph reduction:
– The expression is represented as a directed graph
– The graph is reduced by evaluation of branches or subgraphs
– Branches or subgraphs can be reduced in parallel
