Session 4
Program Partitioning
and
Computational Granularity
Grain Size
• Grain is a program segment
• Grain size is a measure of the amount of computation
– A simple measure is the number of instructions in the grain
– Determines the basic program segments for parallel
processing
Program Partitioning
• Grain Sizes:
– Fine
– Medium
– Coarse
Granularity & Parallelism Level
Granularity depends on the level of parallel processing
• Instruction Level
– Less than 20 instructions
– Two to thousands of parallel segments
• Loop Level
– Less than 500 instructions
• Procedure Level
– Less than 2000 instructions
• Subprogram Level
– Thousands of instructions
• Job (Program) Level
– Tens of thousands of instructions
Levels of Parallelism
[Figure: the five levels of parallelism, from fine grain with shared-variable communication to coarse grain with message-passing communication]
Communication Latency
• Nodes (n, s)
– n: node name or id
– s: the grain size of the node
• Edges (v, d)
– v: the output value of the source node, which is the input value to the destination node
– d: the communication delay between the source node and the destination node
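As a small sketch (all names hypothetical), a program graph with these node and edge attributes can be represented directly:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str        # n: node name or id
    grain_size: int  # s: grain size, e.g. instruction count or cycles

@dataclass
class Edge:
    value: str  # v: output of the source node, input of the destination node
    delay: int  # d: communication delay between source and destination

# a two-node graph: node 1 sends value "a" to node 2 with a 6-cycle delay
nodes = {"1": Node("1", 4), "2": Node("2", 4)}
edges = {("1", "2"): Edge("a", 6)}
```

A scheduler working over such a graph would read `grain_size` for computation cost and `delay` for communication cost on each edge.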
Example
Var a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q
Begin
1. a=1
2. b=2
3. c=3
4. d=4
5. e=5
6. f=6
7. g = a x b
8. h = c x d
9. i = d x e
10. j = e x f
11. k = d x f
12. l = j x k
13. m = 4 x l
14. n = 3 x m
15. o = n x i
16. p = o x h
17. q = p x q
End
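As a sanity check, the program above can be transcribed into Python ("x" denotes multiplication). Note that statement 17 reads q before q has been assigned any value, so the sketch stops at statement 16:

```python
# Direct transcription of statements 1-16 of the example program.
# Statement 17 (q = p x q) uses q before it is defined, so it is omitted;
# executing it as written would fail.
a, b, c, d, e, f = 1, 2, 3, 4, 5, 6   # statements 1-6
g = a * b        # 7
h = c * d        # 8
i = d * e        # 9
j = e * f        # 10
k = d * f        # 11
l = j * k        # 12
m = 4 * l        # 13
n = 3 * m        # 14
o = n * i        # 15
p = o * h        # 16
print(p)  # → 2073600
```

Each assignment is one fine-grain node; the data dependences (e.g. l needs j and k) are exactly the edges of the program graph defined above.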
Grain Packing
• Apply a fine-grain partition first
– Yields a higher degree of parallelism
• Combine multiple fine-grain nodes into a coarse-grain node if doing so
– Reduces communication latency
– Reduces scheduling overhead
• Fine-grain operations within a coarse-grain node are assigned to the same processor
– Reduces interprocessor communication
– Reduces parallelism
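The packing trade-off can be sketched with an assumed three-node graph (grain sizes and delay are made-up numbers): two independent grains feed a third. Running them on separate processors exploits parallelism but pays the communication delay; packing all three onto one processor removes the delay but serializes the work:

```python
def schedule_length(packed: bool, s=(4, 4, 4), d=6):
    """Finish time of two independent grains feeding a third.

    packed=False: grains 0 and 1 run in parallel on two processors;
    grain 2 runs on grain 1's processor, so only grain 0's result
    pays the communication delay d.
    packed=True: all three grains on one processor; no communication
    delay, but no overlap either.
    """
    s0, s1, s2 = s
    if packed:
        return s0 + s1 + s2
    return max(s0 + d, s1) + s2

# with a 6-cycle delay, packing wins: 12 cycles vs 14
print(schedule_length(True), schedule_length(False))  # → 12 14
# with zero delay, parallelism wins: 8 cycles vs 12
print(schedule_length(False, d=0))  # → 8
```

This is the whole idea of grain packing: merge fine grains only when the saved communication latency outweighs the lost parallelism.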
Grain Scheduling
[Figure: grain-scheduling timeline showing idle time and communication delay]
d = T1 + T2 + T3 + T4 + T5 + T6
d = 20 + 20 + 32 + 20 + 20 + 100
d = 212 cycles
Tagged data tokens are passed to PEs through the routing network
Inside a Processing Element
Example
Assume:
add: 1 cycle
multiply: 2 cycles
divide: 3 cycles
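Under these assumed costs, the firing times of a data-driven execution can be sketched: each operation fires as soon as all of its inputs are available. The expression used here, a = ((b + 1) x c) − (d / e), is the one evaluated later in these notes, taken as a stand-in for the slide's dataflow graph:

```python
# Data-driven firing-time sketch with assumed costs:
# add 1 cycle, multiply 2 cycles, divide 3 cycles.
COST = {"+": 1, "*": 2, "/": 3, "-": 1}

# each operation: (operator, inputs); names not in ops are ready at cycle 0
ops = {
    "t1": ("+", ["b", "1"]),   # b + 1
    "t2": ("*", ["t1", "c"]),  # (b + 1) * c
    "t3": ("/", ["d", "e"]),   # d / e
    "a":  ("-", ["t2", "t3"]), # final result
}

def finish(name, memo={}):
    """Earliest cycle at which this value is available (unlimited PEs)."""
    if name not in ops:          # input operand: available immediately
        return 0
    if name not in memo:
        op, ins = ops[name]
        memo[name] = max(finish(i) for i in ins) + COST[op]
    return memo[name]

# + and / fire together; * overlaps with /; - fires last
print(finish("a"))  # → 4
```

With unlimited processing elements, the divide (3 cycles) and the add-then-multiply chain (1 + 2 cycles) overlap completely, so only the final subtract is added to the critical path.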
Data-driven Execution
Parallel Execution in Shared
Memory Multiprocessor
s & t: intermediate values passed through shared memory; not needed in the dataflow case
Demand-Driven Mechanisms
• Computation is triggered by the demand for an operation’s result
• Such a machine is known as a “reduction machine”
• Example:
a = ((b + 1) x c – (d / e))
– In the data-driven approach, the operations start from the innermost
operations (bottom-up):
• First + and /
• Then x
• And finally –
– In the demand-driven approach, the operations start from the outermost
operation (top-down):
• First –
• Then x and /
• And finally +
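The demand-driven (top-down) order can be sketched with thunks: nothing is computed until a's value is demanded, and the outermost "−" then demands its operands, which recursively demand theirs. The operand values below are made-up:

```python
# Demand-driven evaluation of a = ((b + 1) * c) - (d / e) via thunks.
trace = []

def thunk(label, fn):
    """Wrap an operation so we can record when it is actually demanded."""
    def force():
        trace.append(label)
        return fn()
    return force

b, c, d, e = 2, 3, 8, 4            # assumed input values
plus  = thunk("+", lambda: b + 1)
div   = thunk("/", lambda: d / e)
times = thunk("*", lambda: plus() * c)
minus = thunk("-", lambda: times() - div())

a = minus()       # demanding a triggers the whole computation top-down
print(a, trace)   # → 7.0 ['-', '*', '+', '/']
```

The trace shows the outermost "−" firing first, matching the top-down order above (Python's left-to-right evaluation demands x before /).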
Reduction Machine Models
• String reduction:
– Each demander gets a separate copy of the expression for its
own evaluation
– A long string expression is reduced to a single value in a
recursive fashion
– Each reduction step has an operator followed by an embedded
reference to demand the corresponding input operands
– The operator is suspended while its input arguments are being
evaluated
– An expression is said to be fully reduced when all the arguments
are replaced by equivalent values
• Graph reduction:
– The expression is represented as a directed graph
– The graph is reduced by evaluation of branches or subgraphs
– Branches or subgraphs can be reduced in parallel
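A minimal graph-reduction sketch, using the same expression and assumed inputs as above: the expression is held as a graph, and each step overwrites a reducible node (one whose children are already values) with its value. Independent subgraphs, here "+" and "/", are reducible in the same step and so could run in parallel:

```python
import operator

# expression graph for a = ((b + 1) * c) - (d / e); leaves hold values
graph = {
    "t1": ("+", "b", "one"), "t2": ("*", "t1", "c"),
    "t3": ("/", "d", "e"),   "a":  ("-", "t2", "t3"),
    "b": 2, "one": 1, "c": 3, "d": 8, "e": 4,
}

OPS = {"+": operator.add, "*": operator.mul,
       "/": operator.truediv, "-": operator.sub}

def reducible(n):
    """A node is reducible when both of its children are already values."""
    v = graph[n]
    return isinstance(v, tuple) and all(
        not isinstance(graph[x], tuple) for x in v[1:])

steps = 0
while isinstance(graph["a"], tuple):
    # all currently reducible nodes could be rewritten in parallel
    for n in [n for n in graph if isinstance(graph[n], tuple) and reducible(n)]:
        op, x, y = graph[n]
        graph[n] = OPS[op](graph[x], graph[y])  # overwrite node with its value
        steps += 1

print(graph["a"], steps)  # → 7.0 4
```

The graph is fully reduced when the root node "a" has been rewritten to a single value, after one reduction step per operator.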