
Parallel Processors

Session 4
Program Partitioning
and
Computational Granularity
Grain Size
• A grain is a program segment
• Grain size is a measure of the amount of computation
– A simple measure is the number of instructions in the grain
– Determines the basic program segments for parallel
processing
Program Partitioning
• Grain Sizes:
– Fine
– Medium
– Coarse
Granularity & Parallelism Level
Granularity depends on the level of parallel processing
• Instruction Level
– Less than 20 instructions
– Two to thousands of parallel segments
• Loop Level
– Less than 500 instructions
• Procedure Level
– Less than 2000 instructions
• Subprogram Level
– Thousands of instructions
• Job (Program) Level
– Tens of thousands of instructions
Levels of Parallelism

[Figure: the five levels of parallelism, from job level down to instruction level; the coarser levels communicate by message passing, the finer levels by shared variables]
Communication Latency

• A time measure of the communication overhead between machine subsystems:
– Memory latency:
• Time needed to access memory
– Synchronization latency:
• Time needed to synchronize two processors
– In both cases a limiting factor in terms of scalability
• Memory size
• Number of processors
• Latency depends on:
– Machine Architecture
– Implementing Technology
– Communication Pattern
Granularity & Latency

• Computational granularity and communication latency are closely related
• For good performance, granularity and latency must be balanced
• Techniques for
– Hiding latencies
– Minimizing communication latencies
– Optimizing grain size
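One illustrative way to see the balance (a sketch, not a formula from the slides): when the communication cost per grain is roughly fixed, the fraction of a grain's lifetime spent computing grows with grain size.

```python
def grain_efficiency(computation_cycles: int, communication_cycles: int) -> float:
    """Fraction of a grain's total time spent on useful computation."""
    return computation_cycles / (computation_cycles + communication_cycles)

# With a fixed per-grain communication cost, a coarser grain is more
# efficient -- but it also exposes less parallelism, hence the need
# for balance. (Cycle counts below are invented for illustration.)
fine = grain_efficiency(20, 200)     # small grain: 20 / 220
coarse = grain_efficiency(400, 200)  # large grain: 400 / 600
```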
Optimal Grain Size

• How can we partition a program for the shortest possible execution time?
• What is the optimal grain size?
– Number of grains
– Size of grains
– Grain scheduling
• Computation time
• Communication time
Grain Packing and Scheduling

• A method for program partitioning and scheduling
• The objective is to produce a short
schedule for fast execution of
subdivided program modules
• A program graph is used for grain
packing
Program Graph

• Shows the structure of the program
• Similar to a dependence graph
• Each node is a computational unit
• Grain size is measured by number of
basic machine cycles (processor and
memory)
Program Graph

• Nodes (n,s)
– n: node name or id
– s: the grain size of the node
• Edges (v,d)
– v: the output value from the source node which is
the input value to the destination node
– d: communication delay between the source node
and the destination node
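The (n, s) / (v, d) notation above can be sketched as a small data structure (the class and method names here are illustrative, not from the lecture):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str    # n: node name or id
    size: int    # s: grain size, in basic machine cycles

@dataclass
class Edge:
    src: str     # source node
    dst: str     # destination node
    value: str   # v: output of src that is an input of dst
    delay: int   # d: communication delay, in cycles

@dataclass
class ProgramGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name: str, size: int) -> None:
        self.nodes[name] = Node(name, size)

    def add_edge(self, src: str, dst: str, value: str, delay: int) -> None:
        self.edges.append(Edge(src, dst, value, delay))

g = ProgramGraph()
g.add_node("1", 4)              # e.g. node 1 with a grain size of 4 cycles
g.add_node("7", 6)
g.add_edge("1", "7", "a", 212)  # value a travels with a 212-cycle delay
```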
Example
Var a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q
Begin
1. a = 1
2. b = 2
3. c = 3
4. d = 4
5. e = 5
6. f = 6
7. g = a x b
8. h = c x d
9. i = d x e
10. j = e x f
11. k = d x f
12. l = j x k
13. m = 4 x l
14. n = 3 x m
15. o = n x i
16. p = o x h
17. q = p x q
End
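The fine-grain program graph for this example can be derived mechanically: each statement is a node, and an edge runs from the node that defines a variable to every node that reads it. A sketch (the self-reference of q in statement 17, as written on the slide, is skipped):

```python
# statement number -> (variable defined, variables read)
stmts = {
    1: ("a", []), 2: ("b", []), 3: ("c", []),
    4: ("d", []), 5: ("e", []), 6: ("f", []),
    7:  ("g", ["a", "b"]),  8:  ("h", ["c", "d"]),
    9:  ("i", ["d", "e"]),  10: ("j", ["e", "f"]),
    11: ("k", ["d", "f"]),  12: ("l", ["j", "k"]),
    13: ("m", ["l"]),       14: ("n", ["m"]),
    15: ("o", ["n", "i"]),  16: ("p", ["o", "h"]),
    17: ("q", ["p", "q"]),
}
defs = {var: nid for nid, (var, _) in stmts.items()}
edges = [
    (defs[r], nid)
    for nid, (_, reads) in stmts.items()
    for r in reads
    if defs[r] != nid      # skip the q self-reference in statement 17
]
```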
Grain Packing
• Apply fine grain first
– Higher degree of parallelism
• Combine multiple fine-grain nodes into a coarse-grain node if it:
– Reduces communication latency
– Reduces scheduling overhead
• Fine-grain operations within a coarse-grain node
are assigned to the same processor
– Reduces interprocessor communication
– Reduces parallelism
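A sketch of the packing step itself, under the rules above: the sizes of packed nodes add, and edges internal to the pack lose their interprocessor delay (all names and numbers here are illustrative):

```python
def pack(nodes: dict, edges: list, group: set, new_name: str):
    """Merge the nodes in `group` into one coarse grain.

    nodes: {name: grain size in cycles}; edges: [(src, dst, delay)].
    Internal edges vanish: fine grains inside one coarse grain run on
    the same processor, so they pay no interprocessor delay.
    """
    merged = dict(nodes)
    merged[new_name] = sum(merged.pop(n) for n in group)
    kept = []
    for src, dst, delay in edges:
        s = new_name if src in group else src
        d = new_name if dst in group else dst
        if s != d:                  # drop edges internal to the pack
            kept.append((s, d, delay))
    return merged, kept

nodes, edges = pack(
    {"A": 4, "B": 4, "C": 6},
    [("A", "B", 212), ("B", "C", 212)],
    {"A", "B"},
    "AB",
)
# nodes == {"C": 6, "AB": 8}; only the AB -> C edge (delay 212) remains
```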
Grain Scheduling

[Figure: grain scheduling timeline, showing processor idle time and communication delay between grains]

Optimal grain size → shortest schedule


Summary
Four steps:
1. Construct a fine-grain program graph
2. Schedule the fine-grain computation
3. Perform grain packing to produce the
coarse grains
4. Generate a parallel schedule based on
the packed graph
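Steps 2 and 4 both need a scheduler. A minimal greedy earliest-start list scheduler (an illustrative sketch, not the algorithm from the lecture) might look like:

```python
from collections import defaultdict

def list_schedule(nodes, edges, num_procs):
    """Greedy earliest-start schedule.

    nodes: {name: size in cycles}; edges: [(src, dst, delay)].
    An edge's delay is paid only when src and dst run on different
    processors. Returns (finish_times, placement).
    """
    preds, succs = defaultdict(list), defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for u, v, d in edges:
        preds[v].append((u, d))
        succs[u].append(v)
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    proc_free = [0] * num_procs
    finish, place = {}, {}
    while ready:
        n = ready.pop(0)
        best_start, best_proc = None, None
        for p in range(num_procs):
            # data from a predecessor on another processor arrives late
            arrival = max(
                (finish[u] + (0 if place[u] == p else d)
                 for u, d in preds[n]),
                default=0,
            )
            start = max(proc_free[p], arrival)
            if best_start is None or start < best_start:
                best_start, best_proc = start, p
        finish[n] = best_start + nodes[n]
        place[n] = best_proc
        proc_free[best_proc] = finish[n]
        for v in succs[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return finish, place
```

When the communication delay on a chain exceeds the wait for the producer's processor, the scheduler keeps both nodes on one processor; independent nodes spread across processors.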
Static Scheduling
• Grain packing may not always produce a shorter schedule
• Static scheduling mechanisms can be used to shorten the schedule
• Node duplication is a method that can reduce interprocessor communication
A Packed Schedule

• Contains idle times and long interprocessor communication delays
Node Duplication

• Extra processing during otherwise idle processor time eliminates long communication delays
Comparison

• The schedule is almost 50% shorter
• Grain packing and node duplication can be used together to shorten the schedule
Example
Grain Size Calculation
Fine Grain Program Graph
Calculation of Communication
Delay

d = T1 + T2 + T3 + T4 + T5 + T6
d = 20 + 20 + 32 + 20 + 20 + 100
d = 212 cycles

T3: 32-bit transfer over a 20 Mbps link
T6: software protocol overhead (assume 5 move instructions)
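The arithmetic on this slide, reproduced directly:

```python
# T1..T6 as given on the slide; T3 is the 32-bit transfer on a
# 20 Mbps link, T6 the software protocol overhead (5 move instructions).
T = [20, 20, 32, 20, 20, 100]
d = sum(T)  # 212 cycles
```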
Sequential Schedule
Parallel Schedule
Speedup

Speedup Factor = 864 / 741 = 1.16


Grain Packing
Schedule for Packed Program

Speedup Factor = 864 / 446 = 1.94
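Both speedup figures check out against the slide's cycle counts:

```python
sequential = 864         # cycles for the sequential schedule
parallel_fine = 741      # parallel schedule, fine-grain graph
parallel_packed = 446    # parallel schedule after grain packing

speedup_fine = sequential / parallel_fine      # ~1.16
speedup_packed = sequential / parallel_packed  # ~1.94
```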


Data Flow Computers
Program Flow Mechanisms
• Control flow mechanism
– Conventional computers
• Program counter sequences the execution of the instructions
– Sequential in nature
• Data flow mechanism
– Data driven
• The execution of instructions is driven by data availability
• Reduction mechanism
– Demand driven
• An instruction is executed based on the demand for its
results
Data Flow Computers
• The instructions are not ordered
– No program counter or control sequencer
• An instruction is ready to execute as soon as its operands (data) are ready
– Natural capability for parallelism
– Fine-grain parallelism at the instruction level
• Data is not stored in shared memory
• Requires mechanisms to detect data
availability and to match it with instructions
A Dataflow Architecture
• Several interconnected processing
elements
• A tagged token architecture
– Each datum (token) is tagged with a reference to the instruction that needs it
– Instructions are stored in program memory
– Token-matching mechanism dispatches
instructions whose data is ready
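A toy token-matching loop conveys the mechanism (purely illustrative; the tags, instructions, and values below are invented):

```python
from collections import defaultdict

# Each instruction fires as soon as all of its tagged operand
# tokens have arrived in the matching store.
program = {
    # tag: (operation, arity, destination tags for the result)
    "mul1": (lambda x, y: x * y, 2, ["add1"]),
    "add1": (lambda x, y: x + y, 2, []),
}

def run(initial_tokens):
    waiting = defaultdict(list)    # matching store: tag -> operands so far
    tokens = list(initial_tokens)  # (tag, value) pairs in flight
    results = {}
    while tokens:
        tag, value = tokens.pop(0)
        waiting[tag].append(value)
        op, arity, dests = program[tag]
        if len(waiting[tag]) == arity:   # all operands matched: fire
            out = op(*waiting.pop(tag))
            results[tag] = out
            tokens += [(d, out) for d in dests]
    return results

# run([("mul1", 3), ("mul1", 4), ("add1", 5)]) computes 3*4, then 12+5
```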
The Global Architecture

Tagged data tokens are passed to PEs through the routing network
Inside a Processing Element
Example

• A sample program and its dataflow graph
• A dataflow graph is similar to a dependence graph or program graph
• Shows data tokens passed along the edges of the graph
Sequential Execution

Assume:
add: 1 cycle
multiply: 2 cycles
divide: 3 cycles
Data-driven Execution
Parallel Execution in Shared
Memory Multiprocessor

s & t: intermediate values passed through shared memory; not needed in the dataflow case
Demand-Driven Mechanisms
• Computation is triggered by the demand for an
operation’s result
• Known as “reduction machine”
• Example:
a = ((b + 1) x c – (d / e))
– In the data-driven approach, the operations start from the innermost operations (bottom-up):
• First + and /
• Then x
• And finally –
– In the demand-driven approach, the operations start from the outermost operations (top-down):
• First –
• Then x and /
• And finally +
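The two evaluation orders can be demonstrated on this expression (a sketch; the variable values in `env` are invented for illustration):

```python
# Expression tree for a = ((b + 1) x c - (d / e)) as nested tuples:
expr = ("-", ("*", ("+", "b", 1), "c"), ("/", "d", "e"))
env = {"b": 3, "c": 2, "d": 8, "e": 4}
ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def eval_demand(node, order):
    """Demand-driven (top-down): the demand for an operator's result is
    recorded before its operands are evaluated."""
    if not isinstance(node, tuple):
        return env.get(node, node)
    op, l, r = node
    order.append(op)                 # demanded first
    return ops[op](eval_demand(l, order), eval_demand(r, order))

def eval_data(node, order):
    """Data-driven (bottom-up): an operator fires only after its operands
    are available. (This traversal is sequential; on a dataflow machine
    + and / would fire in parallel.)"""
    if not isinstance(node, tuple):
        return env.get(node, node)
    op, l, r = node
    a, b = eval_data(l, order), eval_data(r, order)
    order.append(op)                 # fires last
    return ops[op](a, b)

demanded, fired = [], []
eval_demand(expr, demanded)   # outermost "-" is demanded first
eval_data(expr, fired)        # innermost "+" fires first, "-" last
```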
Reduction Machine Models
• String reduction:
– Each demander gets a separate copy of the expression for its
own evaluation
– A long string expression is reduced to a single value in a
recursive fashion
– Each reduction step has an operator followed by an embedded
reference to demand the corresponding input operands
– The operator is suspended while its input arguments are being
evaluated
– An expression is said to be fully reduced when all the arguments
are replaced by equivalent values
• Graph reduction:
– The expression is represented as a directed graph
– The graph is reduced by evaluation of branches or subgraphs
– Branches or subgraphs can be reduced in parallel
