
Parallel Computing 32 (2006) 759–774

www.elsevier.com/locate/parco

An improved two-step algorithm for task and data parallel scheduling in distributed memory machines

Savina Bansal a,*, Padam Kumar b, Kuldip Singh b

a Department of Electronics Engineering, GZS College of Engineering and Technology, Bathinda, Punjab, India
b Department of E&CE, Indian Institute of Technology Roorkee, Roorkee, Uttranchal, India

Received 7 January 2006; received in revised form 4 July 2006; accepted 29 August 2006
Available online 19 October 2006

Abstract

Scheduling of most parallel scientific applications demands simultaneous exploitation of task and data parallelism for efficient and effective utilization of system and other resources. Traditional optimization techniques, like optimal control-theoretic approaches, convex programming, and bin-packing, have been suggested in the literature for dealing with the most critical processor allocation phase. However, their application to real world problems is not straightforward, which moves the solutions away from optimality. Heuristic based approaches, in contrast, work in the integer domain for the number of processors throughout, and perform appreciably well. A two-step Modified Critical Path and Area-based (MCPA) scheduling heuristic is developed which aims at improving the processor allocation phase of an existing Critical Path and Area-based (CPA) scheduling algorithm. The strength of the suggested algorithm lies in bridging the gap between the processor allocation and task assignment phases of scheduling. It helps in making better processor allocations for data parallel tasks without sacrificing the essential task parallelism available in the application program. Performance of the MCPA algorithm, in terms of normalized schedule length and speedup, is evaluated for random and real application task graph suites. It turns out to be much better than the parent CPA algorithm and comparable to the high complexity Critical Path Reduction (CPR) algorithm.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Mixed parallelism; Scheduling algorithm; Distributed systems; Data parallelism; Parallelizable tasks

1. Introduction

Scheduling is one of the most vital design issues in parallel and distributed computing systems, affecting the overall execution time of an application. Simultaneous exploitation of task (or function) and data parallelism is a relatively new trend steadily taking shape in high performance parallel computing. Merits of this approach have been emphasized for solving many scientific and engineering applications that involve extensive

* Corresponding author. Tel.: +91 164 2281954.
E-mail addresses: savina_bansal@indiatimes.com (S. Bansal), padam_kumar@hotmail.com (P. Kumar), kds56fec@iitr.ernet.in (K. Singh).

0167-8191/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.parco.2006.08.004

array handling and manipulations such as image processing, computer vision, climate modeling and many
other numerical applications [3,19,20,23]. Potential benefits of this approach have fascinated many researchers, who have worked on its related issues over the last decade. Most of the research efforts are, however, aimed at providing software based compatibility [5,7,10,19,20], leading towards the integration of task or data parallel directives into existing pure data and task parallel languages and compilers, as in High Performance Fortran [7,19], Fortran-M [5], the Fx compiler [22,24], PARADIGM [19], and Orca [10]. However, along with the software and system based support, development of high performance scheduling heuristics is equally important to extract maximum benefits in a mixed parallelism environment.
Mixed parallelism relates to the presence of task as well as data parallelism in a parallel program. The existence of task parallelism allows multiple tasks to be executed concurrently (MPMD style), whereas data parallelism permits individual tasks to be executed in a data parallel manner (SPMD style) on disjoint sets of processors. A program exhibiting mixed parallelism, thus, executes in a Multiple-SPMD style. Parallel application programs are generally represented by macro dataflow graphs (MDG), with vertices (ti) representing coarse grain tasks and edges indicating control/data dependencies among them, as shown in Fig. 1. The precedence level (Prec_level) of a task ti is a (a ≥ 0) if all of its predecessors are at Prec_level < a and at least one of its predecessors is at Prec_level a − 1. Tasks at a given Prec_level have no precedence constraints among them and hence can be executed concurrently.
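Under this definition, precedence levels follow directly from the predecessor sets by memoized recursion. The Python sketch below illustrates this; the predecessor map is a hypothetical edge set in the spirit of Fig. 1, since the exact edges are given only pictorially:

def prec_levels(preds):
    # preds maps each task to the set of its immediate predecessors.
    # A task with no predecessors is at level 0; otherwise its level is
    # one more than the maximum level among its predecessors.
    level = {}
    def lvl(t):
        if t not in level:
            level[t] = 0 if not preds[t] else 1 + max(lvl(u) for u in preds[t])
        return level[t]
    for t in preds:
        lvl(t)
    return level

# Hypothetical predecessor map for the MDG of Fig. 1
preds = {'t1': set(), 't2': {'t1'}, 't3': {'t1'}, 't4': {'t1'}, 't5': {'t1'},
         't6': {'t2', 't3'}, 't7': {'t4', 't5'}, 't8': {'t6', 't7'}}
print(prec_levels(preds))   # t1: 0, t2..t5: 1, t6, t7: 2, t8: 3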
In pure task-scheduling problems [12,21], every node (also termed an S-task) is to be executed on a single processor only, whereas in mixed parallel scheduling [1,2,4,15–17], a node (also termed an M-task) can be executed on multiple processors in a data parallel manner. An MDG, in mixed parallelism scheduling, may comprise only M-tasks or a mix of S-tasks and M-tasks. Scheduling of such task graphs is termed M-task scheduling, which is a well-known NP-complete problem [6]. The objective of the problem is to allocate and assign processors to task nodes and to suggest a time order of execution so as to minimize the overall execution time (makespan (ω)) of an application. Keeping in view the advantages of mixed parallel scheduling, in a more recent work [25], an attempt is made to extend existing task parallel scheduling algorithms to the mixed parallel scheduling environment.
M-task scheduling is generally done using a two-step or a one-step approach. In the former, the scheduling problem is divided into two simpler sub-problems so as to save on the complexity front: (i) processor allocation (for the M-tasks); and (ii) task assignment (scheduling of tasks on the allocated numbers of processors). In the latter approach, processor allocation and task assignment are done side-by-side, as in [16], where the two phases work in complete unison and each and every processor allocation is confirmed only after ensuring its profitability on the makespan through actual task assignments. This results in better performance, though at the cost of increased complexity (O(ev²p + v³p log v + v³p² log p)), where e and v are the number of edges and vertices in the task graph, and p is the total number of processors in the system.
For mixed parallel scheduling algorithms, optimum exploitation of data and task parallelism is quite essential, and task parallelism cannot be ignored while dealing with data parallel M-tasks. In the available two-step algorithms, such as the Two Step Allocation and Scheduling algorithm (TSAS) [18] and the Critical Path and Area-based algorithm (CPA) [15] (discussed in Section 3), the processor allocation phase works more or less in isolation from the task assignment phase. It concentrates so much on optimizing data parallelism that the possibility of exploiting task parallelism gets sidelined at times.

[Figure: the MDG has t1 at Prec_level = 0; t2, t3, t4, t5 at Prec_level = 1; t6, t7 at Prec_level = 2; t8 at Prec_level = 3.]
Fig. 1. A typical MDG for the matrix multiplication application.

As a result, an algorithm may lose its track, in spite of making a perfect start with an optimal processor allocation. We support this conjecture in the following sections; it motivated us to develop the Modified Critical Path and Area-based algorithm (MCPA), which succeeds in bridging this gap to an appreciable extent. In the MCPA algorithm, essential task parallelism is preserved, at the cost of data parallelism sometimes, if it is crucial for the overall performance of the algorithm, and the low-complexity feature of two-step algorithms is also retained. The paper is organized as follows: the following section formulates the problem and describes some of the significant terms and expressions used in the work. Section 3 discusses related work and the motivation behind the proposed algorithm, which is expounded in the section following it. Sections 5 and 6 present the experimental set-up and performance results in comparison to state-of-the-art M-task scheduling algorithms. Section 7 discusses and concludes the work.

2. Problem formulation

The basic application or task model is represented by a quadruple Q(T, R, [cij], [w(i, ni)]), where T = {ti : i = 1, 2, . . . , v} is a set of v tasks/nodes; R represents a relation that defines a partial order on the task set such that if ti R tj then task ti must finish before tj can start execution; [cij] is a v × v matrix giving the communication cost (depending on network characteristics and volume of data involved) between tasks ti and tj; and [w(i, ni)], for 1 ≤ ni ≤ p, is a v × p matrix (p being the total number of processors in the system) which represents the execution cost of each task (ti) for different numbers of processors allocated (ni) to it (Table 1). It may be seen that the speedup improvement, with increasing number of processors allocated to a node, is not linear but convex due to the presence of various overheads and the sequential part of the computation (w_i^s). Further, communication cost is not only a function of the dataflow involved but also of data distribution overheads; the latter is, however, taken as part of the computation cost of the concerned nodes, as elaborated in [19].
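For reference in the sketches that follow, the quadruple can be held in a small container; the field names here are illustrative, not from the paper:

from dataclasses import dataclass

@dataclass
class MDG:
    tasks: list    # T = {t_i : i = 1..v}
    preds: dict    # R as predecessor sets: preds[i] = {j : t_j R t_i}
    comm: dict     # [c_ij]: comm[(i, j)] = communication cost on edge t_i -> t_j
    cost: dict     # [w(i, n_i)]: cost[i][n] for 1 <= n <= p
    p: int         # total number of processors in the system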
An MDG (Fig. 1) can better represent the application model, with vertices representing coarse grain macro nodes that can be executed in a data parallel manner on disjoint sets of processors, and edges the dataflow paths signifying data/control dependencies among the nodes. Many researchers have used such a task model, and its practicability is substantiated in [19], which deals with automatic extraction of macro dataflow graphs from an extended HPF or MATLAB program. The techniques for estimating computation and communication costs at compile-time are dealt with in [8,9] and have been implemented as part of the PARADIGM compiler. Communication costs are assumed non-negligible, even if two macro nodes are scheduled on the same set of processors, due to the data redistribution costs that might be involved. All the tasks are preceded and succeeded by a source and a sink node, which are responsible for distributing and collecting data at the beginning and at the end of the program, respectively. Communication costs from these nodes are also considered, due to the possible data distribution or redistribution costs involved in the mixed parallel scenario. Communication links are assumed to be contention free.
Computation costs for the data parallel nodes can be obtained through estimation or profiling. Amdahl's law provides good cost estimates for a node for different numbers of processors allocated to it. In the cost profiling method, costs are obtained by actually measuring them as a function of the number of processors in sample program runs, and then using a linear regression method to fit these values to a function of the form given by Eq. (1).
Table 1
Computation cost matrix w(i, ni) for the MDG shown in Fig. 1

ti                            ni = 1   2     3     4    5    6    7     8
t2 (w^s = 20, w^p = 240)      260      140   100   80   68   60   54.3  50
t3 (w^s = 20, w^p = 240)      260      140   100   80   68   60   54.3  50
t4 (w^s = 20, w^p = 240)      260      140   100   80   68   60   54.3  50
t5 (w^s = 20, w^p = 240)      260      140   100   80   68   60   54.3  50
t6 (S-task)                   7        7     7     7    7    7    7     7
t7 (S-task)                   7        7     7     7    7    7    7     7

For a task ti executing on ni processors, the execution cost w(i, ni), using Amdahl's law, is given by

    w(i, ni) = (αi + (1 − αi)/ni) · w(i, 1) = w_i^s + w_i^p/ni,   for 0 ≤ αi ≤ 1        (1)

where w_i^s/w_i^p represent the sequential/parallel computation cost of an M-task and αi is the fraction of the data parallel computation that executes serially.
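A direct transcription of Eq. (1) reproduces the entries of Table 1; the function below is a minimal sketch assuming only the cost model of Eq. (1):

def w(w1, alpha, n):
    # Eq. (1): w(i, n) = (alpha + (1 - alpha)/n) * w(i, 1) = w_s + w_p/n,
    # with w_s = alpha * w(i, 1) and w_p = (1 - alpha) * w(i, 1).
    return (alpha + (1.0 - alpha) / n) * w1

# Tasks t2..t5 of Table 1 have w(i, 1) = 260 and w_s = 20, i.e. alpha = 20/260
for n in (1, 2, 4, 8):
    print(n, round(w(260, 20 / 260, n), 1))   # -> 260.0, 140.0, 80.0, 50.0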
The target machine P = {pk : k = 1, 2, . . . , p} is assumed to comprise p fully connected processors. The processors executing a single M-task may communicate among themselves to exchange data values using a predefined set of communication primitives [20]. All the subtasks corresponding to a single M-task (ti) are computed in unison on the assigned set of processors (denoted Pi) due to these internal communications. Some of the terms that will be referred to in this work are described in Fig. 2.
A node ti cannot start execution until data arrives from all of its predecessors, so as to honor precedence constraints. The data arrival time (DAT) for ti is given by

    DAT(ti) = max_{tj ∈ pred(i)} {Fj + cji},                                            (2)

where pred(i) represents the set of predecessor tasks of ti.
The start time of ti (Si) is limited by its data arrival time and the ready times of the processors allocated to it, as follows:

    Si = max_{pk ∈ Pi} {DAT(ti), pk^R}                                                  (3)

where the term pk^R = max_{tx ∈ T, pk ∈ Px} {Fx} represents the ready time of processor pk, i.e. the finish time of the last task (tx) scheduled on it, and Px corresponds to the set of processors allocated to the task tx. The finish time of the task ti (Fi) is given by

    Fi = Si + w(i, ni)                                                                  (4)
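Eqs. (2)–(4) translate almost one-to-one into code. The helpers below are a sketch, with F holding the finish times of already scheduled tasks and ready holding each processor's ready time pk^R:

def data_arrival(i, preds, comm, F):
    # Eq. (2): DAT(t_i) = max over t_j in pred(i) of F_j + c_ji
    return max((F[j] + comm[(j, i)] for j in preds[i]), default=0.0)

def start_time(dat, procs_i, ready):
    # Eq. (3): S_i = max over allocated processors p_k of {DAT(t_i), p_k^R}
    return max([dat] + [ready[k] for k in procs_i])

def finish_time(s_i, w_i_ni):
    # Eq. (4): F_i = S_i + w(i, n_i)
    return s_i + w_i_ni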
The critical path length Lcp (Fig. 2) gives a theoretical lower bound on the schedule length generated and can be reduced by allocating more processors to data parallel tasks. In addition, for M-task scheduling, another parameter that limits schedule length is the average processor computational area Ap, which represents the processor-time area required by the MDG. It is a measure of processor utilization and is given by

    Ap = (1/p) Σ_{ti ∈ T} w(i, ni) · ni                                                 (5)

Allocation of more processors to the critical path nodes tends to reduce Lcp, whereas Ap begins to increase with the allocation of more processors to the data parallel tasks. As a consequence, the theoretical lower bound for the M-task scheduling algorithms is given by

Bottom_level_cost: For a task ti, its bottom_level_cost (bi) is defined as the sum of computation and communication costs along the longest directed path from the concerned node to the sink node.
Top_level_cost: For a task ti, its top_level_cost (τi) is defined as the sum of computation and communication costs along the longest directed path from the entry node to the concerned node (excluding its computation cost).
Critical Path (CP): The critical path is the longest directed path in the MDG, taking into consideration the computation and communication costs. The critical path length (Lcp) puts a worst-case lower bound on the makespan, and is defined as Lcp = max_{ti ∈ T} {bi + τi}. It may vary during the course of scheduling due to the involvement of communication costs. A node lying on the critical path is termed a critical path node.

Fig. 2. Some frequently used terms in scheduling.
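The Fig. 2 quantities can be computed by two memoized traversals of the MDG. A sketch under the stated cost model, where succs is the successor map and wcur[i] is the current computation cost of ti:

def levels(preds, succs, comm, wcur):
    b, tau = {}, {}
    def bl(i):   # bottom_level_cost: includes the node's own computation cost
        if i not in b:
            b[i] = wcur[i] + max((comm[(i, j)] + bl(j) for j in succs[i]), default=0.0)
        return b[i]
    def tl(i):   # top_level_cost: excludes the node's own computation cost
        if i not in tau:
            tau[i] = max((tl(j) + wcur[j] + comm[(j, i)] for j in preds[i]), default=0.0)
        return tau[i]
    for i in wcur:
        bl(i); tl(i)
    lcp = max(b[i] + tau[i] for i in wcur)   # L_cp = max{b_i + tau_i}
    return b, tau, lcp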



    θ = max{Lcp, Ap}                                                                    (6)

which needs to be minimized. Scheduling algorithms suggested from time to time try to work out an optimal processor allocation so as to minimize θ (Eq. (6)) within these two conflicting constraints. The objective function of M-task scheduling is to allocate and assign processors to the nodes and to suggest a time order of execution on them, so that the overall execution time (ω) of the application, which corresponds to the maximum finish time of a task in the scheduled DAG, is minimized:

    ω = max_{ti ∈ T} {Fi}                                                               (7)

3. Related work and motivation

Scheduling with mixed task and data parallelism is a relatively new trend in scheduling, with most of the work focused around specific task graph topologies like series-parallel, pipelined, divide and conquer, and tree [14,20,23]. In one of the earlier works [2], an approximation algorithm for scheduling parallelizable independent tasks was suggested, with a performance bound of ω ≤ (2/(1 + 1/p)) ωopt. The work was later extended [1] to schedule dependent tasks using the two-step approach. Another approximation algorithm for scheduling independent M-tasks was developed in [26], which guarantees the makespan to be within twice the optimal. For the special case of tree structured precedence constrained M-tasks, there exists an approximation algorithm [13] with a performance guarantee of (3 + √5)/2.
Communication between the M-tasks is, however, neglected, as the granularity in M-task scheduling is generally large. Among the more general approaches that deal with communication costs and arbitrary task graph precedence, a Two Step Allocation and Scheduling (TSAS) algorithm, based upon the two-step approach used in [1,2,26], was suggested in [18]. In the first step, it employs a Convex Programming Allocation Algorithm (CPAA), which minimizes θ (Eq. (6)) to obtain an optimal processor allocation (in the real number domain) for the data parallel tasks. In the second step, a Prioritized Scheduling Algorithm (PSA) is used to list-schedule the nodes on the allocated numbers of processors that become available at the earliest. However, the real number solution to the processor allocation problem forced the authors [18] to round off the numbers to near integer values, moving the solutions away from optimality. The time complexity of the TSAS algorithm is O(v^2.5 + vp log p).
The Critical Path Reduction (CPR) algorithm [16], in contrast, is a one-step greedy iterative algorithm that deals with integral numbers of processors from the very beginning. It starts with a single processor allocation to every task (ni = 1, ∀ 1 ≤ i ≤ v), and computes the makespan using the traditional list-scheduling approach that assigns the highest priority (top_level + bottom_level) ready node (node with satisfied precedence constraints) to the first available processors in the system. Next, it increments the processor allocation for the most crucial critical path node (the one that benefits the most from the processor increment) and computes the resulting improvement in the makespan generated. An increment is accepted if it improves the previously calculated makespan, and is discarded otherwise. The CPR algorithm outperforms many other algorithms [15,18,20] due to its one-step approach, which offers better decision-making opportunity, since exact information about the makespan is available at hand while doing processor allocations. However, recomputation of the makespan at every step makes the algorithm quite complex (O(ev²p + v³p log v + v³p² log p)).
To save on the complexity front, the Critical Path and Area-based (CPA) algorithm [15] (pseudo code given in Fig. 3) employed the two-step approach of the TSAS algorithm. However, instead of using the complex convex programming approach, the greedy processor allocation heuristic of the CPR algorithm is used. In the processor allocation phase, initially every task is allocated one processor. After that, the critical path task returning maximum benefits of data parallelism, as reflected by the higher gain parameter G in Fig. 3, is selected for a higher processor allocation within the constraints imposed by the theoretical lower bound θ. The gain parameter G reflects the benefit, in terms of reduction in computational cost, that can be achieved by allocating an extra processor to a data parallel task. By comparing the gain parameters of two different M-tasks, the algorithm finds the most appropriate node to which the benefit of more processor allocation should be extended. It is worth mentioning that, due to the serial computation cost (w_i^s) present in all data parallel tasks, every additional processor does not return the same benefit; rather, the benefit keeps diminishing due to the convex speedup.
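The gain parameter itself is a one-liner; here w(i, n) is any cost function of the form of Eq. (1):

def gain(w, i, n):
    # G = w(i, n)/n - w(i, n+1)/(n+1): reduction in per-processor work
    # obtained by granting one more processor; diminishes as n grows
    # because of the serial part w_s.
    return w(i, n) / n - w(i, n + 1) / (n + 1)

For t2 of Table 1, G falls from 190 at ni = 1 to about 36.7 at ni = 2 and 13.3 at ni = 3, illustrating the diminishing returns.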

CPA ()
Input: Task Graph model (T), Processor Graph model (P)
Output: Mapping and assignment of tasks on processors
Begin
Obtain processor allocation using Proc_alloc ( ); // phase 1
Schedule the tasks using Task_sch ( ); // phase 2
End
-------------------------------------------------------------------------------------------------------------------------
Proc_alloc()
Begin
{
for all ti ∈ T do
Step 1: ni ← 1 ; end forall
Step 2: while ( Lcp > Ap ) do
Step 3: ti ← CP task such that ni < p and the gain parameter G = w(i, ni)/ni − w(i, ni+1)/(ni+1) is maximized;
Step 4: ni ← ni + 1;
Step 5: Recompute top_level_cost and bottom_level_cost;
endwhile
}
End
-----------------------------------------------------------------------------
Task_sch( )
Begin
Sort tasks in decreasing order of bottom_level_cost priority;
while (not all tasks are scheduled) do
Schedule ti on the first ni processors becoming free;
endwhile
End

Fig. 3. Pseudo code of CPA algorithm.
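Phase 1 of Fig. 3 can be rendered in a few lines of Python. Here lcp, area and critical_tasks are assumed helpers (e.g. built from the levels() sketch above) that evaluate Lcp, Ap and the current critical-path set for a given allocation n; this is a sketch, not the authors' implementation:

def cpa_alloc(tasks, w, p, lcp, area, critical_tasks):
    n = {i: 1 for i in tasks}                       # Step 1
    while lcp(n) > area(n):                         # Step 2
        cand = [i for i in critical_tasks(n) if n[i] < p]
        if not cand:                                # guard: no CP task left to widen
            break
        best = max(cand, key=lambda i: w(i, n[i]) / n[i]
                   - w(i, n[i] + 1) / (n[i] + 1))   # Step 3: maximize gain G
        n[best] += 1                                # Step 4
        # Step 5 (recomputing levels) is hidden inside lcp()/critical_tasks()
    return n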

In the second, task scheduling, phase, the highest priority (bottom_level_cost) ready node is scheduled on its allocated number of processors that are first available. The complexity of the algorithm is O(v²p + evp), and its performance is shown to be within 50% of the CPR algorithm for some real and synthetic application task graphs. The low-complexity feature of two-step algorithms is noteworthy.
However, we feel that the first (processor allocation) phase, in these algorithms, works more or less in isolation from the second (task scheduling) phase. For example, in Steps 3 and 4, the CPA algorithm may allocate all the available processors to a given task for better returns, as it works keeping in view the local objective function, which is to exploit maximum data parallelism. It concentrates so much on optimizing data parallelism that the possibility of exploiting task parallelism in the second phase may get sidelined. As a result, the algorithm may lose its track, in spite of making a perfect start with an accurate processor allocation phase. We defend our point in the following section; it motivated us to modify the processor allocation phase of the CPA algorithm, an algorithm chosen because of its simpler non-convex-programming approach.

3.1. Motivating example

The processor allocation phase of the CPA algorithm needs some pondering for further improvements. It exploits the maximum possible data parallelism by incrementing processor allocations for the most critical M-tasks as long as Lcp > Ap (Step 2 to Step 5 in Fig. 3). However, in the process, it overlooks the feasibility of attaining this parallelism in the second phase. We elucidate our point through the example graph (Fig. 1), which corresponds to a typical matrix multiplication problem with computation cost matrix [w(i, ni)] (for p = 8) shown in Table 1. The nodes t1 and t8 signify the source and sink nodes; nodes t2, t3, t4, and t5 correspond to the matrix multiplication operations that can be done in a data parallel manner; t6 and t7 correspond to the matrix addition/subtraction, respectively, which is done serially on a single processor.

We first take a look at the running trace of the processor allocation phase (Table 2) of the CPA algorithm. In the beginning, the critical path length Lcp and the average computational area Ap are 276 and 131, respectively. Since all four M-tasks give the same value for the gain parameter G, the algorithm selects t2 at random and increments the processor allocation for it; this increases the average computational area to 133.5 whereas Lcp remains unchanged, as critical paths other than {t1, t2, t6, t8} are still present. The process continues until all four M-tasks have been allocated two processors each. The critical path length then reduces to 156 and the average computational area jumps to 141. The process repeats further, till all four M-tasks get allocated three processors each. Subsequently, no more processor allocation is permitted, as Ap (=151) exceeds Lcp (=116). The final processor allocation is shown in Table 3.
It may be seen that the total number of processors allocated at prec_level = 1 (Fig. 1) is 12, which is four more than the available number of processors (p = 8) in the system. As a result, in the second phase of the algorithm (task_sch( )), tasks t2 and t3 will consume the three processors each that are first available, and tasks t4 and t5, though they could start concurrently with these tasks (at the same prec_level), will have to wait for these processors to become free, as shown in the schedule of Fig. 4 (ω = 216). The width of each task block in Fig. 4 represents the number of processors allocated to it, and the height represents the execution time. It shows that the CPA algorithm is unable to exploit enough task parallelism, even though it is available in the graph. Consequently, the benefits of exploiting more data parallelism (by allocating more processors to the tasks) in the first phase get nullified in the second phase due to the non-availability of processors. This stimulates the need for an algorithm that allocates processors keeping a close watch on the crucial task parallelism and the available resources, and still maintains the low complexity feature of two-phase algorithms.

4. MCPA algorithm

A two-step Modified Critical Path and Area-based algorithm, MCPA (pseudo code shown in Fig. 5), is developed for scheduling arbitrary task graphs composed of parallelizable M-tasks. It modifies the processor allocation phase of the CPA algorithm and tends to bridge the gap between the two phases for better schedule lengths, especially with increasing numbers of processors available in the system.

4.1. Processor allocation phase

This phase starts by allocating a single processor to every task in the DAG and marking all tasks unvisited initially. As long as Lcp > Ap, the computation cost of the critical path nodes influences the makespan. Consequently, the algorithm tries to speed up the execution of these nodes by allocating a higher number of processors to them. However, unlike the CPA algorithm, a task ti is considered 'suitable' for allocating more processors if and only if it satisfies the following two conditions:

Table 2
Running trace of processor allocation phase of CPA algorithm

Step  Lcp  Ap     ni allocated (before)     Maximum gain      ni allocated (after)
                  t1 t2 t3 t4 t5 t6 t7 t8   node selected     t1 t2 t3 t4 t5 t6 t7 t8
1     276  131    1  1  1  1  1  1  1  1    t2                1  2  1  1  1  1  1  1
2     276  133.5  1  2  1  1  1  1  1  1    t3                1  2  2  1  1  1  1  1
3     276  136    1  2  2  1  1  1  1  1    t4                1  2  2  2  1  1  1  1
4     276  138.5  1  2  2  2  1  1  1  1    t5                1  2  2  2  2  1  1  1
5     156  141    1  2  2  2  2  1  1  1    t2                1  3  2  2  2  1  1  1
6     156  143.5  1  3  2  2  2  1  1  1    t3                1  3  3  2  2  1  1  1
7     156  146    1  3  3  2  2  1  1  1    t4                1  3  3  3  2  1  1  1
8     156  148.5  1  3  3  3  2  1  1  1    t5                1  3  3  3  3  1  1  1
9     116  151    Not allowed

Table 3
Processor allocation with p = 8 for Fig. 1

Algorithm    t1  t2  t3  t4  t5  t6  t7  t8
CPA          1   3   3   3   3   1   1   1
MCPA         1   2   2   2   2   1   1   1

Fig. 4. Schedule generated by CPA algorithm.

(1) It is a critical path task, i.e. bi + τi = Lcp.
(2) The number of processors already allocated to the critical path tasks (if any) at its precedence level ℓi is less than p, the number of processors available in the system.

The second condition is required, as there could be more than one critical path node at the same prec_level, as was the case with the problem discussed above. Concurrent execution of these nodes becomes quite essential for the success of an algorithm. These nodes, thus, indicate the crucial task parallelism available in the task graph, which must be retained even at the cost of sacrificing some data parallelism. The CPA algorithm overlooks this observation and keeps on allocating processors, only to gain much less (due to the diminishing returns of higher processor allocations). The MCPA algorithm, in contrast, works towards retaining this crucial task parallelism by keeping in sight the number of available processors and those that have already been allocated to the critical path tasks at the particular prec_level (Step 5 in Fig. 5). As the algorithm works with arbitrary precedence and a limited number of processors, our main concern is to provide a fair chance to all the 'critical tasks' for using these limited resources.
An optimal node, which returns the maximum benefit (as reflected by the gain parameter G) of data parallelism, is then searched for among the 'suitable' nodes. The number of processors allocated to this optimal task is then incremented by one. The top and bottom level costs of the affected tasks are then recomputed, due to the change in the computation cost of this task. The process is repeated as long as Lcp > Ap.
It may be noted that the concept of 'prec_level' (Step 5 in Fig. 5) comes into the picture only after it is found that all the tasks at a given 'prec_level' are equally 'critical' (Step 3 in Fig. 5); and since all the tasks at a given 'level' can be executed in parallel, the algorithm gives all of them an equal opportunity to utilize the available resources. In case all tasks at a given 'prec_level' are not critical (less regular graphs), the processors may not be shared equally; rather, preference shall be given to the critical nodes. The running trace of this phase for the same problem (Fig. 1) is shown in Table 4. Processor allocation stops at Step 5, the moment all critical path nodes at prec_level (=1) get allocated two processors each (a total of 8, the number of processors available in the system). This is in spite of the fact that there still exists a possibility of further data parallelism exploitation, since Lcp > Ap. The final processor allocation using the MCPA algorithm is shown in Table 3.

MCPA ()
Input: Task Graph (T), Processor Graph (P)
Output: Mapping and assignment of tasks ∈ {T} on processors ∈ {P}
Begin
Obtain processor allocation using Proc_alloc ( ). // phase 1
Schedule the tasks using Task_sch ( ). // phase 2
End
---------------------------------------------------------------------------------------------------------------------------
Proc_alloc()
Begin
for (all tasks ti ∈ T )
Step 1: ni ← 1;
Step 2: Mark ti as unvisited; endfor
while ( Lcp > Ap )
Step 3:   CN ← Get the set of critical nodes;
Step 4:   Mark all ti ∈ CN as visited;
Step 5:   for all ti ∈ CN
              Ni ← Get the number of processors allocated at prec_level ℓi for the visited tasks;
Step 6:   topt ← Optimal task ti ∈ CN such that ( Ni < p ) and gain G = w(i, ni)/ni − w(i, ni+1)/(ni+1) is maximized;
Step 7:   nopt ← nopt + 1;
Step 8:   Mark all ti ∈ CN except topt as unvisited;
Step 9:   Modify bottom_level_costs and top_level_costs of affected tasks;
endwhile
End
---------------------------------------------------------------------------------------------------------------------------
Task_sch( )
Begin
Sort tasks in decreasing order of bottom_level_cost priority;
while (not all tasks are scheduled) do
Schedule ti on the first ni processors becoming free;
endwhile
End

Fig. 5. Pseudo code for the MCPA algorithm.
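For comparison with the CPA sketch after Fig. 3, phase 1 of MCPA differs only in the 'suitability' filter. Again lcp, area and critical_tasks are assumed helpers, and level[i] is the Prec_level of ti; this is a sketch under those assumptions:

def mcpa_alloc(tasks, w, p, level, lcp, area, critical_tasks):
    n = {i: 1 for i in tasks}
    while lcp(n) > area(n):
        cn = critical_tasks(n)
        # Condition 2: processors already granted to critical tasks at the
        # same precedence level must still be fewer than p
        def at_level(i):
            return sum(n[j] for j in cn if level[j] == level[i])
        cand = [i for i in cn if n[i] < p and at_level(i) < p]
        if not cand:
            break                                   # no 'suitable' task: phase ends
        best = max(cand, key=lambda i: w(i, n[i]) / n[i]
                   - w(i, n[i] + 1) / (n[i] + 1))
        n[best] += 1
    return n

On the example of Fig. 1 this reproduces the trace of Table 4: once t2..t5 hold two processors each, the eight processors at prec_level 1 are exhausted and allocation stops, even though Lcp > Ap still holds.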

Table 4
Running trace of processor allocation phase of MCPA algorithm

Step  Lcp  Ap     Maximum gain  P_level  No. of critical  Processors already  Processor            ni        ni
                  node (ti)     (ℓi)     nodes at ℓi      allocated at ℓi     availability at ℓi   (before)  (after)
1     276  131    t2            1        4                4                   Yes                  1         2
2     276  133.5  t3            1        4                5                   Yes                  1         2
3     276  136    t4            1        4                6                   Yes                  1         2
4     276  138.5  t5            1        4                7                   Yes                  1         2
5     156  141    t2            1        4                8                   No                   Not allowed

4.2. Task scheduling phase

In the second phase, tasks are scheduled using a list-scheduling approach, with priorities decided on the basis of bottom_level_cost, so as to have a fair comparison with the similar priority based CPA algorithm. The highest priority ready node is assigned to its allocated number of processors (calculated in the first phase) that become free first. The start and finish times on the assigned processors are then calculated using the expressions given in Section 2. The schedule generated by the algorithm is shown in Fig. 6. It may be seen that the proposed heuristic succeeds in making a more balanced processor allocation (maybe at the cost of giving up some data parallelism) in the first phase, which allows all four tasks at prec_level = 1 to execute concurrently in the second phase, resulting in a schedule length (ω = 156) much better than that of the CPA algorithm (ω = 216).

Fig. 6. Schedule generated by MCPA algorithm.
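The phase-2 list scheduler shared by CPA and MCPA (Task_sch in Figs. 3 and 5) can be sketched as below. Sorting by bottom_level_cost respects precedence, since with positive costs a predecessor always has a strictly larger bottom level than any of its successors; tie-breaking and data staging are simplified here:

import heapq

def task_sch(n, b, preds, comm, w, p):
    ready = [0.0] * p                         # per-processor ready times p_k^R
    F = {}                                    # finish times
    for i in sorted(n, key=lambda i: -b[i]):  # highest bottom_level_cost first
        dat = max((F[j] + comm[(j, i)] for j in preds[i]), default=0.0)   # Eq. (2)
        ks = heapq.nsmallest(n[i], range(p), key=lambda k: ready[k])      # first n_i free
        s = max([dat] + [ready[k] for k in ks])                           # Eq. (3)
        F[i] = s + w(i, n[i])                                             # Eq. (4)
        for k in ks:
            ready[k] = F[i]
    return max(F.values())                    # makespan, Eq. (7)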
The complexity of the algorithm is calculated as follows. The main time consuming loop in the first phase is the while loop, which may be repeated at most v × p times. The various priority levels can be computed in O(v + e) in a depth-first manner (e being the number of edges in the task graph). Once the priorities are calculated, the critical path task set can be generated in O(v). The inner for loop (Step 5) can be repeated at most v times, with the complexity of the processor count calculation at a given precedence level being O(W), where W indicates the width, or the maximum number of nodes at any prec_level, of the DAG. As a result, the worst case complexity of the processor allocation phase is O(vp(vW + e)). For the second phase, the priorities can be calculated in O(v + e) and the sorting may be done in O(v log v). Task scheduling in the while loop takes O(vp); thereby, the overall worst case complexity of the MCPA algorithm comes out to be O(vp(vW + e)), which is marginally higher than that of CPA, O(v²p + evp), owing to the presence of the task graph parameter W. The task graph structure plays a crucial role: for problems possessing higher task parallelism, the MCPA algorithm's improvement shall come at a somewhat higher cost, whereas for densely connected graphs (with e ≈ v²) the complexity shall be the same (O(pv³)) for both the CPA and MCPA algorithms. However, in comparison to the CPR algorithm (O(ev²p + v³p log v + v³p² log p)), the complexity of MCPA is lower by at least an order of magnitude.

5. Experimental set-up

Performance of the MCPA algorithm is evaluated against a set of task graphs taken from [15,16]¹ that were used for comparing various M-task scheduling algorithms. The task graph suite consists of two real world applications, i.e. Matrix Multiplication (Matmul) with matrix sizes 32 × 32, 64 × 64, and 128 × 128, and Strassen Matrix Multiplication (Strassen) with matrix sizes 32 × 32, 64 × 64, 128 × 128, and 512 × 512. The Strassen algorithm substitutes multiplications by additions, thereby reducing the number of multiplications to be computed and resulting in a complexity lower than the classical O(M³) for an (M × M) matrix. The computation and communication costs for these application graphs were borrowed from the authors of [15,16], who estimated them by running the applications on an actual cluster of workstations. In addition, a synthetic task graph suite, along with the computation and communication costs, used in the evaluation of the CPA algorithm was also taken from the same authors. It consists of elementary structures like butterfly, tree, and diamond (Fig. 7) and 10 randomly generated task graphs. The number of nodes in these synthetic graphs varies from 9 to 22, the communication to computation cost ratio CCR ≤ 0.2 (a CCR less than 1 reflects the coarse grained nature of task graphs, as is generally the case with mixed parallelism scheduling problems), and the serial fraction α (as referred to in Eq. (1)) is 0.2, with α = Σ_{i=0}^{v−1} w_i^s / Σ_{i=0}^{v−1} w(i, 1). The number of processors (p) in the system varies from 2 to 64.
Performance metrics adopted are Normalized Schedule Length (NSL) and Speedup (the ratio of the makespans generated by the algorithm for a uniprocessor and a multiprocessor system).

¹ We are thankful to the authors for providing the implementation of their algorithms and task graphs.

[Figure: benchmark MDG topologies — Diamond, Tree, Butterfly, Matmul, and Strassen.]
Fig. 7. Some of the benchmark MDGs used for performance comparison.

NSL measures the schedule length normalized with respect to CPR, which is one of the best available algorithms in terms of makespan. An NSL greater/smaller than one directly reflects the degradation/improvement suffered by an algorithm with respect to the CPR algorithm. An algorithm generating higher Speedup and lower NSL (≤1), within reasonable time constraints, is much sought after for the optimum utilization of resources.
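Stated as code, the two metrics are (assuming the conventional uniprocessor-over-multiprocessor ratio for Speedup):

def nsl(makespan, makespan_cpr):
    # NSL < 1: improvement over CPR; NSL > 1: degradation w.r.t. CPR
    return makespan / makespan_cpr

def speedup(makespan_uniproc, makespan_multiproc):
    return makespan_uniproc / makespan_multiproc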
The algorithms chosen for comparison are TSAS [18], CPA [15] and CPR [16], as all of these algorithms have been designed for arbitrary task graphs, and further, the TSAS and CPA algorithms are based on a similar two-step approach. A comparison with respect to other algorithms, such as TwoL [20], TASK (ni = 1 ∀ti ∈ T), and DATA (ni = p ∀ti ∈ T), has not been undertaken as, on average, the CPR algorithm has been shown to outperform them all by a considerable margin for the tested graph suites [16]. A comparison with the CPR algorithm thus provides an indirect comparison to them as well.
A comparison with the CPA algorithm is rather obvious, as it quantifies the benefits of the modified processor allocation scheme on the overall performance, since the two algorithms (CPA and MCPA) differ only in their processor allocation phase. For the CPA and TSAS algorithms we have used the results supplied by the authors [15]. It is to be mentioned that, due to the non-availability of the original code (based on the convex programming approach), the TSAS results were generated by the GENOCOP non-linear solver [11]. The CPR algorithm was implemented at our own end, as results were not available for the complete graph suite. The task and processor graph parameters are fed as inputs to the scheduling algorithm, and the allocation and assignment of tasks on different processors is available as the output.

6. Performance comparison

6.1. Matmul application

Simulated performance results for the Matmul applications are presented in Figs. 8 and 9. These results are very much in favor of the MCPA algorithm for different matrix sizes and numbers of processors in the system. For the Matmul application (32 × 32, 64 × 64 and 128 × 128), in comparison to the parent CPA algorithm, the schedule length improvement of the MCPA algorithm ((ωCPA − ωMCPA)/ωCPA), on average over different network sizes, is 30%, 29%, and 20%, respectively. Further, the schedule length degradation suffered by MCPA with respect to the CPR algorithm is zero, as the algorithm succeeds in generating matching schedule lengths for all the network and matrix sizes.

[Figure: NSL vs. number of processors (2–64) for Matmul 32 × 32, 64 × 64, and 128 × 128, comparing CPR, MCPA, CPA, and TSAS.]
Fig. 8. Relative performance of MCPA algorithm for Matmul application.

[Figure: Speedup vs. number of processors (2–64) for Matmul 32 × 32, 64 × 64, and 128 × 128, comparing MCPA and CPA.]
Fig. 9. Speedup comparison for Matmul.

The performance of the MCPA algorithm improves with the increasing number of processors available in the system, and the improvement is somewhat marginal for smaller numbers of processors. For example, for the coarse grain Matmul (128 × 128) application, the performance improvement of MCPA over the CPA algorithm is rather small for p = 4. Further, for p = 2 the performance of the CPA and MCPA algorithms is almost identical. This may be explained as follows.
With the number of processors limited to just two, the possibility of exploiting data or even task parallelism at any precedence level gets too restricted to appreciate the spirit behind the suggested algorithm. For example, in the Matmul graph (same as in Fig. 1), all four macro nodes (at p_level = 1) are equally critical and hence the MCPA algorithm allocates them an equal share of the available processors (i.e. one). This forces the algorithm to behave just like a pure task parallel (TASK) algorithm, unable to exploit any data parallelism, which is essential for the success of a scheduling algorithm in the mixed parallel environment, especially for coarse grain graphs. As a result, only marginal or no improvements are observed for coarse grained graphs with a small number of available processors in the system.
In Fig. 9, the performance comparison between the CPA and MCPA algorithms is given in terms of Speedup, which helps in quantifying the improvement achieved by using the modified processor allocation scheme suggested in this work. Performance improvements are consistent beyond p = 4, as the potential behind the heuristic gets fully exploited with higher numbers of processors in the system. For Matmul (32 × 32, 64 × 64, and 128 × 128), the average speedup improvements (over all the network sizes) of the proposed algorithm are about 46%, 44% and 29%, respectively, which is the same as that of the CPR algorithm.

6.2. Strassen application

For the Strassen application, performance is shown in Figs. 10 and 11. In comparison to the CPA algorithm, for Strassen (32 × 32, 64 × 64, 128 × 128, and 512 × 512), the average schedule length generated by the MCPA algorithm is better by 12%, 16%, 12%, and 14%, respectively, while the maximum average makespan degradation suffered with respect to the CPR algorithm, for different matrix sizes, is within 4%.

[Figure: NSL vs. number of processors (2–64) for Strassen 32 × 32, 64 × 64, 128 × 128, and 512 × 512, comparing CPR, MCPA, CPA, and TSAS (no TSAS results for 512 × 512).]
Fig. 10. Relative performance of MCPA algorithm for Strassen application.

[Figure: Speedup vs. number of processors (2–64) for Strassen 32 × 32, 64 × 64, 128 × 128, and 512 × 512, comparing MCPA and CPA.]
Fig. 11. Speedup comparison for Strassen.

The corresponding average speedup improvement over the CPA algorithm (Fig. 11) is 11%, 22%, 15%, and 28%, respectively, and the maximum average degradation with respect to the CPR algorithm, for different matrix sizes, is within 2%. It may be seen that the MCPA algorithm is able to provide these results with a much lower time complexity in comparison to the CPR algorithm.

6.3. Synthetic and random task graphs

In Fig. 12, average performance results for the synthetically generated benchmark application graphs are presented. These results strongly support our previous observation that the MCPA algorithm is better able to allocate processors beyond a certain network size. In Table 5, we have summed up the average makespan and speedup degradations (over all network sizes) suffered by the various algorithms, for different graph types, in comparison to the CPR algorithm. For calculating an average result, say average makespan, the schedule length generated by an algorithm for a particular network size is divided by the schedule length of the CPR algorithm, and the average of these normalized schedule lengths over all the network sizes is then taken and reported. For the random graphs, before normalizing, the schedule lengths are averaged over all 10 graphs for a particular network size and then normalized with respect to the corresponding CPR average makespan; the average of these results over all the network sizes is then computed and reported. Performance improvement beyond p = 8 is evident in almost all task graphs.
For the random task graphs, the average NSL degradation suffered by the TSAS, CPA and MCPA algorithms is about 22%, 17% and 6%, respectively. The maximum makespan degradation suffered by MCPA for the tested graph suites is about 8%, and the minimum 0%. For the parent CPA algorithm it comes out to be 46% and 11%, respectively. The TSAS algorithm also does not benefit much from its optimal processor allocation, which has to be rounded off to near integer values in later steps. The maximum and minimum performance degradations, in terms of makespan, suffered by the TSAS algorithm are 80% and 8%, respectively. The corresponding values for maximum and minimum speedup degradation for the TSAS, CPA, and MCPA algorithms are 40% and 6%, 30% and 9%, and 6% and 0%, respectively.
The performance of the MCPA algorithm is thus quite significant, with minimum degradations. All these observations can be summarized as follows.
For p = 2, all algorithms generate comparable schedules, as the number of processors is too small to exploit task or even data parallelism effectively. Further, in the worst case, when the task parallelism in the task graph is higher than p and tasks are equally critical at a given prec_level, the MCPA algorithm allocates resources equally among the nodes and thus behaves like a pure S-task scheduling algorithm, unable to exploit data parallelism. That is why the performance of the MCPA algorithm deteriorates for p = 4, especially for coarse grain graphs (128 × 128 and 512 × 512).

[Figure: NSL vs. number of processors (2–64) for the Butterfly, Diamond, Tree, and Random graphs, comparing CPR, MCPA, CPA, and TSAS.]
Fig. 12. Relative performance of MCPA algorithm for synthetic benchmark graphs.

Table 5
Average makespan and speedup degradation for all network sizes w.r.t. CPR algorithm

                        Avg. NSL degradation          Avg. speedup degradation
Graph type              TSAS (%)  CPA (%)  MCPA (%)   TSAS (%)  CPA (%)  MCPA (%)
Matmul (32 × 32)        29        46       0          21        30       0
Matmul (64 × 64)        10        44       0          6         29       0
Matmul (128 × 128)      25        29       0          16        20       0
Strassen (32 × 32)      8         20       4          7         16       2
Strassen (64 × 64)      15        23       2          11        18       2
Strassen (128 × 128)    80        18       2          40        14       2
Strassen (512 × 512)    –         31       2          –         22       2
Butterfly               23        11       5          17        9        9
Diamond                 26        28       7          19        20       6
Tree                    36        25       8          25        18       6
Random                  22        17       6          18        12       5

A minus (–) sign indicates an improvement instead of a degradation.

However, beyond p = 8, the other algorithms get trapped into exploiting more data parallelism even when the schedule length improvement is quite marginal, thereby increasing the average computational area and hence the overall makespan in the end. The MCPA algorithm, in contrast, may get trapped only after taking into consideration the most effective critical task parallelism in the graph. Therefore, beyond p = 8, the MCPA algorithm tends to improve upon the other algorithms, as reflected in Fig. 12 and Table 5.

7. Conclusions

A low complexity two-step M-task scheduling algorithm is proposed for arbitrary task graphs, based on the Modified Critical Path and Area-based processor allocation heuristic, which can be employed in the first phase of existing two-step M-task scheduling algorithms as well. In the available two-step algorithms, the processor allocation phase works more or less in isolation, without gauging the effect of its decisions on the second phase. As a result, these algorithms at times get trapped into exploiting too much data parallelism, overlooking the feasibility of exploiting crucial task parallelism in the later steps. The MCPA algorithm, in contrast, succeeds in bridging this gap to an appreciable extent. The strength of the algorithm lies in its judicious allocation of processors, which restricts the exploitation of data parallelism the moment it starts interfering with the crucial task parallelism available in the graph, thereby achieving a more balanced processor allocation, especially in a resource constrained set-up. It preserves the essential task parallelism, at the cost of data parallelism sometimes, if that is crucial for the overall performance, and still retains the low-complexity features of a multi-step algorithm. Performance of the suggested algorithm under the stated assumptions is presented for real and synthetic application graphs, which indicates the MCPA algorithm to be a potential candidate for scheduling arbitrary task graphs exhibiting mixed parallelism.

References

[1] K.P. Belkhale, P. Banerjee, A scheduling algorithm for parallelizable dependent tasks, in: Proc. of Int. Parallel Processing
Symposium, Anaheim, CA, April 1991, pp. 500–506.
[2] K.P. Belkhale, P. Banerjee, An approximation algorithm for the partitionable independent task scheduling problem, in: Proc. of 19th
Int. Conference on Parallel Processing, St. Charles, IL, August 1990, pp. 72–75.
[3] S. Chakrabarti, J. Demmel, K. Yelick, Modeling the benefits of mixed data and task parallelism, in: Seventh Annual ACM
Symposium on Parallel Algorithms and Architectures, CA, July 1995, pp. 74–83.
[4] S. Chakrabarti, J. Demmel, K. Yelick, Models and scheduling algorithms for mixed data and task parallel programs, Journal of
Parallel and Distributed Computing 47 (9) (1997) 168–184.
[5] I.T. Foster, K.M. Chandy, Fortran M: a language for modular parallel programming, Journal of Parallel and Distributed Computing
26 (1) (1995) 24–35.
[6] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Co, 1979.

[7] T. Gross, D. O’Hallaron, J. Subhlok, Task parallelism in a high performance Fortran framework, IEEE Parallel Distributed
Technology 2 (3) (1994) 6–26.
[8] M. Gupta, P. Banerjee, Compile time estimation of communication costs on multicomputers, in: Proc. of 6th Int. Parallel Processing
Symposium, Beverley Hills, CA, March 1992, pp. 470–475.
[9] M. Gupta, Automatic data partitioning on distributed memory multicomputers, PhD thesis, Department of Computer Science,
University of Illinois, Urbana, IL, September 1992.
[10] S. Ben Hassen, H.E. Bal, C.J. Jacobs, A task and data parallel programming language based on shared objects, ACM Transactions on
Programming Languages and Systems 20 (6) (1998) 1131–1170.
[11] S. Koziel, Z. Michalewicz, Evolutionary algorithms, homomorphous mappings and constrained parameter optimization,
Evolutionary Computation 7 (1) (1999) 19–44.
[12] Y.K. Kwok, I. Ahmad, Benchmarking and comparison of the task graph scheduling algorithms, Journal of Parallel and Distributed
Computing 59 (3) (1999) 381–422.
[13] R. Lepere, D. Trystram, G.J. Woeginger, Approximation algorithms for scheduling malleable tasks under precedence constraints,
International Journal of Foundations of Computer Science 13 (4) (2002) 613–627.
[14] G. Prasanna, A. Agarwal, B.R. Musicus, Hierarchical compilation of macro dataflow graphs for multiprocessors with local memory,
IEEE Transactions on Parallel and Distributed Systems 5 (7) (1994) 720–736.
[15] A. Radulescu, A.J.C. van Gemund, A low cost approach towards mixed task and data parallel scheduling, in: Proc. of the 15th Int.
Conf. on Parallel Processing, Valencia, Spain, September 2001, pp. 69–76.
[16] A. Radulescu, C. Nicolescu, A.J.C. van Gemund, P.P. Jonker, CPR: mixed task and data parallel scheduling for distributed systems,
in: Proc. of the 15th Int. Parallel and Distributed Processing Symposium (IPDPS), San Francisco, April 2001, pp. 39–47.
[17] S. Ramaswamy, P. Banerjee, Processor allocation and scheduling of macro dataflow graphs on distributed memory multicomputers
by the PARADIGM compiler, in: Proc. of the 22nd Int. Conf. on Parallel Processing, St. Charles, IL, August 1993, pp. 134–138.
[18] S. Ramaswamy, S. Sapatnekar, P. Banerjee, A framework for exploiting task and data parallelism on distributed memory
multicomputers, IEEE Transactions on Parallel and Distributed Systems 8 (11) (1997) 1098–1115.
[19] S. Ramaswamy, Simultaneous exploitation of task and data parallelism in regular scientific applications, PhD thesis, University of
Illinois at Urbana-Champaign, 1996.
[20] T. Rauber, G. Runger, Compiler support for task scheduling in hierarchical execution models, Journal of Systems Architecture 45 (6–7) (1999) 483–503.
[21] H. El-Rewini, T.G. Lewis, H.H. Ali, Task Scheduling in Parallel and Distributed Systems, Prentice Hall, NJ, 1994.
[22] J. Subhlok, B. Yang, A new model for integrated nested task and data parallel programming, in: Proc. of the Sixth ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, June 1997, pp. 1–12.
[23] J. Subhlok, G. Vondran, Optimal use of mixed task and data parallelism for pipelined computations, Journal of Parallel and
Distributed Computing 60 (3) (2000) 297–319.
[24] J. Subhlok, J. Stichnoth, D. O'Hallaron, T. Gross, Exploiting task and data parallelism on a multicomputer, in: Proc. of the Fourth
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993, pp. 13–22.
[25] F. Suter, F. Desprez, H. Casanova, From heterogeneous task scheduling to heterogeneous mixed parallel scheduling, in: Proc. Euro-
Par 2004, Lecture Notes in Computer Science 3149 (2004) 230–237.
[26] J. Turek, J.L. Wolf, P.S. Yu, Approximate algorithms for scheduling parallelizable tasks, in: Proc. of 5th ACM Symposium on
Parallel Algorithms and Architectures (SPAA), 1992, pp. 323–332.
