1 INTRODUCTION
[Fig. 1. (a) Degree distribution of the Twitter follower graph (ratio of nodes (%) vs. degrees). (b) Distribution of update counts for dynamic PageRank on the same graph (ratio of nodes (%) vs. number of updates).]
2 MOTIVATION
Social network analysis is used to study the behavior of human communities. However, because of humans' group behavior, some FEPs may need large amounts of computation and communication in each iteration and may take many more iterations to converge than others. This generates stragglers that slow down the analysis process.
Consider a Twitter follower graph [18], [22] containing 41.7 million vertices and 1.47 billion edges.
Fig. 1(a) shows the degree distribution of the graph.
It can be seen that less than one percent of the vertices in the graph are adjacent to nearly half of the edges. Also, most vertices have fewer than ten neighbors, and most of them choose to follow the same vertices because of the collective behavior of humans. This implies that the tasks hosting this small percentage of vertices will bear an enormous computation load while others are left underloaded. With persistence-based load balancers [21] and retentive work stealing [21], which cannot support the computation decomposition of straggling FEPs, high load imbalance will be generated for the analysis program.
For some applications, the data dependency graph of FEPs may be known only at run time and may change dynamically. This not only makes it hard to evaluate each task's load, but also leaves some computers underutilized after most features converge in early iterations. Fig. 1(b) shows the distribution of update counts after a dynamic PageRank algorithm converges on the Twitter follower graph. From this figure, we can observe that the majority of the vertices require only a single update, while about 20% of the vertices require more than 10 updates to reach convergence. Again, this means that some computers will undertake much more workload than others. To tackle this challenge, the task-level load-balancing approach based on profiled load cost [19] has to pay a significant overhead to periodically profile the load cost of each data object and to repartition the whole data set across iterations. For PowerGraph [20], which statically factors computation tasks according to the edge distribution of the graph, this challenge renders its static task planning ineffective and leaves some computers underutilized, especially once most features have converged in early iterations.
3 OUR APPROACH

3.1 Main idea
$$x_i(k) = G_i\big(Y_1^i(k-1),\, Y_2^i(k-1),\, \ldots,\, Y_t^i(k-1)\big) \tag{1}$$

[Figure: The decomposition model. A virtual master runs $G_i()$ and Acc(), while virtual workers each run Extra() on their local subsets.]
$$Y_h^i(k-1) = A_{hm}^i, \quad h = 1, 2, \ldots, t$$
$$A_{hl}^i = \mathrm{Acc}\big(A_{h(l-1)}^i,\; S_{hl}^i(k-1)\big)$$
$$S_{hl}^i(k-1) = \mathrm{Extra}\big(L_l^i(k-1)\big)$$
$$\bigcup_{l=1}^{m} L_l^i(k-1) = L^i(k-1) \tag{2}$$

where $A_{h0}^i$ is a constant value and $l = 1, 2, \ldots, m$.
The function Extra() is used to extract the local attributes of subset $L_l^i(k-1)$, Acc() is used to gather and accumulate these local attributes, and $G_i()$ is used to calculate the value of $x_i(k)$ for the $k$th iteration from the accumulated attributes.
Clearly, this decomposition ensures the two required properties: the result of each decomposable part, such as $Y_h^i(k-1)$, can be calculated solely from the results of its sub-processes, such as $S_{hl}^i(k-1)$, and these sub-processes are independent of each other. Note that Equation 2 is only one of the possible ways of decomposition. More specifically, the analysis process mainly runs in the following way.
First, all decomposable parts of a straggling feature's calculation are evenly factored into several sub-processes, and each sub-process, namely Extra(), only processes the values in the subset $L_l^i(k-1)$, where the values of $L_l^i(k-1)$ are broadcast to each worker at the beginning of each iteration. To achieve this, the set $L^i(k-1)$ needs to be partitioned into equal-sized subsets that are evenly distributed over the cluster. When all assigned sub-processes for the feature have finished, a worker only needs to accumulate and process the attribute results of these subsets to obtain the final value of the feature. In this way, the load of a straggling FEP's decomposable parts is evenly distributed over the cluster.
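To make this concrete, the following is a minimal Python sketch of the Extra()/Acc()/$G_i()$ decomposition for one straggling feature; the attribute extracted is a plain sum (as in PageRank-style accumulation), and all names, values, and the subset layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the decomposition in Equation 2 for one straggling
# feature x_i. The attribute extracted here is a plain sum; real FEPs
# may extract maxima, counts, or other attributes listed in Table 1.

def extra(subset):
    """Sub-process: extract the local attribute of one subset L^i_l(k-1)."""
    return sum(subset)

def acc(partials):
    """Gather and accumulate the local attributes of all sub-processes."""
    return sum(partials)

def g_i(accumulated, damping=0.85, n=10_000):
    """Compute x_i(k) from the accumulated attribute (PageRank-like rule)."""
    return (1 - damping) / n + damping * accumulated

# L^i(k-1), the values x_i(k) depends on, partitioned into m equal-sized
# subsets that would be spread over the workers of the cluster.
values = [0.0001] * 10_000   # hypothetical dependency values
m = 4                        # hypothetical number of sub-processes
subsets = [values[l::m] for l in range(m)]

partials = [extra(s) for s in subsets]   # runs in parallel on workers
x_i_k = g_i(acc(partials))               # runs after all partials arrive
print(x_i_k)
```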
However, for non-straggling features, the input set $L^i(k-1)$ of the extraction process may contain only a few values, whose number is determined by the dependency graph of the features. Thus, in practice, the input values of sub-processes are processed in units of blocks. Specifically, each task processes a block of values, and a block may contain the input values of different sub-processes; this block partitioning is similar to a traditional partitioning of a data set. In effect, each straggling feature is processed as if it were a collection of ordinary block-level tasks.
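As a rough illustration of this block-based processing, the sketch below packs the possibly tiny input sets of many sub-processes into fixed-size blocks; the record layout and block size are assumptions for the example only.

```python
def pack_into_blocks(subprocess_inputs, block_size=1024):
    """Pack sub-process inputs into fixed-size blocks.

    subprocess_inputs: iterable of (feature_id, values) pairs; a
    non-straggling feature may contribute only a handful of values.
    Each emitted block is a list of (feature_id, value) records, so a
    single block can carry the inputs of several different sub-processes.
    """
    blocks, block = [], []
    for feature_id, vals in subprocess_inputs:
        for v in vals:
            block.append((feature_id, v))
            if len(block) == block_size:
                blocks.append(block)
                block = []
    if block:                      # flush the last, partially filled block
        blocks.append(block)
    return blocks
```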
TABLE 1
Notation used in the paper.

$I$: Raw data object set $I = \{D_1, D_2, \ldots, D_n\}$ (each data object owns a feature).
$x_i(k)$: Feature value of the $i$th data object $D_i$ at the $k$th iteration.
$L(k)$: The feature set at the $k$th iteration, where $L(k) = \{x_1(k), x_2(k), \ldots, x_n(k)\}$.
$L^i(k)$: A subset of $L(k)$ that contains all the values needed by the calculation of $x_i(k+1)$.
$L_l^i(k)$: The $l$th subset of set $L^i(k)$.
$Y_h^i(k)$: The $h$th attribute of the set $L^i(k)$; it can be the maximum value, minimum value, count, sum of values, and so on. It is the result of a decomposable part of the calculation of $x_i(k+1)$; in other words, the code calculating $Y_h^i(k)$ is a decomposable part of the process for $x_i(k+1)$.
$S_{hl}^i(k)$: The $h$th attribute of the set $L_l^i(k)$.
$T_{origin}(O)$: Processing time of feature $O$ in an iteration without partitioning and distributing its decomposable parts.
$T_{real}(O)$: Processing time of feature $O$ in an iteration after partitioning and distributing its decomposable parts.
$c$: Constant time to accumulate the result values of two partitioned blocks of a feature.
$B(O)$: Total number of partitioned blocks for feature $O$.
$T_b$: Processing time needed before the redistribution of blocks.
$T_a$: Processing time needed after the redistribution of blocks.
$C$: Extra time caused by redistribution.
$T_s$: Processing time spared by the redistribution of blocks.
$N$: Number of redistributed blocks.
The above computational decomposition only creates the opportunity for straggling FEPs to be processed with a roughly balanced load among tasks. After the decomposition, some workers may still be more heavily loaded than others. For example, as described in Section 2, most FEPs converge during the first several iterations; the workers assigned these converged FEPs may then become idle. Consequently, the runtime needs to decide whether blocks should be redistributed among workers, based on the previous load distribution, to balance the load across all workers and to accelerate the convergence of the remaining FEPs.
Now, we show the details of deciding when to redistribute blocks according to a given cluster's condition. In practice, whenever the periodically profiled remaining load of the workers is received, or a worker becomes idle, the runtime determines whether a block redistribution is needed. It redistributes blocks only when
$$T_a < T_b - C. \tag{4}$$
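A minimal sketch of this decision rule follows; $T_b$, $T_a$, and $C$ are assumed to be estimates profiled by the runtime, and how they are obtained is not shown here.

```python
def should_redistribute(t_before, t_after, redistribution_cost):
    """Redistribute blocks only when T_a < T_b - C (Equation 4), i.e.
    when the time saved outweighs the extra cost of moving blocks."""
    return t_after < t_before - redistribution_cost

# Example: moving blocks saves 14s of an 80s iteration at a 5s cost.
print(should_redistribute(t_before=80.0, t_after=66.0,
                          redistribution_cost=5.0))   # True (66 < 75)
```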
21:       $L_m$ ← min($L_c$ − $L_t$, $F_t$.LoadRemaining)
22:       $N(F_t)$ ← ⌊$L_m$ / $T_{origin}(F_t)$ · $B_{max}(F_t)$⌋
23:       /* Insert $N(F_t)$ of $F_t$'s
24:          unprocessed blocks into $B_{set}$. */
25:       $B_{set}$.InsertStraggling($F_t$, $N(F_t)$)
26:       $L_t$ ← $L_t$ + $L_m$
27:    end if
28: end while
29: /* Transform all blocks contained in $B_{set}$. */
30: Transform($W_d$, $B_{set}$)
31: end procedure
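The listing above preserves only lines 21-31 of the redistribution procedure. The sketch below reconstructs that visible core in Python under stated assumptions: `load_capacity` stands for the load the destination worker can still absorb, and each feature object is assumed to carry the profiled fields used in lines 21-22; none of these names come from the paper.

```python
import math

def pick_blocks(straggling_features, load_capacity):
    """Select unprocessed blocks of straggling features for one
    underloaded worker, mirroring lines 21-31 of the listing above."""
    b_set, l_t = [], 0.0                     # blocks chosen, load moved
    for f in straggling_features:
        if l_t >= load_capacity:
            break
        l_m = min(load_capacity - l_t, f.load_remaining)     # line 21
        n_f = math.floor(l_m / f.t_origin * f.b_max)         # line 22
        b_set.extend(f.unprocessed_blocks[:n_f])             # lines 23-25
        l_t += l_m                                           # line 26
    return b_set        # transferred to the worker in one batch (line 30)
```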
Performance analysis
With SAE, the decomposable parts of straggling features can be evenly distributed over workers. Assume the computational skew of the decomposable parts is $\lambda_o$ after the work distribution; then the execution time of the decomposable parts is

$$E_d = E \cdot DPR \cdot (1 + \lambda_o), \tag{9}$$

$$\lambda_o = \frac{N_{All}}{N_{Work}} - 1. \tag{10}$$
TABLE 2
Programming interfaces

Extra(FeatureSet Oset): Employed to extract the local attributes of the features in set Oset.

Barrier(FeatureSet Oset): Checks whether all attributes needed by the features contained in Oset are available. Oset may also contain a single feature.

Get(Feature O): Gathers the results of all distributed sub-processes for the decomposable parts of feature O's calculation.

Acc(FeatureSet Oset): Employed to gather and accumulate the local attributes calculated by the distributed sub-processes, namely Extra(), for the features contained in Oset. It can be executed only when all attributes needed by the features in Oset are available.

Diffuse(FeatureSet Dset, FeatureSet Oset): Spreads the values of the features contained in Oset to all workers processing the user-specified features contained in Dset for the next iteration. In reality, it only notifies the runtime of where the values need to be transferred; the actual transfer is performed by the runtime at the end of each iteration, batched together with other messages.

Termination(): Checks whether the termination condition is met after each iteration.
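To illustrate how these interfaces compose, here is a hedged sketch of one iteration of a feature-update loop written against them; the `runtime` object, the `dependents()` helper, and the exact call order are assumptions based only on the descriptions in Table 2, not the actual SAE API.

```python
# Hypothetical one-iteration driver written against the Table 2
# interfaces. Everything here is a sketch of how the calls might
# compose; the real SAE runtime defines the actual binding to workers.

def iteration(runtime, my_features, local_subsets):
    # Each worker extracts the local attributes of the subsets it hosts.
    for subset in local_subsets:
        runtime.Extra(subset)

    # Block until every attribute needed by my_features is available.
    runtime.Barrier(my_features)

    # Accumulate the distributed local attributes, then read the results.
    runtime.Acc(my_features)
    for f in my_features:
        f.value = runtime.Get(f)

    # Spread the new values to workers hosting dependent features; the
    # actual transfer is batched by the runtime at the end of the
    # iteration, together with other messages.
    runtime.Diffuse(runtime.dependents(my_features), my_features)

    return runtime.Termination()
```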
$$\frac{E_p}{E_o} = \frac{1 + \lambda_L}{1 + DPR \cdot \lambda_o + (1 - DPR) \cdot \lambda_L + \lambda_O} \tag{13}$$
From this analysis, we can observe that for an algorithm with a low decomposable part ratio $DPR$ or a low imbalance degree $\lambda_L$, the performance may not improve much. Thus, our approach is best suited to applications with a high load imbalance degree and a high decomposable part ratio.
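As a quick numeric check of this observation, the snippet below evaluates Equation 13 with the symbols as reconstructed above; the sample values are illustrative, not measured results.

```python
def speedup(dpr, skew_o, imb_l, overhead_o):
    """E_p / E_o from Equation 13 (symbols as reconstructed above)."""
    return (1 + imb_l) / (1 + dpr * skew_o + (1 - dpr) * imb_l + overhead_o)

# High imbalance and a highly decomposable workload: a large gain.
print(speedup(dpr=0.9, skew_o=0.1, imb_l=2.0, overhead_o=0.05))  # ~2.24
# Low imbalance or low DPR: almost nothing to gain.
print(speedup(dpr=0.2, skew_o=0.1, imb_l=0.2, overhead_o=0.05))  # ~0.98
```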
4 IMPLEMENTATION

4.1 Programming model
[Figure: The SAE execution flow of one iteration. The master performs computation distribution and maintains a synchronous table; workers run Extra() and Acc(); Get, Barrier, SysBarrier, and Diffuse coordinate the calculation, message distribution, and message receipt between master and workers.]
5 EXPERIMENTAL EVALUATION
TABLE 3
Data sets summary

Twitter graph: # nodes 41.7 million, # edges 1.47 billion.
TSS: # vehicles 1 billion.
The computational imbalance degree and the communication imbalance degree are defined as

$$\lambda_L = \frac{L_{max}}{L_{avg}} - 1, \tag{14}$$

$$\lambda_C = \frac{C_{max}}{C_{avg}} - 1. \tag{15}$$
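Both metrics reduce to the same helper over per-worker statistics; the load values below are hypothetical.

```python
def imbalance_degree(per_worker):
    """L_max/L_avg - 1 (Eq. 14), or C_max/C_avg - 1 (Eq. 15) when fed
    per-worker communication volumes instead of loads."""
    avg = sum(per_worker) / len(per_worker)
    return max(per_worker) / avg - 1

loads = [10.0, 12.0, 11.0, 27.0]   # hypothetical per-worker loads
print(imbalance_degree(loads))     # 0.8: one straggling worker
```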
Fig. 6. The computational skew of different schemes tested on the Twitter graph and TSS, respectively.

Fig. 7. The convergence conditions of the Adsorption algorithm with different schemes over the Twitter graph.

Fig. 8. The communication skew of different schemes tested on the Twitter graph and TSS, respectively.
[Figure: Ratio (%) of the Adsorption, PageRank, Katz metric, and CCom benchmarks on the Twitter graph and TSS.]
5.2 Runtime overhead
Next, the execution time breakdown of the benchmarks under our approach is evaluated to show its runtime overhead, as depicted in Fig. 10.
Fig. 10. Execution time breakdown of different benchmarks executed over SAE.

Fig. 11. The relative computational overhead against SAE tested on the Twitter graph and TSS, respectively.

Fig. 12. The relative communication overhead against SAE tested on the Twitter graph and TSS.
From Fig. 10, we can observe that the time wasted by our approach occupies at most 21.0% of the total execution time. For the Katz metric algorithm, the wasted time occupies only 12.1% of an iteration's total execution time. Moreover, the time to process features and the time to communicate intermediate results between iterations reach up to 72.7% and 15.2%, respectively.
The extra computational and communication overheads of our approach and of current solutions are presented in Fig. 11 and Fig. 12, respectively. Note that the results in these two figures are the computational and communication overheads relative to those of SAE. We can observe that the computational and communication costs of our approach are higher than those of PLB and RWS, because its runtime overhead includes the time for identifying and decomposing straggling FEPs as well as for redistributing their computation. Taking the Twitter graph as an example, for the PageRank algorithm the computational overhead of SAE is 1.92 times that of PLB and 1.73 times that of RWS. Yet, as shown in the following part, this overhead is negligible compared with the benefits gained from the skew resistance for straggling FEPs. There is clearly a trade-off between the incurred overhead and the imbalance degree of the social network analysis. From Fig. 11 and Fig. 12, we can also observe that the runtime overhead of PowerGraph is almost the same as that of SAE, while the computational and communication overheads of PUC are 8.4 and 5.5 times those of SAE, respectively.
5.3 Performance
From Fig. 6(a) and Fig. 6(b), we can observe that SAE guarantees both a low computational imbalance degree and a low communication imbalance degree via straggler-aware execution. This also demonstrates that the load imbalance caused by the remaining computational part of straggling FEPs is small. Taking the Adsorption algorithm as an example, its computational imbalance degrees are only 0.12 and 0.18 for the Twitter graph and TSS, respectively, and its communication imbalance degrees are only 0.33 and 0.36. Because of such low computational and communication imbalance degrees, our approach achieves better performance than current solutions. The speedups of current solutions relative to PLB on the above benchmarks are evaluated; the results are shown in Fig. 13(a) and Fig. 13(b).
[Fig. 13. The speedups of different schemes (PowerGraph, RWS, PUC, PLB, SAE) relative to PLB, tested on the Twitter graph and TSS.]

[Figure: Speedup as the number of cores increases from 16 to 256, on the Twitter graph and TSS.]
6 RELATED WORKS
7 CONCLUSION
For social network analysis, the convergence of straggling FEPs may require a significant number of iterations as well as very large amounts of computation and communication in each iteration, inducing serious load imbalance. Current solutions to this problem either incur significant overhead, cannot exploit underutilized computers after some features converge in early iterations, or perform poorly because of the high load imbalance among the initial tasks.
This paper identifies that most of the computation of a straggling FEP is decomposable. Based on this observation, it proposes a general approach to factor a straggling FEP into several sub-processes, along with a method to adaptively distribute these sub-processes over workers in order to accelerate its convergence. This paper also provides a programming model along with an efficient runtime to support this approach. Experimental results show that it can greatly improve the performance of social network analysis over state-of-the-art approaches. As shown in Section 5, the master in our approach may become a bottleneck. In future work, we will study how to employ our approach in a hierarchical way to reduce the memory overhead and will evaluate its performance gain.
ACKNOWLEDGMENTS
This work was supported by the National High-tech Research and Development Program of China (863 Program) under grant No. 2012AA010905, China