Abstract—With the rapidly growing demand for graph processing in real scenarios, systems have to efficiently handle massive numbers of concurrent jobs. Although existing work can efficiently handle a single graph processing job, much memory access redundancy arises because the characteristic of data access correlation is ignored. Motivated by this observation, we propose a two-level scheduling strategy in this paper, which enhances the efficiency of data access and accelerates the convergence of concurrent jobs. Firstly, correlation-aware job scheduling allows concurrent jobs to process the same graph data in the cache, which fundamentally alleviates the problem of the CPU repeatedly accessing the same graph data in memory. Secondly, multiple-priority-based data scheduling provides support for prioritized iteration of concurrent jobs, based on a global priority generated from the individual priority of each job. At the same time, we adopt block-level priority instead of fine-grained priority to schedule graph data, which decreases the computation cost. In particular, the two-level scheduling significantly advances over the state of the art because it works in the layer between data and systems.
Index Terms—Graph processing, concurrent jobs, data access correlation, prioritized iteration
1 INTRODUCTION
Many enterprises, such as Twitter and Facebook, widely use graph processing to analyze data, because graphs clearly express the inherent interdependencies between entities (e.g., in social networks and transportation graphs). Therefore, in recent years, related research has attracted extensive attention from academia and industry. To process graphs efficiently, a great number of graph computing platforms have been proposed. Distributed systems exploit the powerful computation resources of clusters to process large graphs, such as Pregel [6], GraphLab [10], PowerGraph [8], GraphX [9], etc. Alternatively, single-machine systems, like GraphChi [11], X-Stream, GridGraph, etc., are able to handle graphs with billions of edges by using secondary storage efficiently, and largely eliminate the challenges of synchronization overhead [5, 7] and load imbalance [4] found in distributed frameworks.
However, with the rapidly growing demand for graph processing in real scenarios, the number of concurrent graph processing jobs in data analysis platforms has greatly increased. For example, the Didi data analysis platform needed to perform more than 9 billion route plannings daily in 2017, i.e., about 6 million per minute. Therefore, it is important to handle concurrent jobs more efficiently. Based on the observation that most concurrent jobs run over the same graph structure data, prior work on concurrent jobs includes Seraph [1], which tries to solve the problems of inefficient memory use and high fault-tolerance cost by decoupling the data model from the computing logic. This means that multiple jobs are able to share the same graph structure data in memory, which improves the utilization of computing resources. An important characteristic caused by jobs sharing the same graph structure data in memory is data access correlation: different concurrent jobs access the same nodes (or edges) in the graph data. However, existing graph processing systems do not exploit this feature, resulting in inefficient access to data in memory.
Firstly, graph processing is known for poor locality, which is attributed to the random accesses made while traversing neighborhood nodes. What is worse, locality degrades further under concurrent jobs, because the jobs access data independently, causing the CPU to repeatedly request the same data in memory at different moments for different jobs. In addition, the number of repeated accesses to graph data goes up as the number of jobs increases. Two different concurrent jobs will altogether lead the CPU to access the same nodes (or edges) twice in one iteration. Hence, in a conventional hierarchical memory system, this causes a significant number of data movements, which in turn results in low data-access performance and inefficient cache use. We call this challenge Memory Access Redundancy.
Secondly, iterative computation is widely used in graph processing. Although some systems, such as PrIter [2], can accelerate the convergence of iterative algorithms through prioritized execution, they incur Memory Access Redundancy and more random accesses when processing multiple concurrent jobs. For example, when multiple jobs have intersections in their priority queues but access them separately, Memory Access Redundancy arises. Moreover, the computation cost of fine-grained priority (each node has a priority) can be decreased by replacing it with a slightly coarser-grained priority.
In this paper, according to the characteristic of data access correlation, we try to solve the above problems with two effective scheduling policies: correlation-aware job scheduling (CAJS for short) and multiple priority data scheduling (MPDS for short). First of all,
————————————————
xxxx-xxxx/0x/$xx.00 © 200x IEEE Published by the IEEE Computer Society
2 IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID
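The Memory Access Redundancy described in the introduction can be illustrated with a minimal sketch (the block IDs and job count are made-up values, not from our system): when n jobs each traverse the same set of graph blocks independently, every block is fetched from memory n times, whereas coalescing the jobs' accesses per block fetches each block once.

```python
# Sketch: memory fetches under independent vs. correlation-aware access.
# Block IDs and job count are illustrative assumptions.

def independent_fetches(blocks, num_jobs):
    """Each job traverses all blocks on its own: one fetch per (job, block)."""
    fetches = []
    for _job in range(num_jobs):
        for block in blocks:
            fetches.append(block)  # the same block is fetched once per job
    return fetches

def coalesced_fetches(blocks, num_jobs):
    """Correlation-aware order: fetch a block once, let all jobs consume it."""
    fetches = []
    for block in blocks:
        fetches.append(block)      # a single fetch serves all num_jobs jobs
    return fetches

blocks = ["D1", "D2", "D3", "D4"]
print(len(independent_fetches(blocks, 3)))  # 12 fetches for 3 jobs
print(len(coalesced_fetches(blocks, 3)))    # 4 fetches for 3 jobs
```

The fetch count of the independent mode grows linearly with the number of jobs, which is exactly the redundancy that CAJS targets.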
The current mode in which concurrent jobs access graph structure data in memory is shown in Figure 3. It depicts accesses at three times (T1, T2, and T3) in one iteration. At T1, when Job1 accesses D2, the rest of the jobs (e.g., Job2, Jobn, etc.) may access other data blocks. Therefore, D2 is transferred to the cache line with ID j to facilitate CPU access to D2. After a while (T2), Jobn accesses Di, and Di is also mapped to cache line j, so D2 in that cache line is replaced by Di. But Job2 may access D2 at T3. The CPU then cannot find D2 in the cache, because D2 has been replaced by Di, so D2 needs to be reloaded into the cache. We can clearly see that D2 is copied from main memory to the cache twice in this mode. Hence, the CPU cache hit ratio is reduced.
Fig. 4: Cache miss rate as the number of concurrent jobs increases (cache miss rate in %, roughly 20-60%, versus 1-8 jobs, for pagerank, wcc, and bfs).
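The D2/Di conflict above can be reproduced with a toy direct-mapped cache model (the mapping of both blocks to line j is the illustrative assumption from Figure 3):

```python
# Toy direct-mapped cache: an access to a block not currently resident in
# its cache line counts as a load from memory.
def count_loads(accesses, line_of):
    cache = {}       # cache line id -> block currently resident in that line
    loads = 0
    for block in accesses:
        line = line_of[block]
        if cache.get(line) != block:
            loads += 1           # miss: copy the block from memory into the line
            cache[line] = block  # this evicts whatever block was there before
    return loads

line_of = {"D2": "j", "Di": "j"}        # D2 and Di conflict on cache line j
accesses = ["D2", "Di", "D2"]           # T1: Job1, T2: Jobn, T3: Job2
print(count_loads(accesses, line_of))   # 3 loads: D2 is copied twice
```

With correlation-aware scheduling, the two accesses to D2 would be adjacent and the second one would hit in the cache.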
Apparently, as the number of concurrent jobs increases, the amount of identical data in memory that is accessed by the CPU at different times also increases in practice. We therefore record the change of the cache miss rate (as shown in Figure 4) and the percentages of CPU cache stall (the CPU time spent waiting for requested data due to cache misses) and CPU execution time (the time the CPU keeps computing, as shown in Figure 5) as the number of concurrent jobs increases. We can see that the performance of the cache decreases, because the cache hit rate drops while the number of cache misses grows. A very important reason is that the same data is accessed by different jobs at different times, resulting in the same data being transferred to the cache many times. Therefore, in this work, we propose a method that reduces these negative impacts by reducing the number of times data is copied from main memory to the cache, or from a slower cache to a faster cache. We introduce this method in detail in Section 4.
Fig. 3: Current mode of data access (①, ②, ③ represent T1, T2, and T3): jobs Job 1, Job 2, …, Job n on a single node keep job-specific data, share graph data blocks D1, D2, …, Di in memory, and D2 and Di both map to cache line j.
Fig. 5: CPU execution and CPU cache stall over sd1-arc.
2.2 Slow Convergence
Iterative computation is a universal method in big-data mining algorithms, such as social network analysis, recommendation systems, and so on. Hence, existing graph processing algorithms basically adopt iterative computation, and it is essential to speed up convergence in the iterative computations of large-scale data mining algorithms. PrIter [2] enables fast iterative computation by providing support for prioritized iteration: it extracts the subset of data with higher priority values to perform the iterative updates. Importantly, the priority value of each node is determined by its influence on convergence (e.g., ΔP_j^k in PageRank).
However, the existing methods are only suitable for accelerating the convergence of a single job, not of concurrent jobs. Because the algorithm characteristics and computation states of concurrent jobs differ, the priority values of the jobs are not identical. For example, suppose there are three concurrent jobs, Job1, Job2, and Job3, whose priority queues are {A, B, C, D}, {B, C, D, E}, and {C, D, E, F}, respectively. The CPU then needs to access nodes {A, B, C, D, E, F} for the concurrent jobs, and node C may be transferred into the cache three times. Hence, prioritizing iteration separately based on the priority values of each job will cause most of the data in memory to be accessed and will make the prioritized iteration inefficient.
In particular, graph data is usually shared in memory to save memory resources for concurrent jobs. In single-machine graph processing systems, if one job finishes its prioritized iteration within an iteration first, the finished job must wait for the other jobs to finish their prioritized iterations before new graph data can be transferred into memory. So, if the finished job continues to compute other nodes with low priority values while it waits, we believe the job will converge more quickly. At the same time, because graph data partitions must be loaded into memory frequently in single-machine systems and the number of iterations increases under prioritized iteration, another very important problem is that the secondary
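The queue-intersection effect in the Job1/Job2/Job3 example above can be sketched as follows (the per-job queues are the hypothetical ones from the text):

```python
from collections import Counter

job_queues = [
    ["A", "B", "C", "D"],   # Job1's priority queue
    ["B", "C", "D", "E"],   # Job2's priority queue
    ["C", "D", "E", "F"],   # Job3's priority queue
]

# Separate prioritized iteration: each job walks its own queue independently.
separate = [node for queue in job_queues for node in queue]
print(Counter(separate)["C"])   # node C is accessed 3 times

# Merging into one global order lets a single access to C serve all three jobs.
global_order = sorted(set(separate))
print(global_order)                        # ['A', 'B', 'C', 'D', 'E', 'F']
print(len(separate), len(global_order))    # 12 vs. 6 node accesses
```

Deduplicating the per-job queues into one global schedule is the intuition behind the global priority queue used by MPDS.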
Fig. 7: Generate the global priority queue. Concurrent jobs Job 1, Job 2, Job 3, … on a single node keep job-specific data and share the graph data blocks B1, B2, …, Bi in memory; the global priority queue (…, B2, Bi, …) decides which blocks are loaded into cache line j.
Fig. 9: The use of APIs. A new job calls initPtable; the job controller then runs Con_processing with De_In_Priority and De_Gl_Priority; unconverged jobs continue iterating, and converged jobs exit.
• De_In_Priority: decides the priority queue of each job at block granularity.
• De_Gl_Priority: decides the global priority queue.
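Based on Figure 9, the control flow of these APIs might be sketched as below. Only the names initPtable, Con_processing, De_In_Priority, and De_Gl_Priority come from the paper; the function bodies, the priority update rule, and the fixed round count are placeholder assumptions, not our actual implementation.

```python
# Hypothetical sketch of the Fig. 9 control flow; bodies are placeholders.
def init_ptable(job):
    job["ptable"] = {b: 1.0 for b in job["blocks"]}   # initial block priorities

def de_in_priority(job, q):
    # Individual queue: the q highest-priority blocks of this job.
    return sorted(job["ptable"], key=job["ptable"].get, reverse=True)[:q]

def de_gl_priority(jobs, q):
    # Global queue: merge individual queues, rank blocks by summed priority.
    score = {}
    for job in jobs:
        for b in de_in_priority(job, q):
            score[b] = score.get(b, 0.0) + job["ptable"][b]
    return sorted(score, key=score.get, reverse=True)[:q]

def con_processing(jobs, q, rounds):
    for job in jobs:
        init_ptable(job)
    for _ in range(rounds):                 # stand-in for "until convergence"
        for b in de_gl_priority(jobs, q):
            for job in jobs:
                if b in job["ptable"]:
                    job["ptable"][b] *= 0.5  # stand-in for one block update

jobs = [{"blocks": ["B1", "B2", "B3"]}, {"blocks": ["B2", "B3", "B4"]}]
con_processing(jobs, q=2, rounds=3)
print(sorted(jobs[0]["ptable"]))   # ['B1', 'B2', 'B3']
```

The key point the sketch captures is that every iteration schedules blocks from the merged global queue, so a block shared by several jobs is processed once for all of them.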
greatly and causes more frequent priority queue maintenance operations. On the other hand, if the priority queue length is too large, the amount of computation per iteration increases, so the method differs little from traditional ones, while maintaining the priority queues still incurs overhead.
In order to determine the priority queue length q based on block scheduling, we first review the priority queue length Q based on nodes. In PrIter, the priority queue length for each individual task is selected as Q = C * sqrt(V_N), where V_N is the number of nodes in the graph structure and C is a constant, 100 by default. This selection of Q obtains a good convergence acceleration effect, so we hope to apply the same method to prioritized iteration based on block scheduling. Hence, we set q = Q / V_B, where V_B is the number of nodes per block. If the number of blocks is B_N, then V_N = B_N x V_B, and we obtain the optimal length of the global priority queue:
q = C * B_N / sqrt(V_N)    (4)
where C again defaults to 100.
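The relation between the node-level length Q and the block-level length q of Eq. (4) can be checked numerically (the graph and block sizes below are made-up values):

```python
import math

C = 100                      # constant from PrIter, 100 by default

def node_queue_length(v_n):
    """PrIter's node-based priority queue length: Q = C * sqrt(V_N)."""
    return C * math.sqrt(v_n)

def block_queue_length(b_n, v_n):
    """Block-based global queue length (Eq. 4): q = C * B_N / sqrt(V_N)."""
    return C * b_n / math.sqrt(v_n)

V_N = 1_000_000              # nodes in the graph (illustrative)
B_N = 1_000                  # number of blocks (illustrative)
V_B = V_N / B_N              # nodes per block

Q = node_queue_length(V_N)
q = block_queue_length(B_N, V_N)
print(Q)          # 100000.0
print(q)          # 100.0
print(Q / V_B)    # 100.0 -- matches q = Q / V_B
```

Substituting V_B = V_N / B_N into q = Q / V_B reproduces Eq. (4) exactly, which the last line verifies.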
6 RELATED WORK
Distributed graph processing systems partition the graph data across the nodes of a cluster and exploit its powerful computation resources. However, due to the strong correlation within graph data and its power-law degree distribution, there is a great deal of synchronization overhead and load imbalance in distributed environments. These challenges are the bottleneck of distributed graph processing systems.
At the same time, single-machine graph processing systems are able to handle graphs with billions of edges by using secondary storage efficiently, and largely eliminate the above challenges of distributed frameworks. In this kind of system, graph data partitions are processed in memory in order, while the remaining partitions are stored in secondary storage. Therefore, users do not require powerful distributed clusters when using such systems, nor the ability to manage and tune them. With the development of hardware technology in recent years, the limited computing power and system resources that used to be the bottleneck of a single machine can be effectively alleviated. Firstly, many commodity single-node servers can easily extend memory to hundreds of GBs or even TBs. Secondly, current accelerators offer much higher massive parallelism and memory access bandwidth than traditional CPUs, which has the potential to deliver high-performance graph processing, such as Garaph and CuSha based on GPUs, and other systems based on FPGAs.
7 CONCLUSION
With the rapidly growing demand for graph processing in real scenarios, we have to efficiently handle massive numbers of concurrent jobs. In order to solve the problems of existing methods, we proposed a two-level scheduling strategy in this paper, which enhances the efficiency of data access and accelerates the convergence of concurrent jobs. Firstly, correlation-aware job scheduling allows concurrent jobs to process the same graph data in the cache, which enhances the cache hit rate. Secondly, multiple priority data scheduling provides support for prioritized iteration of concurrent jobs according to a global priority generated from the individual priority of each job.
REFERENCES
[1] Xue J, Yang Z, Qu Z, et al. Seraph: an efficient, low-cost system for concurrent graph processing. 2014.
[2] Zhang Y, Gao Q, Gao L, et al. PrIter: a distributed framework for prioritized iterative computations. In: ACM Symposium on Cloud Computing. ACM, 2011: 1-14.
[3] Liao X, Jin H, Liu Y, Ni L, Deng D. AnySee: peer-to-peer live streaming. In: Proc. INFOCOM, 2006: 1-10.
[4] Wang P, Zhang K, Chen R, Chen H, Guan H. Replication-based fault-tolerance for large-scale graph processing. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 2014: 562-573.
[5] Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams D, Kalnis P. Mizan: a system for dynamic load balancing in large-scale graph processing. In: Proceedings of the 8th ACM European Conference on Computer Systems.
[6] Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing. In: ACM SIGMOD International Conference on Management of Data. ACM, 2010: 135-146.
[7] Zhao Y, Yoshigoe K, Xie M, Zhou S, Seker R, Bian J. LightGraph: lighten communication in distributed graph-parallel processing. In: 2014 IEEE International Congress on Big Data. IEEE, 2014: 717-724.
[8] Gonzalez J E, Low Y, Gu H, et al. PowerGraph: distributed graph-parallel computation on natural graphs. In: USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2012: 17-30.
[9] Gonzalez J E, Xin R S, Dave A, et al. GraphX: graph processing in a distributed dataflow framework. In: USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2014: 599-613.
[10] Low Y, Gonzalez J E, Kyrola A, et al. GraphLab: a new framework for parallel machine learning. Computer Science, 2010.
[11] Kyrola A, Blelloch G, Guestrin C. GraphChi: large-scale graph computation on just a PC. In: USENIX Conference on Operating Systems Design and Implementation. 2012: 31-46.