
Extreme Scale Breadth-First Search on Supercomputers

Koji Ueno
Tokyo Institute of Technology, Tokyo, Japan
kojiueno5@gmail.com

Toyotaro Suzumura
IBM T.J. Watson Research Center, New York, USA
suzumura@acm.org

Naoya Maruyama
RIKEN, Kobe, Japan
nmaruyama@riken.jp

Katsuki Fujisawa
Kyushu University, Fukuoka, Japan
fujisawa@imi.kyushu-u.ac.jp

Satoshi Matsuoka
Tokyo Institute of Technology / AIST, Tokyo, Japan
matsu@acm.org

Abstract—Breadth-First Search (BFS) is one of the most fundamental graph algorithms, used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS for a one-trillion-vertex graph within half a second, using large supercomputers such as the K-Computer. By the use of our proposed algorithm, the K-Computer was ranked 1st in the Graph500 using all the 82,944 nodes available, in June and November 2015 and June 2016, achieving 38,621.4 GTEPS. Based on the hybrid-BFS algorithm by Beamer [3], we devise sets of optimizations for scaling to an extreme number of nodes, including a new efficient graph data structure and optimization techniques such as vertex reordering and load balancing. Performance evaluation on the K shows that our new BFS is 3.19 times faster on 30,720 nodes than the base version using the previously-known best techniques.

Keywords—Distributed memory; Breadth-First Search; Graph500

I. INTRODUCTION

Graphs have quickly become one of the most important data structures in modern big data applications, such as social media, the modeling of biophysical structures and phenomena such as the brain's synaptic connections, and interaction networks between proteins and enzymes, for predictive analysis. The common properties amongst such modern applications of graphs are their massive size and complexity, reaching up to billions of edges and trillions of vertices, resulting not only in tremendous storage requirements but also compute power.

With such high interest in the analytics of large graphs, a new benchmark called the Graph500 [1][15] was proposed in 2011. Since the predominant use of supercomputers had been for numerical computing, most HPC benchmarks, such as the Top500 Linpack, had been compute centric. The Graph500 benchmark instead measures data analytics performance, in particular for graphs, with a metric called traversed edges per second, or TEPS. More specifically, the benchmark measures the performance of breadth-first search (BFS), which is utilized as a kernel for important and more complex algorithms such as connected components analysis and centrality analysis. The target graph used in the benchmark is a scale-free, small-diameter graph called the Kronecker graph, known to model realistic graphs arising out of practical applications, such as web and social networks, as well as those that arise from life science applications. As such, attaining high performance on the Graph500 represents the important ability of a machine to process real-life, large-scale graphs arising from big-data applications.

We have conducted a series of works [13][14][15] to accelerate BFS in a distributed memory environment. Our new work extends the data structures and algorithm called Hybrid BFS [2], which is known to be effective for small-diameter graphs, so that it scales to top-tier supercomputers with tens of thousands of nodes and million-scale CPU cores connected by a multi-gigabyte/s interconnect. In particular, we apply our algorithm to Riken's K-Computer [16] with 82,944 compute nodes and 663,552 CPU cores, once the fastest supercomputer in the world on the Top500 with over 10 Petaflops. The result obtained is currently No. 1 on the Graph500 for three consecutive editions since 2015, with a significant TEPS performance advantage over the results obtained on other machines such as the Sequoia and TaihuLight supercomputers, which have millions of cores, superior FLOPS performance over K, and higher rankings on the Top500. This demonstrates that architectural properties other than the number of FPUs, as well as algorithmic advances, play a major role in attaining performance on graphs. In fact, the top ranks of the Graph500 have been dominated by large-scale supercomputers to date, while Clouds are missing; performance measurements reveal that this is fundamental, in that interconnect performance, central to supercomputing, plays a significant role in the overall performance of large-scale BFS.
II. BACKGROUND: HYBRID BFS

A. The Base Hybrid BFS Algorithm

We first describe the background BFS algorithms, including the hybrid algorithm proposed in [2]. Figure 1 shows the standard sequential textbook BFS algorithm. Starting from the source vertex, the algorithm conducts the search by effectively expanding the "frontier" set of vertices in a breadth-first manner from the root. We refer to this search direction as "top-down".
Function breadth-first-search (vertices, source)
1. frontier ← {source}
2. next ← {}
3. parents ← [-1,-1,…,-1]
4. while frontier ≠ {} do
5. | top-down-step (vertices, frontier, next, parents)
6. | frontier ← next
7. | next ← {}
8. return parents

Function top-down-step (vertices, frontier, next, parents)
9. for v ∈ frontier do
10. | for n ∈ neighbors[v] do
11. | | if parents[n] = -1 then
12. | | | parents[n] ← v
13. | | | next ← next ∪ {n}

Figure 1. Top-Down BFS

Function bottom-up-step (vertices, frontier, next, parents)
1. for v ∈ vertices do
2. | if parents[v] = -1 then
3. | | for n ∈ neighbors[v] do
4. | | | if n ∈ frontier then
5. | | | | parents[v] ← n
6. | | | | next ← next ∪ {v}
7. | | | | break

Figure 2. A Step in Bottom-Up BFS

Function hybrid-bfs (vertices, source)
1. frontier ← {source}
2. next ← {}
3. parents ← [-1,-1,…,-1]
4. while frontier ≠ {} do
5. | if next-direction() = top-down then
6. | | top-down-step (vertices, frontier, next, parents)
7. | else
8. | | bottom-up-step (vertices, frontier, next, parents)
9. | frontier ← next
10. | next ← {}
11. return parents

Figure 3. Hybrid BFS [2]

A contrasting approach is "bottom-up" BFS, as shown in Figure 2. This approach starts from the vertices that have not been visited, and iterates with each step investigating whether a frontier node is included among a vertex's direct neighbors. If it is, then the vertex is added to the frontier of visited nodes for the next iteration. In general, this "bottom-up" approach is more advantageous than top-down when the frontier is large, as it will quickly identify and mark many nodes as visited. On the other hand, top-down is advantageous when the frontier is small, as bottom-up will then result in wasteful scanning of many unvisited vertices and their edges without much benefit.

For large but small-diameter graphs such as the Kronecker graph used in the Graph500, the hybrid BFS algorithm [2] (Figure 3), which heuristically minimizes the number of edges to be scanned by switching between top-down and bottom-up, has been identified as very effective in significantly increasing the performance of BFS.
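The next-direction() policy in Figure 3 is the heart of the hybrid scheme. As a rough illustration only, and not the policy used in this work, the following C++ sketch shows the kind of heuristic described in [2]: switch to bottom-up when the frontier touches a large fraction of the remaining edges, and switch back when the frontier shrinks again. The parameter names and default values (alpha, beta) are illustrative assumptions.

#include <cstdint>

enum class Direction { TopDown, BottomUp };

// Illustrative sketch of a direction-switching heuristic in the spirit of [2].
// m_frontier : sum of degrees of the current frontier vertices
// m_unvisited: sum of degrees of the vertices not yet visited
// n_frontier : number of frontier vertices, n: total number of vertices
// alpha, beta: tuning parameters (the values below are assumptions).
Direction next_direction(uint64_t m_frontier, uint64_t m_unvisited,
                         uint64_t n_frontier, uint64_t n,
                         Direction current,
                         double alpha = 14.0, double beta = 24.0) {
  if (current == Direction::TopDown) {
    // Bottom-up pays off once the frontier covers a large share of the
    // remaining edges, because scanning unvisited vertices then finds
    // parents quickly.
    return (m_frontier > m_unvisited / alpha) ? Direction::BottomUp
                                              : Direction::TopDown;
  }
  // Switch back to top-down when the frontier has become small again.
  return (n_frontier < n / beta) ? Direction::TopDown : Direction::BottomUp;
}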
B. Parallel and Distributed BFS Algorithm

In order to parallelize the BFS algorithm over distributed memory machines, it is necessary to spatially partition the graph. A proposal by Beamer et al. [3] conducts 2-D partitioning of the adjacency matrix of the graph, as shown in Figure 4, where the adjacency matrix A is partitioned into R × C submatrices.

Figure 4. R × C partitioning of adjacency matrix A

Each submatrix is assigned to a compute node; the compute nodes themselves are virtually arranged into an R × C mesh, each being assigned a 2-D index P(i, j). Figures 5 and 6 illustrate the top-down and bottom-up parallel-distributed algorithms with such a partitioning scheme. In the figures, P(:, j) means all the processors in the j-th column of the 2-D processor mesh, and P(i, :) means all the processors in the i-th row of the 2-D processor mesh. Line 7 of Figure 5 performs the allgatherv communication operation among all the processors in the j-th column, and Line 12 performs the alltoallv communication operation among all the processors in the i-th row.

In Figures 5 and 6, f, n, and π correspond to frontier, next, and parents in the base sequential algorithms, respectively. Allgatherv() and alltoallv() are standard MPI collectives. The top-down algorithm requires transpose-vector(), allgatherv(), and alltoallv() in each iteration. In [4], transpose-vector() and allgatherv() are referred to as Expand, and the latter computation including alltoallv() as Fold. For bottom-up, the Expand part is the same as in top-down, but the Fold part is different; for each search iteration, the Fold step consists of C sub-steps. Beamer [3]'s proposal encodes f, c, n, and w as 1 bit per vertex for optimization.
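As a minimal sketch, under the assumption that MPI ranks are laid out row-major on the R × C mesh (not necessarily the layout used in this work), the processor groups P(i, :) and P(:, j) used by the collectives above can be created with MPI_Comm_split:

#include <mpi.h>

// Minimal sketch: build the row and column communicators of the R x C
// virtual processor mesh, given the number of columns C.
void build_grid(int C, MPI_Comm* row_comm, MPI_Comm* col_comm) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int i = rank / C;  // row index of P(i, j), assuming row-major placement
  int j = rank % C;  // column index
  // Ranks sharing the same i form P(i, :), used for alltoallv in the Fold step.
  MPI_Comm_split(MPI_COMM_WORLD, i, j, row_comm);
  // Ranks sharing the same j form P(:, j), used for allgatherv in the Expand step.
  MPI_Comm_split(MPI_COMM_WORLD, j, i, col_comm);
}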
Parallel-Distributed Hybrid BFS is similar to the sequential algorithm in Figure 3, heuristically switching between top-down and bottom-up at each iteration step, being essentially a hybrid of the algorithms in Figures 5 and 6.

Function parallel-2D-top-down ( A, source )
1. f ← {source}
2. n ← {}
3. π ← [-1,-1,…,-1]
4. for all compute nodes P(i, j) in parallel do
5. | while f ≠ {} do
6. | | transpose-vector( f_i,j )
7. | | f_i = allgatherv( f_i,j , P(:, j) )
8. | | t_i,j ← {}
9. | | for u ∈ f_i do
10. | | | for v ∈ A_i,j(:, u) do
11. | | | | t_i,j ← t_i,j ∪ {( u, v )}
12. | | w_i,j ← alltoallv( t_i,j , P(i, :) )
13. | | for ( u, v ) ∈ w_i,j do
14. | | | if π_i,j(v) = -1 then
15. | | | | π_i,j(v) ← u
16. | | | | n_i,j ← n_i,j ∪ v
17. | | f ← n
18. | | n ← {}
19. return π

Figure 5. Parallel-Distributed 2-D Top-Down Algorithm [3]
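For concreteness, the Fold communication of Figure 5 (Line 12) can be realized with a single MPI_Alltoallv once the (u, v) pairs in t_i,j have been bucketed by their owner rank within P(i, :). The sketch below is a hedged illustration under our own buffer-layout assumptions, not the code used in this work.

#include <mpi.h>
#include <cstdint>
#include <vector>

struct EdgePair { int64_t u, v; };

// Hedged sketch of the Fold step: scatter the (u, v) pairs of t_{i,j} to the
// ranks of the row communicator P(i, :) that own the destination vertices.
// buckets[k] is assumed to already hold the pairs destined for rank k.
std::vector<EdgePair> fold_alltoallv(const std::vector<std::vector<EdgePair>>& buckets,
                                     MPI_Comm row_comm) {
  int P;
  MPI_Comm_size(row_comm, &P);
  std::vector<int> scnt(P), sdsp(P), rcnt(P), rdsp(P);
  std::vector<EdgePair> sendbuf;
  for (int k = 0; k < P; ++k) {
    sdsp[k] = static_cast<int>(sendbuf.size()) * 2;     // displacement in int64 units
    scnt[k] = static_cast<int>(buckets[k].size()) * 2;  // 2 x int64 per pair
    sendbuf.insert(sendbuf.end(), buckets[k].begin(), buckets[k].end());
  }
  // Exchange the per-destination counts first, then the pairs themselves.
  MPI_Alltoall(scnt.data(), 1, MPI_INT, rcnt.data(), 1, MPI_INT, row_comm);
  int total = 0;
  for (int k = 0; k < P; ++k) { rdsp[k] = total; total += rcnt[k]; }
  std::vector<EdgePair> recvbuf(total / 2);
  MPI_Alltoallv(sendbuf.data(), scnt.data(), sdsp.data(), MPI_INT64_T,
                recvbuf.data(), rcnt.data(), rdsp.data(), MPI_INT64_T, row_comm);
  return recvbuf;  // corresponds to w_{i,j} in Figure 5
}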
Function parallel-2D-bottom-up ( A, source )
1. f ← {source} -> bitmap for frontier
2. c ← {source} -> bitmap for completed
3. n ← {}
4. π ← [-1,-1,…,-1]
5. for all compute nodes P(i, j) in parallel do
6. | while f ≠ {} do
7. | | transpose-vector( f_i,j )
8. | | f_i = allgatherv( f_i,j , P(:, j) )
9. | | for s in 0…C-1 do // C sub-steps
10. | | | t_i,j ← {}
11. | | | for u ∈ c_i,j do
12. | | | | for v ∈ A_i,j(u, :) do
13. | | | | | if v ∈ f_i then
14. | | | | | | t_i,j ← t_i,j ∪ {( v, u )}
15. | | | | | | c_i,j ← c_i,j \ u
16. | | | | | | break
17. | | | w_i,j ← sendrecv( t_i,j , P(i, j+s), P(i, j−s) )
18. | | | for ( v, u ) ∈ w_i,j do
19. | | | | π_i,j(v) ← u
20. | | | | n_i,j ← n_i,j ∪ v
21. | | | c_i,j ← sendrecv( c_i,j , P(i, j+1), P(i, j−1) )
22. | | f ← n
23. | | n ← {}
24. return π

Figure 6. Parallel-Distributed 2-D Bottom-Up Algorithm [3]
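The C sub-steps of Figure 6 (Lines 17 and 21) boil down to a ring shift within P(i, :). As a hedged illustration only, the completed bitmap c_i,j can be shifted in place with MPI_Sendrecv_replace; the neighbor arithmetic and the equal per-rank bitmap size are our assumptions.

#include <mpi.h>
#include <cstdint>
#include <vector>

// Hedged sketch of the ring shift used by the C sub-steps of Figure 6.
// 'bitmap' holds one bit per local vertex (the completed set c_{i,j}); after
// the call it holds the bitmap received from the neighboring column.
void shift_completed_bitmap(std::vector<uint64_t>& bitmap, MPI_Comm row_comm) {
  int rank, size;
  MPI_Comm_rank(row_comm, &rank);
  MPI_Comm_size(row_comm, &size);
  int send_to   = (rank + 1) % size;         // P(i, j+1), wrapping around
  int recv_from = (rank - 1 + size) % size;  // P(i, j-1), wrapping around
  MPI_Sendrecv_replace(bitmap.data(), static_cast<int>(bitmap.size()),
                       MPI_UINT64_T, send_to, 0, recv_from, 0,
                       row_comm, MPI_STATUS_IGNORE);
}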
III. PROBLEMS OF HYBRID BFS IN EXTREME-SCALE SUPERCOMPUTERS

Although the algorithm in Section II would work efficiently on a small-scale machine, for extremely large supercomputers, up to and beyond million-core scale towards exascale, various problems manifest themselves which severely limit the performance and scalability of BFS. We describe the problems in Section III, and present our solutions in Section IV.

A. Data Structure of the Adjacency Matrix

The data structure describing the adjacency matrix is of significant importance, as it directly affects the computational complexity of graph traversal. For small machines, the typical strategy is to employ the CSR (Compressed Sparse Row) format, commonly employed in numerical computing to express sparse matrices. However, we first show that direct use of CSR is impractical due to its memory requirements on a large machine; we then show that the existing proposed solutions, DCSR [7] and Coarse index + Skip list [8], which intend to reduce the footprint at the cost of increased computational complexity, are still insufficient for large graphs with significant computational requirements.

1) CSR (Compressed Sparse Row)
CSR utilizes two arrays: dst, which holds the destination vertex IDs of the edges in the graph, and row-starts, which describes the offset index of the edges of each vertex in the dst array. Given a graph with V vertices and E edges, the size of dst is E and that of row-starts is V, so the required memory in a sequential implementation is:

V + E    (1)

For a parallel-distributed implementation with R × C partitioning, if we assume that the edges and vertices are distributed evenly, since the number of rows in each distributed submatrix is V / R, the required memory per node is:

V / R + E / (RC)    (2)

Denoting the average number of vertices per node as V̄ and the average degree of the graph as d̂, the following hold:

V̄ = V / (RC),  E = V · d̂    (3)

so (2) can be expressed as follows:

V̄ · C + V̄ · d̂    (4)

This indicates that, for large machines, as C gets larger, the memory requirement per node increases, since the memory requirement of row-starts is V̄ · C. In fact, for very large graphs on machines with thousands of nodes, row-starts can become significantly larger than dst, making its straightforward implementation impractical.
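As a rough worked example of (4), consider the configuration used for Table I below: about 16 billion vertices, 256 billion edges (so d̂ = 16), and R × C = 64 × 32 = 2048 nodes, giving V̄ ≈ 16 × 10^9 / 2048 ≈ 8.4 million. Assuming 8-byte array entries (consistent with the measured values in Table I), dst then needs about V̄ · d̂ · 8 B ≈ 1.0 GB per node, while row-starts needs about V̄ · C · 8 B ≈ 2.1 GB per node, i.e., roughly twice the size of the edge array itself.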
There is a set of works that propose to compress row-starts, such as DCSR [7] and Coarse index + Skip list [8], but they involve non-negligible performance overhead, as we describe below.

2) DCSR
DCSR [7] was proposed to improve the efficiency of matrix-matrix multiplication in a distributed memory environment. The key idea is to eliminate the offset values for rows that have no non-zero values, thereby compressing row-starts. Instead, two supplemental data structures called the JC and AUX arrays are employed to calculate the appropriate offset in the dst array. The drawback is that one needs to iterate over the JC array, navigating from the AUX array, which results in significant overhead for the repeated accesses to sparse structures that are common in BFS.

3) Coarse index + Skip list
Another proposal [8] was made in order to efficiently implement breadth-first search with 1-D partitioning in a distributed memory environment. 64 rows worth of non-zero elements are batched into a skip list, and by having row-starts hold the pointer to the skip list, this method compresses the overall size of row-starts to 1/64th of the original. Since each skip list embodies 64 rows of data, all 64 rows can be traversed contiguously, making algorithms with batched row access efficient in addition to providing data compression. However, for sparse accesses, on average one would have to traverse and skip over 31 elements to reach the designated matrix element, potentially introducing significant overhead.

4) Other Sparse Matrix Formats
There are other known sparse matrix formats that do not utilize row-starts [9], significantly saving memory; however, although such formats are useful for algorithms that systematically iterate over all elements of a matrix, they perform badly for BFS, where individual accesses to the edges of a given vertex need to be efficient.

IV. OUR EXTREMELY SCALABLE HYBRID BFS

The problems associated with previous algorithms are largely the storage and communication overheads of extremely large graphs being analyzed over thousands of nodes or more. These are fundamental to the fact that we are handling irregular, large-scale "big" data structures and not floating point values. In order to alleviate these problems, we propose several solutions that are unique to graph algorithms.

A. Bitmap-based Sparse Matrix Representation

First, our proposed bitmap-based sparse matrix representation allows an extremely compact representation of the adjacency matrix, while still being very efficient in retrieving the edges of a given vertex. We compress the CSR row-starts data structure by only holding the starting positions of the edge sequences of vertices that have one or more edges, and then having an additional bitmap, one bit per vertex, to identify whether a given vertex has any edges or not.
Function make-offset (offset, bitmap)
1. i ← 0
2. offset[0] ← 0
3. for each word w of bitmap
4. | offset[i+1] ← offset[i] + popcount(w)
5. | i ← i + 1

Function row-start-end (offset, bitmap, row-starts, v)
6. w ← v / B
7. b ← (1 << (v mod B))
8. if (bitmap[w] & b) ≠ 0 then
9. | p ← offset[w] + popcount(bitmap[w] & (b-1))
10. | return (row-starts[p], row-starts[p+1])
11. return (0, 0) // Vertex v does not have an edge

Figure 7. Bitmap-based sparse matrix: algorithms to calculate the offset array and to identify the start and end indices of the edges of a given vertex.

Edges List: SRC = 0 0 6 7, DST = 4 5 3 1
CSR: Row-starts = 0 2 2 2 2 2 2 3 4, DST = 4 5 3 1
Bitmap-based Sparse Matrix Representation: Offset = 0 1 3, Bitmap = 10000011, Row-starts = 0 2 3 4, DST = 4 5 3 1
DCSR: AUX = 0 1 1 3, JC = 0 6 7, Row-starts = 0 2 3 4, DST = 4 5 3 1

Figure 8. Examples of the bitmap-based sparse matrix representation and other formats

In our bitmap-based representation, since the sequence of edges is held in row-starts in the same manner as CSR, the main point of the algorithm is how to identify the starting index of the edges of a given vertex efficiently, as shown in Figure 7. Here, B is the number of bits in a word (typically 64), "<<" and "&" are the bit-shift and bitwise-AND operators as in C, and mod is the modulo operator. Given a vertex v, the index position of v in row-starts corresponds to the number of vertices with non-zero edges preceding v, which is equivalent to the number of bits that are 1 leading up to the v-th position in the bitmap. We further optimize this calculation by counting the summation of the number of 1 bits on a word-by-word basis and storing it in the offset array. This effectively allows constant-time calculation of the number of non-zero bits preceding v by looking at the offset value and the number of bits that are one leading up to v's position in that particular word.
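The lookup of Figure 7 maps almost directly onto C++. The following is a minimal sketch assuming 64-bit words and GCC/Clang's __builtin_popcountll; it is not the implementation used in this work.

#include <cstdint>
#include <utility>
#include <vector>

// Minimal sketch of the bitmap-based row lookup of Figure 7 (B = 64).
struct BitmapCSR {
  std::vector<uint64_t> bitmap;     // one bit per vertex: does it have edges?
  std::vector<int64_t> offset;      // per-word running popcount (make-offset)
  std::vector<int64_t> row_starts;  // compressed: only vertices with edges
  std::vector<int64_t> dst;         // destination vertex IDs
};

// Returns the [begin, end) range in dst holding the edges of vertex v,
// or (0, 0) if v has no edges, following row-start-end of Figure 7.
inline std::pair<int64_t, int64_t> row_range(const BitmapCSR& g, int64_t v) {
  int64_t w = v / 64;
  uint64_t bit = 1ULL << (v % 64);
  if ((g.bitmap[w] & bit) == 0) return {0, 0};  // isolated vertex
  // Rank of v among the vertices that have edges: set bits in the words
  // before w (precomputed in offset) plus the set bits below v in word w.
  int64_t p = g.offset[w] + __builtin_popcountll(g.bitmap[w] & (bit - 1));
  return {g.row_starts[p], g.row_starts[p + 1]};
}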
Figure 8 shows a comparative example of the bitmap-based sparse matrix representation for a graph with 8 vertices and 4 edges. As we can observe, much of the repetitive waste arising in CSR from the relatively small number of edges compared to vertices is minimized. Table I shows the actual savings we achieve over CSR in a real setting of the Graph500 benchmark. Here, we partition a graph with 16 billion vertices and 256 billion edges over 64×32 = 2048 nodes in 2-D. We achieve a similar level of compression as previous work such as DCSR and Coarse index + Skip list, achieving nearly a 60% reduction in space. As we see later, this compression is achieved with minimal execution overhead, in contrast to the previous proposals.
TABLE I. THEORETICAL ORDER AND ACTUAL PER-NODE MEASURED MEMORY CONSUMPTION OF BITMAP-BASED CSR COMPARED TO PREVIOUS PROPOSALS. WE PARTITION A GRAPH500 GRAPH WITH 16 BILLION VERTICES AND 256 BILLION EDGES OVER 64x32 = 2048 NODES.

Data structure | CSR (Order / Actual)    | Bitmap-based CSR (Order / Actual)
Offset         | - / -                   | V̄·C/64 / 32MB
Bitmap         | - / -                   | V̄·C/64 / 32MB
row-starts     | V̄·C / 2048MB            | V̄·p / 190MB
dst            | V̄·d̂ / 1020MB            | V̄·d̂ / 1020MB
Total          | V̄·(C + d̂) / 3068MB      | V̄·(C/32 + p + d̂) / 1274MB

Data structure   | DCSR (Order / Actual)   | Coarse index + Skip list (Order / Actual)
AUX              | V̄·p / 190MB             | - / -
JC               | V̄·p / 190MB             | - / -
row-starts       | V̄·p / 190MB             | V̄·C/64 / 32MB
dst or skip list | V̄·d̂ / 1020MB            | V̄·d̂ + V̄·p / 1210MB
Total            | V̄·(3p + d̂) / 1590MB     | V̄·(C/64 + p + d̂) / 1242MB

B. Reordering of the Vertex IDs

Another associated problem with BFS is the randomness of the memory accesses to the graph data, in contrast to traditional numerical computing using CSR, such as the Conjugate Gradient method, where the accesses to the row elements of a matrix can be made contiguous. Here, we attempt to exploit similar locality properties.

The basic idea is as follows: as described in Section II.B, much of the information regarding hybrid BFS is held in bitmaps that represent the vertices, each bit corresponding to a vertex. When we execute BFS over a graph, higher-degree vertices are typically accessed more often; as such, by clustering accesses to such vertices, by reordering them according to their degrees (i.e., their number of edges), we can expect to achieve higher locality. This is similar to switching rows of a matrix in a sparse numerical algorithm to achieve higher locality. In [13], we proposed such reordering for top-down BFS, where the reordered vertex IDs are only utilized where needed, while maintaining the original BFS tree with original vertex IDs for overall efficiency. Unfortunately, this method cannot be used for hybrid BFS; instead, we propose the following algorithm.

Reordered IDs of the vertices are computed by sorting them according to their degrees on a per-node basis, and then re-assigning the new IDs according to that order. We do not conduct any inter-node reordering. The sub-adjacency matrix on each node stores the reordered IDs of the vertices. The mapping information between an original vertex ID and its reordered vertex ID is maintained by the owner node where the vertex is located. When constructing the adjacency matrix of the graph, the original vertex IDs are converted to the reordered IDs by (a) first performing all-to-all communication once over all the nodes in a row of the processor grid of the 2-D partitioning to compute the degree information of each vertex, then (b) computing the reordered IDs by sorting all the vertices according to their degrees, and then (c) performing all-to-all communication again over all the nodes in a column and a row of the processor grid in order to convert the vertex IDs in the sub-adjacency matrix on each node to the reordered IDs.

The drawback is that this scheme requires expensive all-to-all communication multiple times: since the resulting BFS tree holds the reordered IDs of the vertices, we must re-assign their original IDs. If we were to conduct such re-assignment at the very end, the information would have to be exchanged amongst all the nodes using a very expensive all-to-all communication for large machines again, since the only node that has the original ID of each vertex is the node that owns it. In fact, we show in Section V that all-to-all is a significant impediment in our benchmarks.

The solution is to add two arrays, SRC(Orig) and DST(Orig), as in Figure 9. Both arrays hold the original indices of the reordered vertices. When the algorithm writes to the resulting BFS tree, the original ID is referenced from either of the arrays instead of the reordered ID, avoiding the all-to-all communication. A favorable by-product of vertex reordering is the removal of vertices with no edges, allowing further compaction of the data structure, since such vertices will never appear in the resulting BFS tree.

Offset: 0 1 3
Bitmap: 10000011
SRC(Orig): 2 0 1
Row-starts: 0 2 3 4
DST: 2 3 0 1
DST(Orig): 4 5 3 1

Figure 9. Adding the original IDs for both the source and the destination
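A minimal sketch of the per-node part of this reordering (ignoring the all-to-all exchanges described above, and not the code used in this work): sort the local vertices by degree in descending order, assign new IDs in that order, and keep the inverse mapping, which plays the same role as SRC(Orig)/DST(Orig) in Figure 9 when original IDs have to be written to the BFS tree.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Minimal sketch: degree-based reordering of the local vertices of one node.
// degrees[v] is the degree of local vertex v. Produces new_id (old -> new)
// and orig_id (new -> old); orig_id is the lookup used instead of an
// all-to-all re-assignment at the end of the search.
void reorder_by_degree(const std::vector<int64_t>& degrees,
                       std::vector<int64_t>& new_id,
                       std::vector<int64_t>& orig_id) {
  const int64_t n = static_cast<int64_t>(degrees.size());
  orig_id.resize(n);
  std::iota(orig_id.begin(), orig_id.end(), int64_t{0});
  // Higher-degree vertices receive the smaller new IDs, clustering the
  // frequently touched bits at the front of the per-node bitmaps.
  std::stable_sort(orig_id.begin(), orig_id.end(),
                   [&](int64_t a, int64_t b) { return degrees[a] > degrees[b]; });
  new_id.assign(n, -1);
  for (int64_t k = 0; k < n; ++k) new_id[orig_id[k]] = k;
}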
C. Load Balancing the Top-Down Algorithm

We resolve the following load-balancing problem of the top-down algorithm. As seen in lines 9-11 of Figure 5, we need to create t_i,j from the edges of each vertex in the frontier; this is implemented so that each vertex pair of an edge is placed in a temporary buffer, and then copied to the communication buffer just prior to alltoallv(). Here, as we see in Figure 10, thread parallelism is utilized so that each thread gets assigned an equal number of frontier vertices. However, since the distribution of edges per vertex is quite uneven, this causes significant load imbalance among the threads.
A solution to this problem is shown in Figure 11, where we conduct the partitioning and thread assignment per destination node. We first extract the range of edges, and then copy the edges directly without copying them into a temporary buffer. In the figure, owner(v) is a function that returns the owner node of vertex v, and edge-range(A_i,j(:,u), k) returns the range in the edge list A_i,j(:,u) for a given owner node k using binary search, as the edge list is sorted in destination ID order. One caveat, however, is when a vertex has only a small number of edges; in such a case, the edge range data r_i,j,k could become large and thus inefficient. We alleviate this problem with a hybrid method, switching between the simple copy method and the range method according to the number of edges.

Function top-down-sender-naive ( A_i,j , f_i )
1. for u ∈ f_i in parallel do
2. | for v ∈ A_i,j(:, u) do
3. | | k ← owner( v )
4. | | t_i,j,k ← t_i,j,k ∪ {( u, v )}

Figure 10. Simple thread parallelism for top-down BFS

Function top-down-sender-load-balanced ( A_i,j , f_i )
1. for u ∈ f_i in parallel do
2. | for k ∈ P(i, :) do
3. | | ( v0, v1 ) ← edge-range( A_i,j(:, u) , k )
4. | | r_i,j,k ← r_i,j,k ∪ {( u, v0, v1 )}
5. for k ∈ P(i, :) in parallel do
6. | for ( u, v0, v1 ) ∈ r_i,j,k do
7. | | for v ∈ A_i,j(v0:v1, u) do
8. | | | t_i,j,k ← t_i,j,k ∪ {( u, v )}

Figure 11. Load-balanced thread parallelism for top-down BFS
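A hedged sketch of edge-range() from Figure 11: since the destination IDs of each row are stored in ascending order and each owner in P(i, :) is assumed here to own a contiguous block of destination IDs, the sub-range for owner k can be found with two binary searches. The per-vertex switch mirrors the hybrid strategy above; the threshold of 1000 is the switchover value reported in Section V.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hedged sketch of edge-range(A_{i,j}(:,u), k) from Figure 11. 'dst' holds
// the sorted destination IDs of the submatrix; [begin, end) is the row of
// vertex u. Owner k is assumed to own IDs [k*block, (k+1)*block).
inline std::pair<int64_t, int64_t>
edge_range(const std::vector<int64_t>& dst, int64_t begin, int64_t end,
           int owner_k, int64_t block) {
  const int64_t lo_id = static_cast<int64_t>(owner_k) * block;
  const int64_t hi_id = lo_id + block;
  auto first = std::lower_bound(dst.begin() + begin, dst.begin() + end, lo_id);
  auto last  = std::lower_bound(first, dst.begin() + end, hi_id);
  return {static_cast<int64_t>(first - dst.begin()),
          static_cast<int64_t>(last - dst.begin())};  // (v0, v1) of Figure 11
}

// Per-vertex switch between the two senders: short rows are copied directly
// (Figure 10 style); long rows are split per owner with edge_range
// (Figure 11 style). 1000 is the switchover threshold reported in Section V.
inline bool use_edge_range(int64_t row_length) { return row_length > 1000; }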
V. PERFORMANCE EVALUATION

We now present the results of the Graph500 benchmark using our hybrid BFS on the entire K-Computer. The Graph500 benchmark measures the performance of each machine by the TEPS (Traversed Edges Per Second) value of the BFS algorithm on synthetically generated Kronecker graphs, with parameters A=0.57, B=0.19, C=0.19, D=0.05. The size of the graph is expressed by the Scale parameter, where the number of vertices = 2^Scale and the number of edges = number of vertices × 16.
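As a quick worked example, at Scale 40 the benchmark graph has 2^40 ≈ 1.1 trillion vertices and about 1.76 × 10^13 edges; at the 38,621.4 GTEPS reported in Section V.B, a single BFS therefore takes roughly 1.76 × 10^13 / 3.86 × 10^13 ≈ 0.46 seconds, which is the "one trillion vertices within half a second" figure quoted in the abstract.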
The K Computer is located at the Riken AICS facility in Japan, with each node embodying an 8-core Fujitsu SPARC64 VIIIfx processor and 16 GB of memory. The Tofu network composes a 6-dimensional torus, with each link being a bi-directional 5 GB/s link. The total number of nodes is 82,944, embodying 663,552 CPU cores and approximately 1.3 Petabytes of memory.

A. Effectiveness of the Proposed Methods

We measure the effectiveness of the proposed methods using up to 15,360 nodes of the K-Computer. We increased the number of nodes in increments of 60, with the minimum being Scale 29 (approximately 537 million vertices and 8.59 billion edges), up to Scale 37. We picked a random vertex as the root of the BFS, and executed each benchmark 300 times. The reported value is the median of the 300 runs.

[Weak-scaling chart: GTEPS vs. # of nodes for the Bitmap-based Representation, DCSR, and Coarse index + Skip list]
Figure 12. Evaluation of the bitmap-based sparse matrix representation compared to previously proposed methods (K-Computer, weak scaling)

We first compare our bitmap-based sparse matrix representation to the previous approaches, namely DCSR [7] and Coarse index + Skip list [8]. Figure 12 shows the weak-scaling result of the execution performance in GTEPS, and Figure 13 shows the breakdown of execution time. "Reading Graph" in Figure 13 corresponds to Lines 9-11 of Figure 5 and Lines 9-16 of Figure 6. "Synchronization" is the inter-thread barrier synchronization over all computation.

[Stacked-bar chart: execution time per BFS in milliseconds, broken down into Reading Graph, Synchronization, and Other, for the Bitmap-based Representation, DCSR, and Coarse index + Skip list]
Figure 13. Performance breakdown – execution time per step (Scale 33 graph on 1008 nodes)

Figure 14 shows the effectiveness of the reordering of vertex IDs. We compare four methodological variations, namely 1) our proposed method, 2) reordering but reassigning the original IDs at the very end using alltoall(), 3) no vertex reordering, and 4) no vertex reordering but pre-eliminating the vertices with no edges. The last method, 4), was introduced to assess the effectiveness of our approach more purely with respect to locality improvement, as 1) embodies the effect of both locality improvement and zero-edge vertex elimination.
Figure 14 shows that method 2) involves significant overhead in the alltoall() communication for large systems, even trailing the non-reordered case. Method 4) shows good speedup over 3), and this is due to the fact that the Graph500 graphs generated at large scale contain many vertices with zero edges; for example, for 15,360 nodes at Scale 37, more than half the vertices have zero edges. Finally, our method 1) improves upon 4), indicating that vertex reordering has notable merit in improving locality.

[Chart: GTEPS vs. # of nodes for 1. Our Proposal, 2. Two-step, 3. No-reordering, 4. Vertex-reduction]
Figure 14. Reordering of vertex IDs and comparisons to other proposed methods

[Chart: GTEPS vs. # of nodes for Our proposal, Edge-range, All-temporary-buffer]
Figure 15. The effects of hybrid load balancing on top-down BFS

[Chart: GTEPS vs. # of nodes for Naïve, Bitmap-based Representation, Vertex Reordering, Load Balancing]
Figure 16. Cumulative effect of all the proposed optimizations

Next, we investigate the effects of load balancing in top-down BFS. Figure 15 shows the results, where "Edge-range" uses the algorithm in Figure 11, "All-temporary-buffer" uses the algorithm in Figure 10, and "Our proposal" is the hybrid of both. We set the threshold of the switchover to whether the length of the edge list in the partial adjacency matrix exceeds 1000. At some node sizes, the performance of "Edge-range" is almost identical to that of our proposed hybrid method, but the hybrid method performs best of the three methods.

Figure 16 shows the cumulative effect of all the optimizations. The naïve version uses DCSR, no vertex reordering, and the simple thread parallelism of Figure 10. By applying all the optimizations we have presented, we achieve a 3.19 times speedup over the naïve version.

[Stacked-bar chart: execution time in ms, broken down into Computation, Overlapped Communication with Computation, Communication, and Waiting for Synchronization, for 60 nodes and 15,360 nodes]
Figure 17. Breakdown of performance numbers, 60 nodes vs. 15,360 nodes

Figure 17 shows the breakdown of the time spent per BFS for 60 and 15,360 nodes, exhibiting that the slowdown is largely due to the increase in communication, despite various communication optimizations. This demonstrates that, even with an interconnect as fast as the K-Computer's, the network is still the bottleneck for large graphs, and as such, further hardware as well as algorithmic improvements are desirable for future extreme-scale graph processing.

B. Using the Entire K-Computer

By using the entire K-Computer, we were able to obtain 38,621.4 GTEPS using 82,944 nodes and 663,552 cores with a Scale 40 problem, in June 2015. This bested the previous record of 23,751 GTEPS recorded by LLNL's Sequoia BlueGene/Q supercomputer, with 98,304 nodes and 1,572,864 cores on a Scale 41 problem.

VI. RELATED WORK

As we mentioned, Yoo et al. proposed an effective method for 2-D graph partitioning for BFS in a large-scale distributed memory computing environment [4]; the base algorithm itself was a simple top-down BFS, and it was evaluated on a large-scale environment, a 32,768-node BlueGene/L.

Buluç et al. [10] conducted extensive performance studies of partitioning schemes for BFS on large-scale machines at LBNL, Hopper (6,392 nodes) and Franklin (9,660 nodes), comparing 1-D and 2-D partitioning strategies. Satish et al. [11] proposed an efficient BFS algorithm on commodity supercomputing clusters consisting of Intel CPUs and the InfiniBand network.
Checconi et al. [12] proposed an efficient parallel-distributed BFS on BlueGene using a communication method called "wave" that proceeds independently along the rows of the virtual processor grid. All the efforts above, however, use a top-down approach only as the underlying algorithm, and are fundamentally at a disadvantage for graphs such as the Graph500 Kronecker graph, whose diameter is relatively small compared to its size, as is the case for many real-world graphs.

Hybrid BFS by Beamer [2] is the seminal work that solves this problem, and it is the work on which ours is based. Efficient parallelization in a distributed memory environment on a supercomputer is much more difficult, and includes the early work by Beamer [3] and the work by Checconi [8], which uses a 1-D partitioning approach. The latter is very different from ours, not only in the partitioning being 1-D compared to our 2-D, but also in that it takes advantage of the simplicity of ingeniously replicating the vertices with a large number of edges among all the nodes, achieving very good overall load balancing. Its performance evaluation on 65,536 nodes of BlueGene/Q achieved 16,599 GTEPS, and it would be interesting to consider utilizing some of its strategies in our work.

VII. CONCLUSION

For many graphs we see in the real world, with relatively small diameters compared to their size, hybrid BFS is known to be very efficient. The problem has been that, although various algorithms have been proposed to parallelize the algorithm in a distributed-memory environment, such as the work by Beamer [3] using 2-D partitioning, the algorithms failed to scale or be efficient on modern machines with tens of thousands of nodes and million-scale cores, due to the increase in memory and communication requirements overwhelming even the best machines. Our proposed hybrid BFS algorithm overcomes such problems by a combination of various new techniques, such as the bitmap-based sparse matrix representation and the reordering of vertex IDs, as well as new methods for communication optimization and load balancing. Detailed performance evaluation on the K-Computer revealed the effectiveness of each of our approaches, with the combined effect of all of them achieving over a 3x speedup over previous approaches, and scaling effectively to the entire 82,944 nodes of the machine. The resulting performance of 38,621.4 GTEPS allowed the K-Computer to be ranked No. 1 on the Graph500 in June 2015 by a significant margin, and it has retained this rank to date as of June 2016.

ACKNOWLEDGMENT

This research was partly supported by the Japan Science and Technology Agency's CREST project titled "Development of System Software Technologies for post-Peta Scale High Performance Computing".

REFERENCES

[1] Graph500: http://www.graph500.org/
[2] Scott Beamer, Krste Asanović, and David Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12), 2012.
[3] Scott Beamer, Aydin Buluç, Krste Asanović, and David Patterson. Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW '13), 2013.
[4] Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing (SC '05), IEEE Computer Society, Washington, DC, USA, 2005.
[5] Yuichiro Yasui, Katsuki Fujisawa, and Yukinori Sato. Fast and Energy-efficient Breadth-First Search on a Single NUMA System. 29th International Conference, ISC 2014, Leipzig, Germany, June 22-26, 2014.
[6] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In Conf. on Principles and Practice of Knowledge Discovery in Databases, 2005.
[7] Aydın Buluç and John R. Gilbert. On the Representation and Multiplication of Hypersparse Matrices. Parallel and Distributed Processing Symposium 2008 (IPDPS '08), 2008.
[8] Fabio Checconi, et al. Traversing Trillions of Edges in Real-time: Graph Exploration on Large-scale Parallel Machines. Parallel and Distributed Processing Symposium 2014 (IPDPS '14), 2014.
[9] Eurípides Montagne and Anand Ekambaram. An optimal storage format for sparse matrices. Information Processing Letters, Volume 90, Issue 2, 30 April 2004, Pages 87-92.
[10] Aydin Buluç and Kamesh Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), ACM, New York, NY, USA, Article 65, 12 pages, 2011. DOI: 10.1145/2063384.2063471, http://doi.acm.org/10.1145/2063384.2063471
[11] Nadathur Satish, Changkyu Kim, Jatin Chhugani, and Pradeep Dubey. Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing. SC '12, 2012.
[12] Fabio Checconi, Fabrizio Petrini, Jeremiah Willcock, Andrew Lumsdaine, Anamitra Roy Choudhury, and Yogish Sabharwal. Breaking the speed and scalability barriers for graph exploration on distributed-memory machines. SC '12, 2012.
[13] Koji Ueno and Toyotaro Suzumura. Highly Scalable Graph Search for the Graph500 Benchmark. HPDC 2012 (The 21st International ACM Symposium on High-Performance Parallel and Distributed Computing), Delft, Netherlands, June 2012.
[14] Koji Ueno and Toyotaro Suzumura. Parallel Distributed Breadth First Search on GPU. HiPC 2013 (IEEE International Conference on High Performance Computing), India, December 2013.
[15] Toyotaro Suzumura, Koji Ueno, Hitoshi Sato, Katsuki Fujisawa, and Satoshi Matsuoka. Performance Characteristics of Graph500 on Large-Scale Distributed Environment. IEEE IISWC 2011 (IEEE International Symposium on Workload Characterization), Austin, TX, US, November 2011.
[16] Yokokawa et al. The K computer: Japanese next-generation supercomputer development project. Proceedings of the 17th IEEE/ACM International Symposium on Low-Power Electronics and Design (ISLPED '11), 2011.
[17] Ajima et al. The Tofu Interconnect. Proc. IEEE 19th Ann. Symp. High Performance Interconnects (HOTI 11), IEEE CS Press, 2011, pp. 87-94.
