
Big Data: Graph Processing

COS 418: Distributed Systems


Lecture 21

Kyle Jamieson

[Content adapted from J. Gonzalez]


[Figure: a motivating graph — a patient who presents with abdominal pain is linked by edges such as "ate", "which contains", "purchased from", "also sold to", and "diagnoses" to foods, stores, and other patients. Diagnosis? An E. coli infection.]
Big Data is Everywhere

6 billion Flickr photos, 28 million Wikipedia pages, 900 million Facebook users, and 72 hours of YouTube video uploaded every minute.
Machine learning is a reality

How will we design and implement Big Learning systems?
3
We could use ...

Threads, Locks, & Messages

Low-level parallel primitives


Shift Towards Use of Parallelism in ML

GPUs, multicore, clusters, clouds, supercomputers

Programmers repeatedly solve the same parallel design challenges:
Race conditions, distributed state, communication
Resulting code is very specialized:
Difficult to maintain, extend, debug

Idea: Avoid these problems by using high-level abstractions
5
... a better answer:

MapReduce / Hadoop

Build learning algorithms on top of high-level parallel abstractions
MapReduce Map Phase

[Figure: input records are split across CPU 1–CPU 4, each applying the map function to its own partition]

Embarrassingly parallel: independent computation, no communication needed
7
MapReduce Map Phase

[Figure: each CPU maps its input records (images) to extracted image features]
8
MapReduce Map Phase

[Figure: map output accumulates on each CPU; the computation remains embarrassingly parallel]
9
MapReduce Reduce Phase

[Figure: image features labeled indoor (I) or outdoor (O) are shuffled to CPU 1 and CPU 2, which aggregate outdoor-picture statistics and indoor-picture statistics respectively]
10
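A minimal sketch of this map/reduce pattern in Python; the feature extractor, the indoor/outdoor labels, and the record format are illustrative placeholders, not part of the lecture:

from collections import defaultdict

def map_phase(images):
    """Map: independently extract a (label, feature) pair from each image."""
    for image in images:
        label = "indoor" if image["indoor"] else "outdoor"   # placeholder label
        feature = float(image["brightness"])                 # placeholder feature
        yield label, feature

def reduce_phase(pairs):
    """Reduce: group features by label and aggregate per-label statistics."""
    groups = defaultdict(list)
    for label, feature in pairs:
        groups[label].append(feature)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

images = [{"indoor": True, "brightness": 0.3},
          {"indoor": False, "brightness": 0.8},
          {"indoor": True, "brightness": 0.5}]
print(reduce_phase(map_phase(images)))   # {'indoor': 0.4, 'outdoor': 0.8}

The map calls are independent (no communication), which is exactly why this phase parallelizes trivially across CPUs or machines.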
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): feature extraction, algorithm tuning, basic data processing

Graph-Parallel (is there more to machine learning?): lasso, label propagation, kernel methods, belief propagation, tensor factorization, PageRank, deep belief networks, neural networks
11
Exploiting Dependencies

Graphs are Everywhere:
Social networks
Collaborative filtering (Netflix: users rating movies)
Probabilistic analysis
Text analysis (Wikipedia: docs and words)
Concrete Example: Label Propagation

Label Propagation Algorithm

Social arithmetic:
  50% x what I list on my profile (50% cameras, 50% biking)
+ 40% x what Sue Ann likes (80% cameras, 20% biking)
+ 10% x what Carlos likes (30% cameras, 70% biking)
= I like: 60% cameras, 40% biking

Recurrence algorithm (iterate until convergence):

  Likes[i] = Σ_{j ∈ Friends[i]} W_ij · Likes[j]

Parallelism: compute all Likes[i] in parallel
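A minimal sketch of this recurrence in Python, assuming the friend graph is given as adjacency lists with normalized weights; the graph and convergence threshold below are illustrative:

def label_propagation(weights, likes, max_iters=100, tol=1e-6):
    """Iterate Likes[i] = sum over j in Friends[i] of W[i][j] * Likes[j].

    weights[i] : dict friend j -> weight W_ij (weights of i sum to 1)
    likes[i]   : dict interest -> probability (initial estimates)
    """
    for _ in range(max_iters):
        new_likes = {}
        for i, friends in weights.items():
            acc = {}
            for j, w in friends.items():
                for interest, p in likes[j].items():
                    acc[interest] = acc.get(interest, 0.0) + w * p
            new_likes[i] = acc
        delta = max(abs(new_likes[i][k] - likes[i].get(k, 0.0))
                    for i in new_likes for k in new_likes[i])
        likes = new_likes
        if delta < tol:
            break
    return likes

# The "Me" vertex from the slide: 50% my profile, 40% Sue Ann, 10% Carlos.
likes = {"me":      {"cameras": 0.5, "biking": 0.5},
         "sue_ann": {"cameras": 0.8, "biking": 0.2},
         "carlos":  {"cameras": 0.3, "biking": 0.7}}
weights = {"me": {"me": 0.5, "sue_ann": 0.4, "carlos": 0.1},
           "sue_ann": {"sue_ann": 1.0}, "carlos": {"carlos": 1.0}}
print(label_propagation(weights, likes)["me"])
# converges to 70% cameras / 30% biking; the slide's 60/40 is the result after one sweep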
Properties of Graph-Parallel Algorithms

Dependency graph
Factored computation ("what I like" depends on "what my friends like")
Iterative computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): feature extraction, algorithm tuning, basic data processing

Graph-Parallel (MapReduce?): lasso, label propagation, kernel methods, belief propagation, tensor factorization, PageRank, deep belief networks, neural networks
17
Problem: Data Dependencies
MapReduce doesn't efficiently express data dependencies:
User must code substantial data transformations
Costly data replication

[Figure: MapReduce assumes independent data rows]


Iterative Algorithms
MapReduce doesn't efficiently express iterative algorithms:

[Figure: in every iteration, all data partitions flow through CPU 1–CPU 3, a slow processor holds everyone up, and a barrier separates iterations]
MapAbuse: Iterative MapReduce
Only a subset of data needs computation:

[Figure: every iteration reprocesses all data partitions on CPU 1–CPU 3, even when only a few records change, with a barrier after each iteration]
MapAbuse: Iterative MapReduce
System is not optimized for iteration:

[Figure: each iteration pays a startup penalty and a disk penalty on every CPU before reaching the barrier]
ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce): feature extraction, cross validation, computing sufficient statistics

Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), graph analysis (PageRank, triangle counting)

22
Limited CPU Power
Limited Memory
Limited Scalability

23
Distributed Cloud

Scale up computational resources!

Challenges:
- Distribute state
- Keep data consistent
- Provide fault tolerance

24
The GraphLab Framework
Graph-based data representation
Update functions (user computation)
Consistency model

25
Data Graph
Data is associated with both vertices and edges

Graph:
Social Network

Vertex Data:
User profile
Current interest estimates

Edge Data:
Relationship
(friend, classmate, relative)

26
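A minimal sketch of such a data graph in Python; the field names and the dataclass layout are illustrative, not GraphLab's actual API:

from dataclasses import dataclass, field

@dataclass
class DataGraph:
    """Toy data graph: data lives on both vertices and edges."""
    vertex_data: dict = field(default_factory=dict)   # vertex id -> profile / interest estimates
    edge_data: dict = field(default_factory=dict)     # (u, v) -> relationship label
    neighbors: dict = field(default_factory=dict)     # vertex id -> set of adjacent vertex ids

    def add_edge(self, u, v, relationship):
        self.neighbors.setdefault(u, set()).add(v)
        self.neighbors.setdefault(v, set()).add(u)
        self.edge_data[(u, v)] = relationship

g = DataGraph()
g.vertex_data["alice"] = {"interests": {"cameras": 0.5, "biking": 0.5}}
g.vertex_data["bob"]   = {"interests": {"cameras": 0.8, "biking": 0.2}}
g.add_edge("alice", "bob", "friend")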
Distributed Data Graph
Partition the graph across multiple machines:

27
Distributed Data Graph
Ghost vertices maintain adjacency structure
and replicate remote data.

ghost vertices

28
Distributed Data Graph
Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / ...)

ghost vertices

29
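A minimal sketch of the partition-plus-ghosts idea, using a simple hash partitioner in place of ParMetis/Scotch; all names are illustrative:

def partition_with_ghosts(edges, num_machines):
    """Assign each vertex to a machine by hashing its id; each machine also keeps
    ghost copies of remote neighbors so its local adjacency structure is complete."""
    owner = lambda v: hash(v) % num_machines
    owned  = {m: set() for m in range(num_machines)}
    ghosts = {m: set() for m in range(num_machines)}
    for u, v in edges:
        owned[owner(u)].add(u)
        owned[owner(v)].add(v)
        if owner(u) != owner(v):         # edge crosses a machine boundary
            ghosts[owner(u)].add(v)      # machine owning u keeps a ghost of v
            ghosts[owner(v)].add(u)      # and vice versa
    return owned, ghosts

owned, ghosts = partition_with_ghosts([("a", "b"), ("b", "c"), ("c", "a")], num_machines=2)

Ghost values are read-only replicas; they must be refreshed from the owning machine whenever the true vertex data changes.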
The GraphLab Framework
Graph-based data representation
Update functions (user computation)
Consistency model

30
Update Function
A user-defined program, applied to a vertex; it transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  vertex.PageRank = a
  ForEach inPage:
    vertex.PageRank += (1 - a) * inPage.PageRank

  // Reschedule neighbors if needed
  if vertex.PageRank changes then
    reschedule_all_neighbors;
}

Update functions are applied (asynchronously) in parallel until convergence.
Many schedulers are available to prioritize computation.
Rescheduling selectively triggers computation at neighbors.
31
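A minimal single-machine sketch of this vertex-program model in Python, with a FIFO scheduler standing in for GraphLab's schedulers; the graph and names are illustrative, and unlike the slide's simplified update, each in-neighbor's rank is divided by its out-degree (the standard PageRank formulation) so the iteration converges:

from collections import deque

def pagerank_engine(out_pages, alpha=0.15, tol=1e-4):
    """Toy GraphLab-style engine: apply the update function to scheduled
    vertices, rescheduling out-neighbors only when a rank changes noticeably."""
    in_pages = {v: [] for v in out_pages}
    for u, outs in out_pages.items():
        for v in outs:
            in_pages[v].append(u)

    rank = {v: 1.0 for v in out_pages}
    schedule = deque(out_pages)              # start with every vertex scheduled
    scheduled = set(out_pages)
    while schedule:
        v = schedule.popleft()
        scheduled.discard(v)
        # Divide by out-degree (standard PageRank) so the asynchronous
        # iteration converges; the slide's pseudocode omits this detail.
        new_rank = alpha + (1 - alpha) * sum(
            rank[u] / len(out_pages[u]) for u in in_pages[v])
        changed = abs(new_rank - rank[v]) > tol
        rank[v] = new_rank
        if changed:                          # selectively trigger neighbors
            for w in out_pages[v]:
                if w not in scheduled:
                    schedule.append(w)
                    scheduled.add(w)
    return rank

print(pagerank_engine({"a": ["b", "c"], "b": ["a"], "c": ["b"]}))

Only vertices whose rank actually changed put their out-neighbors back on the schedule, which is what lets the engine focus work where it is still needed.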
Distributed Scheduling
Each machine maintains a schedule over the vertices it owns.

[Figure: two machines, each with its own queue of scheduled vertices (a–k)]

Distributed consensus is used to identify completion.
32


Ensuring Race-Free Code
How much can computation overlap?

33
The GraphLab Framework
Graph-based data representation
Update functions (user computation)
Consistency model

34
PageRank Revisited
Pagerank(scope) {
  vertex.PageRank = a
  ForEach inPage:
    vertex.PageRank += (1 - a) * inPage.PageRank
}

35
PageRank data races confound convergence

36
Racing PageRank: Bug
Pagerank(scope) {
  vertex.PageRank = a
  ForEach inPage:
    vertex.PageRank += (1 - a) * inPage.PageRank
}

Neighboring update functions can read vertex.PageRank while it is only partially accumulated, so concurrent executions race on it.

37
Racing PageRank: Bug Fix
Pagerank(scope) {
  tmp = a
  ForEach inPage:
    tmp += (1 - a) * inPage.PageRank
  vertex.PageRank = tmp
}

38
Throughput != Performance

No consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

39
Serializability
For every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel schedule of update functions on CPU 1 and CPU 2 is equivalent to some sequential schedule on a single CPU]

40
Serializability Example

Edge consistency: each update function writes its own vertex and adjacent edges and only reads adjacent vertices. Overlapping regions are only read, so update functions one vertex apart can be run in parallel.

Stronger / weaker consistency levels are available; user-tunable consistency levels trade off parallelism and consistency.

41
Distributed Consistency
Solution 1: Chromatic Engine
Edge Consistency via Graph Coloring

Solution 2: Distributed Locking


Chromatic Distributed Engine

[Figure: over time, every machine executes tasks on all vertices of color 0, then synchronizes ghosts and waits at a barrier; then executes tasks on all vertices of color 1, synchronizes ghosts, and barriers again]

43
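A minimal sketch of the chromatic engine's control loop in Python, assuming a precomputed vertex coloring and caller-supplied update, ghost-sync, and barrier callbacks (all names are illustrative):

def chromatic_engine(colors, update, ghost_sync, barrier, num_sweeps=10):
    """Run update functions color by color: vertices sharing a color are never
    adjacent, so all of them can be updated in parallel without locking."""
    by_color = {}
    for v, c in colors.items():
        by_color.setdefault(c, []).append(v)

    for _ in range(num_sweeps):
        for c in sorted(by_color):
            for v in by_color[c]:        # in the real engine these run in parallel
                update(v)
            ghost_sync()                 # push fresh values to ghost copies
            barrier()                    # wait for every machine to finish this color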
Matrix Factorization
Netflix Collaborative Filtering
Alternating Least Squares (ALS) matrix factorization
Model: 0.5 million nodes, 99 million edges

[Figure: the sparse Netflix users × movies ratings matrix is modeled as a bipartite user–movie graph and factored into d-dimensional user and movie factor matrices]

44
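A minimal NumPy sketch of ALS under the usual regularized least-squares formulation; the matrix sizes, regularization constant, and iteration count are illustrative, not the settings used in the lecture's experiments:

import numpy as np

def als(R, mask, d=20, lam=0.1, iters=10):
    """Alternating least squares: fix the movie factors and solve a small
    least-squares problem per user, then fix the user factors and solve per movie.
    R    : users x movies ratings matrix (values where mask == 0 are ignored)
    mask : 1 where a rating is observed, 0 elsewhere
    """
    n_users, n_movies = R.shape
    U = np.random.rand(n_users, d)
    M = np.random.rand(n_movies, d)
    for _ in range(iters):
        for u in range(n_users):                      # each user's solve is independent
            obs = mask[u] == 1
            A = M[obs].T @ M[obs] + lam * np.eye(d)
            U[u] = np.linalg.solve(A, M[obs].T @ R[u, obs])
        for m in range(n_movies):                     # as is each movie's
            obs = mask[:, m] == 1
            A = U[obs].T @ U[obs] + lam * np.eye(d)
            M[m] = np.linalg.solve(A, U[obs].T @ R[obs, m])
    return U, M

Each per-user (or per-movie) solve reads only that vertex's neighbors in the bipartite graph, which is what makes ALS a natural graph-parallel vertex program.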
Netflix Collaborative Filtering

[Figure, left: speedup relative to 4 machines vs. number of machines (4–64) for d = 5 (1.0M cycles), d = 20 (2.1M cycles), d = 50 (7.7M cycles), and d = 100 (30M cycles); larger d gets closer to ideal speedup. Right: runtime in seconds (log scale) vs. number of machines for d = 20, comparing Hadoop, MPI, and GraphLab; GraphLab runs far faster than Hadoop and is competitive with MPI.]

45
Distributed Consistency
Solution 1: Chromatic Engine
Edge consistency via graph coloring
Requires a graph coloring to be available
Frequent barriers are inefficient when only some vertices are active

Solution 2: Distributed Locking


Distributed Locking
Edge consistency can be guaranteed through locking: each vertex carries a reader–writer (RW) lock.

47
Consistency Through Locking
Acquire a write-lock on the center vertex and read-locks on adjacent vertices.

Performance problem: acquiring a lock from a neighboring machine incurs a latency penalty.
48
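A minimal sketch of edge-consistent locking in Python; the RWLock class is a simplified stand-in, and locks are acquired in a global (sorted) vertex order so that concurrent scope acquisitions cannot deadlock:

import threading

class RWLock:
    """Minimal reader-writer lock: many readers, one writer."""
    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()
        self._writer = threading.Lock()

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()    # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()    # last reader admits writers again

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

def lock_scope(v, neighbors, locks):
    """Write-lock the center vertex, read-lock its neighbors, in sorted order."""
    plan = sorted([(v, "write")] + [(u, "read") for u in neighbors])
    for u, mode in plan:
        if mode == "write":
            locks[u].acquire_write()
        else:
            locks[u].acquire_read()
    return plan

def unlock_scope(plan, locks):
    for u, mode in reversed(plan):
        if mode == "write":
            locks[u].release_write()
        else:
            locks[u].release_read()

An update function would call lock_scope, run, then unlock_scope; because every machine sorts its lock requests the same way, two overlapping scopes can never wait on each other in a cycle.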
Simple Locking

[Figure: timeline — "lock scope 1" is sent, the remote machine processes request 1, "scope 1 acquired" returns, update_function 1 runs, scope 1 is released, and the remote machine processes the release. The update sits idle for a full network round trip before it can start.]

49
Pipelining Hides Latency
GraphLab idea: hide latency using pipelining.

[Figure: timeline — lock requests for scopes 1, 2, and 3 are issued back to back; while the remote machine is still processing later requests, update_function 1 runs as soon as scope 1 is acquired, then update_function 2 on scope 2, with releases overlapping the remaining acquisitions.]

50
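A minimal sketch of the pipelining idea using Python threads; the simulated latency, the window size, and the function names are illustrative stand-ins for remote lock requests:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def acquire_scope(scope_id):
    """Stand-in for a remote lock acquisition with network latency."""
    time.sleep(0.05)                              # simulated round-trip delay
    return scope_id

def pipelined_engine(scopes, update_function, window=3):
    """Keep up to `window` lock requests in flight; run the update for each
    scope as soon as its acquisition completes, hiding per-request latency."""
    with ThreadPoolExecutor(max_workers=window) as pool:
        in_flight = {pool.submit(acquire_scope, s) for s in scopes[:window]}
        next_idx = window
        while in_flight:
            done = next(as_completed(in_flight))  # first acquisition to finish
            in_flight.remove(done)
            update_function(done.result())        # scope acquired: run the update
            if next_idx < len(scopes):            # keep the pipeline full
                in_flight.add(pool.submit(acquire_scope, scopes[next_idx]))
                next_idx += 1

pipelined_engine(list(range(6)), lambda s: print("updated scope", s))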
Distributed Consistency
Solution 1: Chromatic Engine
Edge consistency via graph coloring
Requires a graph coloring to be available
Frequent barriers are inefficient when only some vertices are active

Solution 2: Distributed Locking
Residual belief propagation on a 190K-vertex / 560K-edge graph, 4 machines:
no pipelining: 472 sec; with pipelining: 10 sec
How to Handle Machine Failure?
What happens when machines fail? How do we provide fault tolerance?

Strawman scheme: synchronous snapshot checkpointing
1. Stop the world
2. Write each machine's state to disk
Snapshot Performance

[Figure: vertices updated (x 10^8) vs. time elapsed (s), with one slow machine; the no-snapshot run climbs steadily, while the synchronous snapshot stalls every machine until the snapshot, and the slow machine, complete.]

How can we do better, leveraging GraphLab's consistency mechanisms?

53
Chandy-Lamport Checkpointing
Step 1. Atomically, one initiator: (a) turns red, (b) records its own state, (c) sends a marker to its neighbors.

Step 2. On receiving a marker, a non-red node atomically: (a) turns red, (b) records its own state, (c) sends markers along all outgoing channels.

Assumes first-in, first-out channels between nodes.

Implemented within GraphLab as an update function.
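A minimal sketch of these two steps written as a GraphLab-style vertex program in Python; the vertex fields and the send/schedule helpers are illustrative placeholders, not GraphLab's actual API:

def snapshot_update(vertex):
    """Chandy-Lamport snapshot step expressed as a vertex update function.
    `vertex` is assumed to expose: .red, .data, .saved_state, .neighbors,
    plus send(dst, msg) and schedule(dst) helpers (illustrative names)."""
    if vertex.red:
        return                                    # already part of this snapshot
    vertex.red = True                             # (a) turn red
    vertex.saved_state = dict(vertex.data)        # (b) record local state
    for n in vertex.neighbors:                    # (c) propagate markers
        vertex.send(n, "MARKER")                  # FIFO channel to each neighbor
        vertex.schedule(n)                        # trigger the neighbor's update

The snapshot runs concurrently with regular update functions, so no machine has to stop the world.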
Async. Snapshot Performance

[Figure: vertices updated (x 10^8) vs. time elapsed (s), with one slow machine; curves for no snapshot, asynchronous snapshot, and synchronous snapshot.]

No system performance penalty is incurred from the slow machine!

55
Summary
Two different methods of achieving consistency:
Graph coloring
Distributed locking with pipelining
Efficient implementations
Asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots

Performance, efficiency, scalability, usability

56
Friday Precept:
Roofnet performance
More Graph Processing

Monday topic:
Streaming Data Processing

57
