
Big Data: Graph Processing

COS 418: Distributed Systems


Lecture 21

Kyle Jamieson

[Content adapted from J. Gonzalez]


[Figure: a motivating graph — a patient who presents with abdominal pain is linked by edges such as "ate", "which contains", "purchased from", "also sold to", and "diagnoses" to foods, stores, and other patients. Diagnosis? An E. coli infection.]
Big Data is Everywhere

6 billion Flickr photos, 28 million Wikipedia pages, 900 million Facebook users, and 72 hours of YouTube video uploaded every minute.
Machine learning is a reality

How will we design and implement Big Learning systems?
3
We could use ...

Threads, Locks, & Messages

Low-level parallel primitives


Shift Towards Use of Parallelism in ML

GPUs, multicore, clusters, clouds, supercomputers

Programmers repeatedly solve the same parallel design challenges:
Race conditions, distributed state, communication
Resulting code is very specialized:
Difficult to maintain, extend, debug

Idea: Avoid these problems by using high-level abstractions
5
... a better answer:

MapReduce / Hadoop

Build learning algorithms on top of high-level parallel abstractions
MapReduce Map Phase

[Figure: input records are split across CPU 1–CPU 4, each applying the map function to its own partition]

Embarrassingly parallel: independent computation, no communication needed
7
MapReduce Map Phase

[Figure: each CPU maps its input records (images) to extracted image features]
8
MapReduce Map Phase

[Figure: map output accumulates on each CPU; the computation remains embarrassingly parallel]
9
MapReduce Reduce Phase

[Figure: image features labeled indoor (I) or outdoor (O) are shuffled to CPU 1 and CPU 2, which aggregate outdoor-picture statistics and indoor-picture statistics respectively]
10
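A minimal sketch of this map/reduce pattern in Python; the feature extractor, the indoor/outdoor labels, and the record format are illustrative placeholders, not part of the lecture:

from collections import defaultdict

def map_phase(images):
    """Map: independently extract a (label, feature) pair from each image."""
    for image in images:
        label = "indoor" if image["indoor"] else "outdoor"   # placeholder label
        feature = float(image["brightness"])                 # placeholder feature
        yield label, feature

def reduce_phase(pairs):
    """Reduce: group features by label and aggregate per-label statistics."""
    groups = defaultdict(list)
    for label, feature in pairs:
        groups[label].append(feature)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

images = [{"indoor": True, "brightness": 0.3},
          {"indoor": False, "brightness": 0.8},
          {"indoor": True, "brightness": 0.5}]
print(reduce_phase(map_phase(images)))   # {'indoor': 0.4, 'outdoor': 0.8}

The map calls are independent (no communication), which is exactly why this phase parallelizes trivially across CPUs or machines.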
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): feature extraction, algorithm tuning, basic data processing

Graph-Parallel (is there more to machine learning?): lasso, label propagation, kernel methods, belief propagation, tensor factorization, PageRank, deep belief networks, neural networks
11
Exploiting Dependencies

Graphs are Everywhere:
Social networks
Collaborative filtering (Netflix: users rating movies)
Probabilistic analysis
Text analysis (Wikipedia: docs and words)
Concrete Example: Label Propagation

Label Propagation Algorithm

Social arithmetic:
  50% x what I list on my profile (50% cameras, 50% biking)
+ 40% x what Sue Ann likes (80% cameras, 20% biking)
+ 10% x what Carlos likes (30% cameras, 70% biking)
= I like: 60% cameras, 40% biking

Recurrence algorithm (iterate until convergence):

  Likes[i] = Σ_{j ∈ Friends[i]} W_ij · Likes[j]

Parallelism: compute all Likes[i] in parallel
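A minimal sketch of this recurrence in Python, assuming the friend graph is given as adjacency lists with normalized weights; the graph and convergence threshold below are illustrative:

def label_propagation(weights, likes, max_iters=100, tol=1e-6):
    """Iterate Likes[i] = sum over j in Friends[i] of W[i][j] * Likes[j].

    weights[i] : dict friend j -> weight W_ij (weights of i sum to 1)
    likes[i]   : dict interest -> probability (initial estimates)
    """
    for _ in range(max_iters):
        new_likes = {}
        for i, friends in weights.items():
            acc = {}
            for j, w in friends.items():
                for interest, p in likes[j].items():
                    acc[interest] = acc.get(interest, 0.0) + w * p
            new_likes[i] = acc
        delta = max(abs(new_likes[i][k] - likes[i].get(k, 0.0))
                    for i in new_likes for k in new_likes[i])
        likes = new_likes
        if delta < tol:
            break
    return likes

# The "Me" vertex from the slide: 50% my profile, 40% Sue Ann, 10% Carlos.
likes = {"me":      {"cameras": 0.5, "biking": 0.5},
         "sue_ann": {"cameras": 0.8, "biking": 0.2},
         "carlos":  {"cameras": 0.3, "biking": 0.7}}
weights = {"me": {"me": 0.5, "sue_ann": 0.4, "carlos": 0.1},
           "sue_ann": {"sue_ann": 1.0}, "carlos": {"carlos": 1.0}}
print(label_propagation(weights, likes)["me"])
# converges to 70% cameras / 30% biking; the slide's 60/40 is the result after one sweep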
Properties of Graph-Parallel Algorithms

Dependency graph
Factored computation ("what I like" depends on "what my friends like")
Iterative computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): feature extraction, algorithm tuning, basic data processing

Graph-Parallel (MapReduce?): lasso, label propagation, kernel methods, belief propagation, tensor factorization, PageRank, deep belief networks, neural networks
17
Problem: Data Dependencies
MapReduce doesn't efficiently express data dependencies:
User must code substantial data transformations
Costly data replication

[Figure: MapReduce assumes independent data rows]


Iterative Algorithms
MapReduce doesn't efficiently express iterative algorithms:

[Figure: in every iteration, all data partitions flow through CPU 1–CPU 3, a slow processor holds everyone up, and a barrier separates iterations]
MapAbuse: Iterative MapReduce
Only a subset of data needs computation:

[Figure: every iteration reprocesses all data partitions on CPU 1–CPU 3, even when only a few records change, with a barrier after each iteration]
MapAbuse: Iterative MapReduce
System is not optimized for iteration:

[Figure: each iteration pays a startup penalty and a disk penalty on every CPU before reaching the barrier]
ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce): feature extraction, cross validation, computing sufficient statistics

Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), graph analysis (PageRank, triangle counting)

22
Limited CPU Power
Limited Memory
Limited Scalability

23
Distributed Cloud

Scale up computational resources!

Challenges:
- Distribute state
- Keep data consistent
- Provide fault tolerance

24
The GraphLab Framework
Graph-based data representation
Update functions (user computation)
Consistency model

25
Data Graph
Data is associated with both vertices and edges

Graph:
Social Network

Vertex Data:
User profile
Current interest estimates

Edge Data:
Relationship
(friend, classmate, relative)

26
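A minimal sketch of such a data graph in Python; the field names and the dataclass layout are illustrative, not GraphLab's actual API:

from dataclasses import dataclass, field

@dataclass
class DataGraph:
    """Toy data graph: data lives on both vertices and edges."""
    vertex_data: dict = field(default_factory=dict)   # vertex id -> profile / interest estimates
    edge_data: dict = field(default_factory=dict)     # (u, v) -> relationship label
    neighbors: dict = field(default_factory=dict)     # vertex id -> set of adjacent vertex ids

    def add_edge(self, u, v, relationship):
        self.neighbors.setdefault(u, set()).add(v)
        self.neighbors.setdefault(v, set()).add(u)
        self.edge_data[(u, v)] = relationship

g = DataGraph()
g.vertex_data["alice"] = {"interests": {"cameras": 0.5, "biking": 0.5}}
g.vertex_data["bob"]   = {"interests": {"cameras": 0.8, "biking": 0.2}}
g.add_edge("alice", "bob", "friend")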
Distributed Data Graph
Partition the graph across multiple machines:

27
Distributed Data Graph
Ghost vertices maintain adjacency structure
and replicate remote data.

ghost vertices

28
Distributed Data Graph
Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / ...)

ghost vertices

29
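A minimal sketch of the partition-plus-ghosts idea, using a simple hash partitioner in place of ParMetis/Scotch; all names are illustrative:

def partition_with_ghosts(edges, num_machines):
    """Assign each vertex to a machine by hashing its id; each machine also keeps
    ghost copies of remote neighbors so its local adjacency structure is complete."""
    owner = lambda v: hash(v) % num_machines
    owned  = {m: set() for m in range(num_machines)}
    ghosts = {m: set() for m in range(num_machines)}
    for u, v in edges:
        owned[owner(u)].add(u)
        owned[owner(v)].add(v)
        if owner(u) != owner(v):         # edge crosses a machine boundary
            ghosts[owner(u)].add(v)      # machine owning u keeps a ghost of v
            ghosts[owner(v)].add(u)      # and vice versa
    return owned, ghosts

owned, ghosts = partition_with_ghosts([("a", "b"), ("b", "c"), ("c", "a")], num_machines=2)

Ghost values are read-only replicas; they must be refreshed from the owning machine whenever the true vertex data changes.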
The GraphLab Framework
Graph-based data representation
Update functions (user computation)
Consistency model

30
Update Function
A user-defined program, applied to a vertex; it transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  vertex.PageRank = a
  ForEach inPage:
    vertex.PageRank += (1 - a) * inPage.PageRank

  // Reschedule neighbors if needed
  if vertex.PageRank changes then
    reschedule_all_neighbors;
}

Update functions are applied (asynchronously) in parallel until convergence.
Many schedulers are available to prioritize computation.
Rescheduling selectively triggers computation at neighbors.
31
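A minimal single-machine sketch of this vertex-program model in Python, with a FIFO scheduler standing in for GraphLab's schedulers; the graph and names are illustrative, and unlike the slide's simplified update, each in-neighbor's rank is divided by its out-degree (the standard PageRank formulation) so the iteration converges:

from collections import deque

def pagerank_engine(out_pages, alpha=0.15, tol=1e-4):
    """Toy GraphLab-style engine: apply the update function to scheduled
    vertices, rescheduling out-neighbors only when a rank changes noticeably."""
    in_pages = {v: [] for v in out_pages}
    for u, outs in out_pages.items():
        for v in outs:
            in_pages[v].append(u)

    rank = {v: 1.0 for v in out_pages}
    schedule = deque(out_pages)              # start with every vertex scheduled
    scheduled = set(out_pages)
    while schedule:
        v = schedule.popleft()
        scheduled.discard(v)
        # Divide by out-degree (standard PageRank) so the asynchronous
        # iteration converges; the slide's pseudocode omits this detail.
        new_rank = alpha + (1 - alpha) * sum(
            rank[u] / len(out_pages[u]) for u in in_pages[v])
        changed = abs(new_rank - rank[v]) > tol
        rank[v] = new_rank
        if changed:                          # selectively trigger neighbors
            for w in out_pages[v]:
                if w not in scheduled:
                    schedule.append(w)
                    scheduled.add(w)
    return rank

print(pagerank_engine({"a": ["b", "c"], "b": ["a"], "c": ["b"]}))

Only vertices whose rank actually changed put their out-neighbors back on the schedule, which is what lets the engine focus work where it is still needed.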
Distributed Scheduling
Each machine maintains a schedule over the vertices it owns.

[Figure: two machines, each with its own queue of scheduled vertices (a–k)]

Distributed consensus is used to identify completion.
32


Ensuring Race-Free Code
How much can computation overlap?

33
The GraphLab Framework
Graph-based data representation
Update functions (user computation)
Consistency model

34
PageRank Revisited
Pagerank(scope) {
  vertex.PageRank = a
  ForEach inPage:
    vertex.PageRank += (1 - a) * inPage.PageRank
}

35
PageRank data races confound convergence

36
Racing PageRank: Bug
Pagerank(scope) {
  vertex.PageRank = a
  ForEach inPage:
    vertex.PageRank += (1 - a) * inPage.PageRank
}

Neighboring update functions can read vertex.PageRank while it is only partially accumulated, so concurrent executions race on it.

37
Racing PageRank: Bug Fix
Pagerank(scope) {
  tmp = a
  ForEach inPage:
    tmp += (1 - a) * inPage.PageRank
  vertex.PageRank = tmp
}

38
Throughput != Performance

No consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

39
Serializability
For every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel schedule of update functions on CPU 1 and CPU 2 is equivalent to some sequential schedule on a single CPU]

40
Serializability Example

Edge consistency: each update function writes its own vertex and adjacent edges and only reads adjacent vertices. Overlapping regions are only read, so update functions one vertex apart can be run in parallel.

Stronger / weaker consistency levels are available; user-tunable consistency levels trade off parallelism and consistency.

41
Distributed Consistency
Solution 1: Chromatic Engine
Edge Consistency via Graph Coloring

Solution 2: Distributed Locking


Chromatic Distributed Engine

[Figure: over time, every machine executes tasks on all vertices of color 0, then synchronizes ghosts and waits at a barrier; then executes tasks on all vertices of color 1, synchronizes ghosts, and barriers again]

43
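A minimal sketch of the chromatic engine's control loop in Python, assuming a precomputed vertex coloring and caller-supplied update, ghost-sync, and barrier callbacks (all names are illustrative):

def chromatic_engine(colors, update, ghost_sync, barrier, num_sweeps=10):
    """Run update functions color by color: vertices sharing a color are never
    adjacent, so all of them can be updated in parallel without locking."""
    by_color = {}
    for v, c in colors.items():
        by_color.setdefault(c, []).append(v)

    for _ in range(num_sweeps):
        for c in sorted(by_color):
            for v in by_color[c]:        # in the real engine these run in parallel
                update(v)
            ghost_sync()                 # push fresh values to ghost copies
            barrier()                    # wait for every machine to finish this color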
Matrix Factorization
Netflix Collaborative Filtering
Alternating Least Squares (ALS) matrix factorization
Model: 0.5 million nodes, 99 million edges

[Figure: the sparse Netflix users × movies ratings matrix is modeled as a bipartite user–movie graph and factored into d-dimensional user and movie factor matrices]

44
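A minimal NumPy sketch of ALS under the usual regularized least-squares formulation; the matrix sizes, regularization constant, and iteration count are illustrative, not the settings used in the lecture's experiments:

import numpy as np

def als(R, mask, d=20, lam=0.1, iters=10):
    """Alternating least squares: fix the movie factors and solve a small
    least-squares problem per user, then fix the user factors and solve per movie.
    R    : users x movies ratings matrix (values where mask == 0 are ignored)
    mask : 1 where a rating is observed, 0 elsewhere
    """
    n_users, n_movies = R.shape
    U = np.random.rand(n_users, d)
    M = np.random.rand(n_movies, d)
    for _ in range(iters):
        for u in range(n_users):                      # each user's solve is independent
            obs = mask[u] == 1
            A = M[obs].T @ M[obs] + lam * np.eye(d)
            U[u] = np.linalg.solve(A, M[obs].T @ R[u, obs])
        for m in range(n_movies):                     # as is each movie's
            obs = mask[:, m] == 1
            A = U[obs].T @ U[obs] + lam * np.eye(d)
            M[m] = np.linalg.solve(A, U[obs].T @ R[obs, m])
    return U, M

Each per-user (or per-movie) solve reads only that vertex's neighbors in the bipartite graph, which is what makes ALS a natural graph-parallel vertex program.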
Netflix Collaborative Filtering

[Figure, left: speedup relative to 4 machines vs. number of machines (4–64) for d = 5 (1.0M cycles), d = 20 (2.1M cycles), d = 50 (7.7M cycles), and d = 100 (30M cycles); larger d gets closer to ideal speedup. Right: runtime in seconds (log scale) vs. number of machines for d = 20, comparing Hadoop, MPI, and GraphLab; GraphLab runs far faster than Hadoop and is competitive with MPI.]

45
Distributed Consistency
Solution 1: Chromatic Engine
Edge consistency via graph coloring
Requires a graph coloring to be available
Frequent barriers are inefficient when only some vertices are active

Solution 2: Distributed Locking


Distributed Locking
Edge consistency can be guaranteed through locking: each vertex carries a reader–writer (RW) lock.

47
Consistency Through Locking
Acquire a write-lock on the center vertex and read-locks on adjacent vertices.

Performance problem: acquiring a lock from a neighboring machine incurs a latency penalty.
48
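A minimal sketch of edge-consistent locking in Python; the RWLock class is a simplified stand-in, and locks are acquired in a global (sorted) vertex order so that concurrent scope acquisitions cannot deadlock:

import threading

class RWLock:
    """Minimal reader-writer lock: many readers, one writer."""
    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()
        self._writer = threading.Lock()

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()    # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()    # last reader admits writers again

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

def lock_scope(v, neighbors, locks):
    """Write-lock the center vertex, read-lock its neighbors, in sorted order."""
    plan = sorted([(v, "write")] + [(u, "read") for u in neighbors])
    for u, mode in plan:
        if mode == "write":
            locks[u].acquire_write()
        else:
            locks[u].acquire_read()
    return plan

def unlock_scope(plan, locks):
    for u, mode in reversed(plan):
        if mode == "write":
            locks[u].release_write()
        else:
            locks[u].release_read()

An update function would call lock_scope, run, then unlock_scope; because every machine sorts its lock requests the same way, two overlapping scopes can never wait on each other in a cycle.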
Simple Locking

[Figure: timeline — "lock scope 1" is sent, the remote machine processes request 1, "scope 1 acquired" returns, update_function 1 runs, scope 1 is released, and the remote machine processes the release. The update sits idle for a full network round trip before it can start.]

49
Pipelining Hides Latency
GraphLab idea: hide latency using pipelining.

[Figure: timeline — lock requests for scopes 1, 2, and 3 are issued back to back; while the remote machine is still processing later requests, update_function 1 runs as soon as scope 1 is acquired, then update_function 2 on scope 2, with releases overlapping the remaining acquisitions.]

50
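A minimal sketch of the pipelining idea using Python threads; the simulated latency, the window size, and the function names are illustrative stand-ins for remote lock requests:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def acquire_scope(scope_id):
    """Stand-in for a remote lock acquisition with network latency."""
    time.sleep(0.05)                              # simulated round-trip delay
    return scope_id

def pipelined_engine(scopes, update_function, window=3):
    """Keep up to `window` lock requests in flight; run the update for each
    scope as soon as its acquisition completes, hiding per-request latency."""
    with ThreadPoolExecutor(max_workers=window) as pool:
        in_flight = {pool.submit(acquire_scope, s) for s in scopes[:window]}
        next_idx = window
        while in_flight:
            done = next(as_completed(in_flight))  # first acquisition to finish
            in_flight.remove(done)
            update_function(done.result())        # scope acquired: run the update
            if next_idx < len(scopes):            # keep the pipeline full
                in_flight.add(pool.submit(acquire_scope, scopes[next_idx]))
                next_idx += 1

pipelined_engine(list(range(6)), lambda s: print("updated scope", s))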
Distributed Consistency
Solution 1: Chromatic Engine
Edge consistency via graph coloring
Requires a graph coloring to be available
Frequent barriers are inefficient when only some vertices are active

Solution 2: Distributed Locking
Residual belief propagation on a 190K-vertex / 560K-edge graph, 4 machines:
no pipelining: 472 sec; with pipelining: 10 sec
How to Handle Machine Failure?
What happens when machines fail? How do we provide fault tolerance?

Strawman scheme: synchronous snapshot checkpointing
1. Stop the world
2. Write each machine's state to disk
Snapshot Performance

[Figure: vertices updated (x 10^8) vs. time elapsed (s), with one slow machine; the no-snapshot run climbs steadily, while the synchronous snapshot stalls every machine until the snapshot, and the slow machine, complete.]

How can we do better, leveraging GraphLab's consistency mechanisms?

53
Chandy-Lamport Checkpointing
Step 1. Atomically, one initiator: (a) turns red, (b) records its own state, (c) sends a marker to its neighbors.

Step 2. On receiving a marker, a non-red node atomically: (a) turns red, (b) records its own state, (c) sends markers along all outgoing channels.

Assumes first-in, first-out channels between nodes.

Implemented within GraphLab as an update function.
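A minimal sketch of these two steps written as a GraphLab-style vertex program in Python; the vertex fields and the send/schedule helpers are illustrative placeholders, not GraphLab's actual API:

def snapshot_update(vertex):
    """Chandy-Lamport snapshot step expressed as a vertex update function.
    `vertex` is assumed to expose: .red, .data, .saved_state, .neighbors,
    plus send(dst, msg) and schedule(dst) helpers (illustrative names)."""
    if vertex.red:
        return                                    # already part of this snapshot
    vertex.red = True                             # (a) turn red
    vertex.saved_state = dict(vertex.data)        # (b) record local state
    for n in vertex.neighbors:                    # (c) propagate markers
        vertex.send(n, "MARKER")                  # FIFO channel to each neighbor
        vertex.schedule(n)                        # trigger the neighbor's update

The snapshot runs concurrently with regular update functions, so no machine has to stop the world.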
Async. Snapshot Performance

[Figure: vertices updated (x 10^8) vs. time elapsed (s), with one slow machine; curves for no snapshot, asynchronous snapshot, and synchronous snapshot.]

No system performance penalty is incurred from the slow machine!

55
Summary
Two different methods of achieving consistency:
Graph coloring
Distributed locking with pipelining
Efficient implementations
Asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots

Performance, efficiency, scalability, usability

56
Friday Precept:
Roofnet performance
More Graph Processing

Monday topic:
Streaming Data Processing

57
