Академический Документы
Профессиональный Документы
Культура Документы
with Spark
Hossein Falaki
Challenges of numerical
computation over big data
When applying any algorithm to big data watch for
1. Correctness
2. Performance
3. Trade-off between accuracy and performance
1
2
Var(X) = (xi )
N i=1
First pass
4
N
N
2
Accumulator Pattern
An object that incrementally tracks the variance
Class RunningVar {
var variance: Double = 0.0
!
2. Approximate Estimations
Cardinality Problem
Example: Count number of unique words in Shakespeares
work.
10
v
count m ln
m
11
12
3. Google PageRank
Popular algorithm originally introduced by Google
13
PageRank Algorithm
PageRank Algorithm
On each iteration:
curRank
A. contrib =
| neighbors |
neighbors
14
contribi
PageRank Example
1.0
1.0
1.0
1.0
15
PageRank Example
1.0
0.5
1.0
1.0
1.0
1.0
0.5
0.5
1.0
16
PageRank Example
1.85
0.58
1.0
0.58
17
PageRank Example
1.85
0.5
0.58
1.85
0.58
1.0
0.29
0.5
0.58
18
PageRank Example
1.31
0.39
1.72
0.58
19
PageRank Example
1.44
0.46
Eventually
0.73
20
1.37
PageRank as Matrix
Multiplication
Vk = A V
k
Data Representation in
Spark
22
Spark Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs
!
23
Matrix Multiplication
Links
(url, neighbors)
Ranks
(url, rank)
iteration 1
iteration 2
24
iteration 3
Links
(url, neighbors)
Ranks
(url, rank)
join
25
join
join
Links
(url, neighbors)
Same node
Ranks
(url, rank)
join
26
join
join
Spark Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs
!
links.partitionBy(hashFunction).cache()
!
Conclusions
When applying any algorithm to big data watch for
1. Correctness
2. Performance
Numerical Computing
with Spark