
Numerical Computing

with Spark
Hossein Falaki

Challenges of numerical
computation over big data
When applying any algorithm to big data watch for
1. Correctness
2. Performance
3. Trade-off between accuracy and performance

Three Practical Examples

Point estimation (Variance)

Approximate estimation (Cardinality)

Matrix operations (PageRank)

We use these examples to demonstrate Spark internals, data flow, and challenges of implementing algorithms for Big Data.

1. Big Data Variance


The plain variance formula requires two passes over the data:

Var(X) = (1/N) · Σ_{i=1}^{N} (x_i − μ)²

First pass: compute the mean μ. Second pass: sum the squared deviations from μ.

Fast but inaccurate solution


Var(X) = E[X²] − E[X]² = (1/N) · Σ_{i=1}^{N} x_i² − ((1/N) · Σ_{i=1}^{N} x_i)²

Can be performed in a single pass, but


Subtracts two very close and large numbers!
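A quick way to see the problem (a minimal sketch, not from the talk; the sample values are illustrative):

// Sketch: the one-pass formula E[X²] − E[X]² subtracts two huge, nearly
// equal numbers, so the result is dominated by floating-point rounding error.
val xs = Array(1e9, 1e9 + 1, 1e9 + 2)                    // true variance = 2/3

val mean       = xs.sum / xs.length
val meanOfSq   = xs.map(x => x * x).sum / xs.length
val onePassVar = meanOfSq - mean * mean                  // catastrophic cancellation
val twoPassVar = xs.map(x => math.pow(x - mean, 2)).sum / xs.length

println(onePassVar)   // 0.0  -- completely wrong
println(twoPassVar)   // 0.6666666666666666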

Accumulator Pattern
An object that incrementally tracks the variance
class RunningVar {
  var variance: Double = 0.0

  // Compute initial variance for numbers
  def this(numbers: Iterator[Double]) {
    this()
    numbers.foreach(this.add(_))
  }

  // Update variance for a single value
  def add(value: Double) {
    ...
  }
}
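The add method is elided on the slide. One way to fill it in (a minimal sketch using Welford's online algorithm; the field names below are assumptions, not the talk's actual implementation):

// Sketch: track count, mean, and the sum of squared deviations (m2)
// so the variance can be updated one value at a time, in a single pass.
class RunningVar {
  var count = 0L
  var mean  = 0.0
  var m2    = 0.0                    // sum of squared deviations from the mean

  def add(value: Double): Unit = {
    count += 1
    val delta = value - mean
    mean += delta / count
    m2   += delta * (value - mean)   // uses the updated mean
  }

  def variance: Double = if (count > 0) m2 / count else 0.0
}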

Parallelize for performance

Distribute adding values in map phase

Merge partial results in reduce phase


class RunningVar {
  ...
  // Merge another RunningVar object
  // and update variance
  def merge(other: RunningVar) = {
    ...
  }
}
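Continuing the sketch from the previous slide, merge can combine two partial results with the standard parallel-variance combination formula (Chan et al.); again an assumption about the implementation, not taken from the talk:

// Sketch: a method of the RunningVar sketch above (count, mean, m2 as defined there).
// Combines two partial results so per-partition objects can be reduced together.
def merge(other: RunningVar): RunningVar = {
  val merged = new RunningVar
  merged.count = this.count + other.count
  val delta    = other.mean - this.mean
  merged.mean  = this.mean + delta * other.count / merged.count
  merged.m2    = this.m2 + other.m2 +
                 delta * delta * this.count * other.count / merged.count
  merged
}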

Computing Variance in Spark

Use the RunningVar in Spark


doubleRDD
  .mapPartitions(v => Iterator(new RunningVar(v)))
  .reduce((a, b) => a.merge(b))

Or simply use the Spark API


doubleRDD.variance()
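For example (a hypothetical spark-shell session; the values are illustrative):

// Population variance of a small RDD of doubles.
val doubleRDD = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
println(doubleRDD.variance())   // 1.25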

2. Approximate Estimations

Often an approximate estimate is good enough


especially if it can be computed faster or cheaper
1. Trade accuracy for memory
2. Trade accuracy for running time

We really like the cases where there is a bound on the error that can be controlled

Cardinality Problem
Example: Count the number of unique words in Shakespeare's works.

Using a HashSet requires ~10GB of memory

This can be much worse in many real-world applications involving large
strings, such as counting web visitors

Linear Probabilistic Counting


1. Allocate a bitmap of size m and initialize to zero
   A. Hash each value to a position in the bitmap
   B. Set the corresponding bit to 1
2. Count the number of empty bit entries: v

count ≈ −m · ln(v / m)
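A minimal single-machine sketch of such a counter (the LPCounter name matches the next slide; the hash function, bitmap sizing, and merge behavior here are assumptions):

import scala.util.hashing.MurmurHash3

// Sketch: linear probabilistic counter backed by a bitmap of m bits.
// Assumes the bitmap never fills up completely.
class LPCounter(m: Int) {
  private val bitmap = new java.util.BitSet(m)

  def add(value: String): Unit = {
    val h = MurmurHash3.stringHash(value)
    bitmap.set(((h % m) + m) % m)          // hash the value to a bit position
  }

  def merge(other: LPCounter): LPCounter = {
    bitmap.or(other.bitmap)                // union of the two bitmaps
    this
  }

  def getCardinality: Long = {
    val v = m - bitmap.cardinality()       // v = number of bits still zero
    math.round(-m * math.log(v.toDouble / m))
  }
}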

The Spark API

Use the LPCounter (linear probabilistic counter) in Spark


rdd
  .mapPartitions(v => Iterator(new LPCounter(v)))
  .reduce((a, b) => a.merge(b)).getCardinality

Or simply use the Spark API


myRDD.countApproxDistinct(0.01)
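For example (a hypothetical session; the file path and the 1% relative standard deviation are illustrative):

// Estimate the number of distinct words in a text file.
val words = sc.textFile("shakespeare.txt").flatMap(_.split("""\s+"""))
println(words.countApproxDistinct(relativeSD = 0.01))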


3. Google PageRank
Popular algorithm originally introduced by Google


PageRank Algorithm

Start each page with a rank of 1

On each iteration:
A. contrib = curRank / |neighbors|

B. curRank = 0.15 + 0.85 · Σ_{neighbors} contrib_i
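To make the update rule concrete, here is a minimal local (non-Spark) sketch over a tiny hypothetical three-page graph; the graph and iteration count are made up for illustration:

// Sketch: ten PageRank iterations over an in-memory link graph.
val links: Map[String, List[String]] = Map(
  "a" -> List("b", "c"),
  "b" -> List("c"),
  "c" -> List("a"))

var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap

for (_ <- 1 to 10) {
  // Each page sends curRank / |neighbors| to every page it links to ...
  val contribs = links.toSeq.flatMap { case (url, neighbors) =>
    neighbors.map(dest => (dest, ranks(url) / neighbors.size))
  }
  // ... and each page's new rank is 0.15 + 0.85 * (sum of its contributions).
  ranks = contribs.groupBy(_._1).map { case (url, cs) =>
    url -> (0.15 + 0.85 * cs.map(_._2).sum)
  }
}
println(ranks)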

PageRank Example

[Figure series: a four-node link graph. Every page starts with rank 1.0; each
iteration sends contributions of rank / |neighbors| along the outgoing links
and recomputes the ranks (e.g. 1.85, 1.0, 0.58, 0.58 after the first
iteration, then 1.31, 1.72, 0.39, 0.58), eventually converging to roughly
1.44, 1.37, 0.73, 0.46.]

PageRank as Matrix
Multiplication

Rank of each page is the probability of landing on that page for a random
surfer on the web

Probability of visiting all pages after k steps is

V_k = A^k · V

V: the initial rank vector

A: the link structure (sparse matrix)
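Tying this back to the damping constants on the algorithm slide, the iteration actually computed is the damped version (a standard reformulation, stated here as an aside rather than taken from the talk):

V_{k+1} = 0.15 · 1 + 0.85 · A · V_k

where column j of A holds 1 / |neighbors_j| for every page that page j links to, and 0 elsewhere.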

Data Representation in
Spark

Each page is identified by its unique URL rather than an index

Ranks vector (V): RDD[(URL, Double)]

Links matrix (A): RDD[(URL, List[URL])]
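A minimal sketch of loading these two RDDs (the edge-list file format, the path, and the use of Iterable instead of List are assumptions for illustration):

// Sketch: build the links and ranks RDDs from a hypothetical edge-list file
// with one "sourceURL destURL" pair per line.
val edges = sc.textFile("hdfs:///data/links.txt").map { line =>
  val Array(src, dest) = line.split("\\s+", 2)
  (src, dest)
}
val links = edges.groupByKey()           // RDD[(URL, Iterable[URL])]
val ranks = links.mapValues(_ => 1.0)    // RDD[(URL, Double)], every rank starts at 1.0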

Spark Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)


Matrix Multiplication

Repeatedly multiply sparse matrix and vector

Same file read over and over!

[Figure: lineage diagram. The Links (url, neighbors) and Ranks (url, rank)
RDDs are rebuilt from the input file on iteration 1, iteration 2,
iteration 3, ...]

Spark can do much better

Using cache(), keep neighbors in memory

Do not write intermediate results to disk

Grouping the same RDD over and over!

[Figure: lineage diagram. The cached Links (url, neighbors) RDD is joined
with the Ranks (url, rank) RDD on every iteration.]

Spark can do much better

Do not partition neighbors every time

[Figure: lineage diagram. The Links (url, neighbors) RDD is partitioned once
with partitionBy; every subsequent join with the Ranks (url, rank) RDD then
happens on the same node.]

Spark Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

// Partition the links by URL once and keep them in memory.
// partitionBy returns a new RDD, so keep (and join on) the result;
// NUM_PARTITIONS is a placeholder, like ITERATIONS.
val partitionedLinks =
  links.partitionBy(new HashPartitioner(NUM_PARTITIONS)).cache()

for (i <- 1 to ITERATIONS) {
  val contribs = partitionedLinks.join(ranks).flatMap {
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

Conclusions
When applying any algorithm to big data watch for
1. Correctness
2. Performance

Cache RDDs to avoid I/O

Avoid unnecessary computation

3. Trade-off between accuracy and performance



Numerical Computing
with Spark
