

CMPE 239
Web and Data Mining

MapReduce

Slides based on those of Prof. Ullman & Prof. Rajaraman (Stanford U)


Examples and figures from "MapReduce: Simplified Data Processing on Large Clusters" by J. Dean and S. Ghemawat (OSDI 2004)

Big Data

- Web applications generate huge amounts of data
  - 10s to 100s of TBs
  - Examples?
- Classical data mining (based on single-node architecture) does not always work
- Need to parallelize the mining process


Is parallel processing new?

- No! Large scientific datasets were processed in special-purpose parallel computers

So, what is new?

- Hardware and communication are much cheaper
- We can use 1000s of commodity machines, instead of very expensive dedicated parallel machines, to form the CLOUD


Commodity Clusters

- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server (why?)
- Standard architecture emerging:
  - Cluster of commodity Linux nodes
  - Gigabit ethernet interconnect
- How to organize computations on this architecture?
  - Mask issues such as hardware failure

MapReduce
THE HARDWARE & FILE SYSTEM


Cluster Architecture

[Figure: two-level architecture: each rack holds nodes (each with CPU, Mem, Disk) connected by a rack switch; rack switches connect to a backbone switch]

- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes

Stable storage

- First-order problem: if nodes can fail, how can we store data persistently?
- We should avoid restarting tasks that might take hours to complete
- Need to:
  - Enforce data redundancy
  - Break the problem into tasks; redoing one should not affect the rest


Stable storage

- Enforcing data redundancy: Distributed File System (DFS)
  - Provides global file namespace
  - Google GFS; Hadoop HDFS; Kosmos KFS (now QFS)
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common
- CAUTION: Not all applications are good candidates! Example?

Distributed File System

- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk is replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in HDFS
  - Stores metadata
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data
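
To make that division of labor concrete, here is a toy in-memory Python sketch of the read path just described. The classes and method names (get_chunk_locations, read_chunk) are hypothetical stand-ins, not a real GFS/HDFS client API:

from typing import Dict, List, Tuple

class Master:
    """Hypothetical master: serves metadata only, never file bytes."""
    def __init__(self) -> None:
        self.meta: Dict[str, List[Tuple[str, list]]] = {}  # path -> [(chunk_id, replicas)]
    def get_chunk_locations(self, path: str):
        return self.meta[path]

class ChunkServer:
    """Hypothetical chunk server: stores raw chunk bytes."""
    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}
    def read_chunk(self, chunk_id: str) -> bytes:
        return self.chunks[chunk_id]

def read_file(master: Master, path: str) -> bytes:
    data = b""
    for chunk_id, replicas in master.get_chunk_locations(path):
        data += replicas[0].read_chunk(chunk_id)  # bytes bypass the master
    return data

# Toy setup: one file split into two chunks on two servers
s1, s2 = ChunkServer(), ChunkServer()
s1.chunks["c1"], s2.chunks["c2"] = b"hello ", b"world"
m = Master()
m.meta["/f"] = [("c1", [s1]), ("c2", [s2])]
print(read_file(m, "/f"))  # b'hello world'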


Distributed Execution Overview

[Figure (from Dean & Ghemawat, OSDI 2004): the user program forks a master and worker processes; the master assigns map and reduce tasks to workers; map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1]

MapReduce
THE SOFTWARE


MapReduce

- Programming style
- Enables large-scale computations in a (hardware) fault-tolerant way
- Two steps:
  - Map
  - Reduce
  - (extension) Might be recursive

Warm up: Word Count

- We have a large file of words, one word to a line
- Count the number of times each distinct word appears in the file
- Sample applications:
  - Analyze web server logs to find popular URLs
  - Analyze articles to classify them into categories (e.g., news stories, scientific papers)
  - Build a text index


Word Count (2)

- Case 1: Entire file fits in memory
- Case 2: File too large for memory, but all <word, count> pairs fit in memory
- Case 3: File on disk, too many distinct words to fit in memory
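
For concreteness, a minimal Python sketch of Case 2: stream the file from disk so that only the <word, count> table has to fit in memory. "words.txt" is a placeholder input file with one word per line:

from collections import Counter

counts = Counter()
with open("words.txt") as f:
    for line in f:                 # file is streamed, never fully loaded
        counts[line.strip()] += 1  # only the count table lives in memory

for word, n in counts.most_common(10):
    print(word, n)

Case 3, where even the count table does not fit, is what motivates MapReduce.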

Word Count (3)

- To make it slightly harder, suppose we have a large corpus of documents
- Count the number of times each distinct word occurs in the corpus
- The above captures the essence of MapReduce
  - The great thing is that it is naturally parallelizable


MapReduce: The Map Step

[Figure: the map function is applied to each input key-value pair (k, v), producing a list of intermediate key-value pairs]

MapReduce: The Reduce Step

[Figure: intermediate key-value pairs are grouped by key into key-value groups; the reduce function is applied to each group, producing the output key-value pairs]

MapReduce

- Input: a set of key/value pairs
- User supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → (k1, v2)
- (k1, v1) is an intermediate key/value pair
- Output is the set of (k1, v2) pairs

Word Count using MapReduce

map(key, value):
  // key: document name; value: document content
  for each word w in value:
    emitIntermediate(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  emit(result)



Distributed Execution Overview

[Figure repeated from earlier: the master assigns map and reduce tasks; map workers read input splits and write intermediate results to local disk; reduce workers remote-read, sort, and write the output files]

Combiners

- Often a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k
  - E.g., popular words in Word Count
- Can save network time by pre-aggregating at the mapper
  - combine(k1, list(v1)) → (k1, v2)
  - Usually the same as the reduce function
- Works only if the reduce function is commutative and associative
  - i.e., the values can be combined in any order, with the same result
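
A minimal sketch, assuming the word-count job above: the combiner runs inside each map task, so only the pre-aggregated pairs cross the network. This is safe here because addition is commutative and associative:

from collections import defaultdict

def map_task_with_combiner(doc_name, content):
    local = defaultdict(int)
    for word in content.split():   # map would emit (word, 1) pairs...
        local[word] += 1           # ...the combiner sums them locally
    return list(local.items())     # far fewer pairs are shuffled

print(map_task_with_combiner("d1", "to be or not to be"))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]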



Exercise 1: Host size

- Suppose we have a large web corpus
- Let's look at the metadata file
  - Lines of the form (URL, size, date, ...)
- For each host, find the total number of bytes
  - i.e., the sum of the page sizes for all URLs from that host
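
One possible solution sketch (try it yourself first), in the shape of the word-count simulation: key each record by host in map and sum the sizes in reduce. The record layout and the use of urlparse are assumptions; it plugs into the map_reduce driver from the word-count sketch:

from urllib.parse import urlparse

def map_fn(record_id, record):
    url, size = record[0], record[1]     # (URL, size, date, ...)
    yield (urlparse(url).netloc, size)   # key by host

def reduce_fn(host, sizes):
    return (host, sum(sizes))            # total bytes per host

records = [(1, ("http://a.com/x", 10, "2014-01-29")),
           (2, ("http://a.com/y", 20, "2014-01-29")),
           (3, ("http://b.com/z", 5,  "2014-01-29"))]
# map_reduce(records, map_fn, reduce_fn) -> [('a.com', 30), ('b.com', 5)]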

MapReduce
EXECUTION DETAILS



File System

- Input and final output are stored on a distributed file system
  - Scheduler tries to schedule map tasks close to the physical storage location of the input data
- Intermediate results are stored on the local FS of map and reduce workers
  - Output is often the input to another MapReduce task

Partition Function

- Inputs to map tasks are created by contiguous splits of the input file
- For reduce, we need to ensure that records with the same intermediate key end up at the same worker
- System uses a default partition function
  - e.g., hash(key) mod R
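
A minimal Python illustration of that default; Python's built-in hash stands in for the system's hash function (real systems use one that is stable across machines, whereas Python's string hash is salted per process):

R = 4  # number of reduce tasks

def partition(key):
    return hash(key) % R   # deterministic within a run

for key in ["apple", "banana", "apple"]:
    print(key, "-> reducer", partition(key))
# both "apple" records print the same reducer number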



Coordination

- Master data structure
  - Tracks task state: (idle, in-progress, completed)
  - Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the location and sizes of its intermediate files
  - Master pushes this info to reducers
- Master pings workers periodically to detect failures

Failures

- What is the worst that can happen?
- Master failure
  - MapReduce task is aborted and the client is notified
- Worker failure
  - Detected by master through periodic pings
  - Handled via re-execution
    - Redo completed or in-progress map tasks (why completed tasks?)
    - Redo in-progress reduce tasks
  - Map/Reduce tasks are committed through the master



Backup tasks

- Straggler: a worker that takes unusually long to finish a task
  - Possible causes include bad disks, network issues, overloaded machines
- Near the end of the map/reduce phase, the master spawns backup copies of the remaining tasks
  - Uses workers that have already completed their tasks
  - The first copy that finishes wins

How many Map and Reduce jobs?

- M map tasks, R reduce tasks
- Rule of thumb:
  - Make M and R much larger than the number of nodes in the cluster
  - Improves dynamic load balancing and speeds up recovery from worker failure
  - Best to assign splits to nearby workers
- Usually R is smaller than M
  - Why?



MapReduce
DISCUSSION

MapReduce Advantages

- Easy to use
- General enough for expressing many practical problems
- Hides parallelization and fault-recovery details
- Scales well, beyond 1000s of machines and TBs of data



MapReduce Disadvantages

- One-input, two-phase (operator) data flow is rigid and hard to adapt
- Procedural programming model requires (often repetitive) code for even the simplest operations (e.g., projection)
- Opaque nature of the map and reduce functions impedes optimization
- Not a one-size-fits-all solution
- Need to program in Java
  - Too many lines of code even for simple tasks

MapReduce Ecosystem
IMPLEMENTATIONS &
PROGRAMMING LANGUAGES



Implementations

- Google
  - Not available outside Google
  - Uses GFS
- Hadoop
  - An open-source implementation in Java
  - Uses HDFS for stable storage
  - Download: http://lucene.apache.org/hadoop/
- Amazon EMR
  - Using Hadoop on EC2 and S3
  - http://aws.amazon.com/elasticmapreduce/
- Hyracks
  - Open-source framework for parallel data flows by UC Irvine
  - Not based on MapReduce
  - Provides more operators

Programming Languages

- Pig & Pig Latin
  - Layer on top of Hadoop
  - Pig: system
  - Pig Latin: language (hybrid between a declarative query language such as SQL and a low-level procedural language such as C++)
- Hive & HiveQL
  - Data warehouse system for Hadoop
  - Hive: system
  - HiveQL: language (fully declarative, SQL-like, supports custom MapReduce scripts)



Cloud Computing

- Ability to rent computing by the hour
  - Additional services, e.g., persistent storage
- Amazon's Elastic Compute Cloud (EC2)
  - Amazon generously provided us with an AWS Educational Grant for this class!
- Google's App Engine

MapReduce & Data Mining

- Apache Mahout: http://mahout.apache.org/
  - Scalable Machine Learning library
  - Not all implementations are parallelized, though



Exercise 2: Graph reversal

- Given a directed graph as an adjacency list:
  src1: dest11, dest12, ...
  src2: dest21, dest22, ...
- Construct the graph in which all the links are reversed
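
One possible solution sketch, again reusing the map_reduce driver from the word-count sketch: map emits each edge reversed, and the group/reduce step reassembles the adjacency lists of the reversed graph:

def map_fn(src, dests):
    for dest in dests:
        yield (dest, src)              # edge src->dest becomes dest->src

def reduce_fn(node, sources):
    return (node, sorted(sources))     # reversed adjacency list

graph = [("src1", ["a", "b"]), ("src2", ["b", "c"])]
# map_reduce(graph, map_fn, reduce_fn)
# -> [('a', ['src1']), ('b', ['src1', 'src2']), ('c', ['src2'])]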

MapReduce
READING



Reading

- Ch. 2.1, 2.2 (Ullman's book)
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
  http://research.google.com/archive/mapreduce.html
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System"
  http://research.google.com/archive/gfs.html
- "Mahout in Action", Ch. 1
  http://www.manning.com/owen/MiA_SampleCh01.pdf

To Do

- Install WEKA, R, and Mahout
- Enroll in Piazza (via Canvas)
- HW#1 later :)

