CMPE 239
Web and Data Mining
MapReduce
Big Data
Web
Classical
Need
1/29/14
the CLOUD
Commodity Clusters
Web
How
MapReduce
THE HARDWARE & FILE SYSTEM
Cluster Architecture
[Figure: racks of commodity nodes, each node with its own CPU, memory, and disk. A switch in each rack provides 1 Gbps between any pair of nodes in the rack; a 2-10 Gbps backbone switch connects the racks.]
Stable storage
First
We
Need
to:
Stable storage
Enforcing Data Redundancy:
Distributed File System (DFS)
usage pattern
Chunk Servers
Master node
a.k.a. NameNode in HDFS
Stores metadata
Might be replicated
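[Figure: MapReduce execution overview. The master forks the workers and assigns map and reduce tasks to them. Map workers read the input data splits (Split 0, Split 1, Split 2) and write intermediate results to local disk. Reduce workers perform a remote read, sort, and write Output File 0 and Output File 1.]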
MapReduce
THE SOFTWARE
MapReduce
Programming style
Enable large-scale computations
Two steps:
Map
Reduce
(extension) Might be recursive
[Figure: MapReduce data flow. map produces intermediate key-value pairs; group collects them by key into key-value groups; reduce turns each group into output key-value pairs.]
MapReduce
Input: a set of key/value pairs
User supplies two functions:
map(k, v) → list(k1, v1)
reduce(k1, list(v1)) → (k1, v2)
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
reduce(key, values):
// key: a word; values: an iterator over counts
    result = 0
    for each v in values:
        result += v
    emit(key, result)
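The word-count reduce function above can be tried out in a small in-memory sketch. This is not how a real cluster runs the job: `run_mapreduce` is a hypothetical single-process driver, and `map_fn` is an assumed map function (one `(word, 1)` pair per word), since the slides only show the reduce side.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    # Assumed map step: key is a document name, value is its text.
    # Emits one intermediate (word, 1) pair per word.
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    # key: a word; values: an iterator over counts
    result = 0
    for v in counts:
        result += v
    yield word, result


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical single-process driver: map, group by key, then reduce.
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    output = []
    for k1 in sorted(groups):
        output.extend(reduce_fn(k1, groups[k1]))
    return output


docs = [("d1", "the web is big"), ("d2", "the data is big data")]
word_counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(word_counts)
```

The grouping step in the driver stands in for the shuffle and sort that the framework performs between the map and reduce phases.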
Combiners
Often
Can
Works
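A minimal sketch of what a combiner does, assuming the usual definition: pre-aggregate the pairs produced by one map task before they are sent over the network. The names `map_task` and `combine` are hypothetical, and the example reuses word count, where pre-summing is safe because addition is commutative and associative.

```python
from collections import Counter


def map_task(text):
    # Map side of word count: one (word, 1) pair per word in the split.
    return [(w, 1) for w in text.split()]


def combine(pairs):
    # Combiner: sum the values per key locally, on the map worker,
    # so fewer pairs have to be shuffled to the reducers.
    local = Counter()
    for k, v in pairs:
        local[k] += v
    return sorted(local.items())


split = "big data big web big"
raw_pairs = map_task(split)          # five pairs, three of them for "big"
combined_pairs = combine(raw_pairs)  # one pair per distinct word
print(len(raw_pairs), combined_pairs)
```

The reducer still sums whatever arrives, so the final counts are the same with or without the combiner; only the amount of intermediate data changes.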
For
MapReduce
EXECUTION DETAILS
File System
Input, final
Intermediate
Partition Function
Inputs
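A sketch of a partition function, assuming the common default of hashing the intermediate key modulo R, the number of reduce tasks. The function name `partition` and the use of CRC32 are choices made for this example only; CRC32 is used because Python's built-in `hash` for strings varies between runs.

```python
import zlib


def partition(key, num_reducers):
    # Assumed default scheme: hash(key) mod R.
    # Every occurrence of a key maps to the same reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


R = 4  # number of reduce tasks (illustrative value)
keys = ["the", "web", "data", "the"]
assignments = [partition(k, R) for k in keys]
print(list(zip(keys, assignments)))
```

The property that matters is determinism: all intermediate pairs with the same key must reach the same reducer, whichever map worker produced them.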
Coordination
Master
data structure
Master
failures
Failures
What is the worst
Master failure
failure
Backup tasks
Straggler: a worker that takes unusually long to finish its task
Near
reduce tasks
Usually
Why?
R is smaller than M
MapReduce
DISCUSSION
MapReduce Advantages
Easy to use
General enough
MapReduce Disadvantages
One-input, two-phase (operator) data flow is rigid and hard to adapt
Procedural programming model requires (often repetitive) code for even the simplest operations (e.g. projection)
Opaque nature of the map and reduce functions impedes optimization
Not a one-size-fits-all solution
Need to program in Java
Too many lines of code even for simple tasks
MapReduce Ecosystem
IMPLEMENTATIONS & PROGRAMMING LANGUAGES
Implementations
Hadoop
Amazon EMR
Hyracks
Programming Languages
Pig
Hive & HiveQL
Cloud Computing
Ability
Amazon's
Google's App Engine
Mahout: http://mahout.apache.org/
Scalable
list:
src1: dest11, dest12,
src2: dest21, dest22,
Construct
MapReduce
READING
Reading
Ch. 2.1, 2.2 (Ullman's book)
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
http://research.google.com/archive/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System
http://research.google.com/archive/gfs.html
Mahout in Action, Ch. 1
http://www.manning.com/owen/MiA_SampleCh01.pdf
To Do
Install WEKA, R, and Mahout
Enroll
HW#1: later :)