

CMPE 239
Web and Data Mining

MapReduce

Slides based on those of Prof. Ullman & Prof. Rajaraman (Stanford U)


Examples and figures from "MapReduce: Simplified Data Processing on Large Clusters" by J. Dean and S. Ghemawat (OSDI 2004)

Big Data

- Web applications generate huge amounts of data
  - 10s to 100s of TBs
  - Examples?
- Classical data mining (based on single-node architecture) does not always work
- Need to parallelize the mining process


Is parallel processing new?

- No! Large scientific datasets were processed in special-purpose parallel computers

So, what is new?

- Hardware and communication are much cheaper
- We can use 1000s of commodity machines, instead of very expensive dedicated parallel machines, to form the CLOUD


Commodity Clusters

- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server (why?)
- Standard architecture emerging:
  - Cluster of commodity Linux nodes
  - Gigabit ethernet interconnect
- How to organize computations on this architecture?
  - Mask issues such as hardware failure

MapReduce
THE HARDWARE & FILE SYSTEM


Cluster Architecture

[Figure: two-level architecture: each rack holds nodes (each with CPU, Mem, Disk) connected by a rack switch; rack switches connect to a backbone switch]

- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes

Stable storage

- First-order problem: if nodes can fail, how can we store data persistently?
- We should avoid restarting tasks that might take hours to complete
- Need to:
  - Enforce data redundancy
  - Break the problem into tasks; redoing one should not affect the rest


Stable storage

- Enforcing data redundancy: Distributed File System (DFS)
  - Provides global file namespace
  - Google GFS; Hadoop HDFS; Kosmos KFS (now QFS)
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common
- CAUTION: Not all applications are good candidates! Example?

Distributed File System

- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk is replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in HDFS
  - Stores metadata
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data
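
To make that division of labor concrete, here is a toy in-memory Python sketch of the read path just described. The classes and method names (get_chunk_locations, read_chunk) are hypothetical stand-ins, not a real GFS/HDFS client API:

from typing import Dict, List, Tuple

class Master:
    """Hypothetical master: serves metadata only, never file bytes."""
    def __init__(self) -> None:
        self.meta: Dict[str, List[Tuple[str, list]]] = {}  # path -> [(chunk_id, replicas)]
    def get_chunk_locations(self, path: str):
        return self.meta[path]

class ChunkServer:
    """Hypothetical chunk server: stores raw chunk bytes."""
    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}
    def read_chunk(self, chunk_id: str) -> bytes:
        return self.chunks[chunk_id]

def read_file(master: Master, path: str) -> bytes:
    data = b""
    for chunk_id, replicas in master.get_chunk_locations(path):
        data += replicas[0].read_chunk(chunk_id)  # bytes bypass the master
    return data

# Toy setup: one file split into two chunks on two servers
s1, s2 = ChunkServer(), ChunkServer()
s1.chunks["c1"], s2.chunks["c2"] = b"hello ", b"world"
m = Master()
m.meta["/f"] = [("c1", [s1]), ("c2", [s2])]
print(read_file(m, "/f"))  # b'hello world'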


Distributed Execution Overview

[Figure (from Dean & Ghemawat, OSDI 2004): the user program forks a master and worker processes; the master assigns map and reduce tasks to workers; map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1]

MapReduce
THE SOFTWARE


MapReduce

- Programming style
- Enables large-scale computations in a (hardware) fault-tolerant way
- Two steps:
  - Map
  - Reduce
  - (extension) Might be recursive

Warm up: Word Count

- We have a large file of words, one word to a line
- Count the number of times each distinct word appears in the file
- Sample applications:
  - Analyze web server logs to find popular URLs
  - Analyze articles to classify them into categories (e.g., news stories, scientific papers)
  - Build a text index


Word Count (2)

- Case 1: Entire file fits in memory
- Case 2: File too large for memory, but all <word, count> pairs fit in memory
- Case 3: File on disk, too many distinct words to fit in memory
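
For concreteness, a minimal Python sketch of Case 2: stream the file from disk so that only the <word, count> table has to fit in memory. "words.txt" is a placeholder input file with one word per line:

from collections import Counter

counts = Counter()
with open("words.txt") as f:
    for line in f:                 # file is streamed, never fully loaded
        counts[line.strip()] += 1  # only the count table lives in memory

for word, n in counts.most_common(10):
    print(word, n)

Case 3, where even the count table does not fit, is what motivates MapReduce.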

Word Count (3)

- To make it slightly harder, suppose we have a large corpus of documents
- Count the number of times each distinct word occurs in the corpus
- The above captures the essence of MapReduce
  - The great thing is that it is naturally parallelizable


MapReduce: The Map Step

[Figure: the map function is applied to each input key-value pair (k, v), producing a list of intermediate key-value pairs]

MapReduce: The Reduce Step

[Figure: intermediate key-value pairs are grouped by key into key-value groups; the reduce function is applied to each group, producing the output key-value pairs]

MapReduce

- Input: a set of key/value pairs
- User supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → (k1, v2)
- (k1, v1) is an intermediate key/value pair
- Output is the set of (k1, v2) pairs

Word Count using MapReduce

map(key, value):
  // key: document name; value: document content
  for each word w in value:
    emitIntermediate(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  emit(result)



Distributed Execution Overview

[Figure repeated from earlier: the master assigns map and reduce tasks; map workers read input splits and write intermediate results to local disk; reduce workers remote-read, sort, and write the output files]

Combiners

- Often a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k
  - E.g., popular words in Word Count
- Can save network time by pre-aggregating at the mapper
  - combine(k1, list(v1)) → (k1, v2)
  - Usually the same as the reduce function
- Works only if the reduce function is commutative and associative
  - i.e., the values can be combined in any order, with the same result
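
A minimal sketch, assuming the word-count job above: the combiner runs inside each map task, so only the pre-aggregated pairs cross the network. This is safe here because addition is commutative and associative:

from collections import defaultdict

def map_task_with_combiner(doc_name, content):
    local = defaultdict(int)
    for word in content.split():   # map would emit (word, 1) pairs...
        local[word] += 1           # ...the combiner sums them locally
    return list(local.items())     # far fewer pairs are shuffled

print(map_task_with_combiner("d1", "to be or not to be"))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]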



Exercise 1: Host size

- Suppose we have a large web corpus
- Let's look at the metadata file
  - Lines of the form (URL, size, date, ...)
- For each host, find the total number of bytes
  - i.e., the sum of the page sizes for all URLs from that host
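
One possible solution sketch (try it yourself first), in the shape of the word-count simulation: key each record by host in map and sum the sizes in reduce. The record layout and the use of urlparse are assumptions; it plugs into the map_reduce driver from the word-count sketch:

from urllib.parse import urlparse

def map_fn(record_id, record):
    url, size = record[0], record[1]     # (URL, size, date, ...)
    yield (urlparse(url).netloc, size)   # key by host

def reduce_fn(host, sizes):
    return (host, sum(sizes))            # total bytes per host

records = [(1, ("http://a.com/x", 10, "2014-01-29")),
           (2, ("http://a.com/y", 20, "2014-01-29")),
           (3, ("http://b.com/z", 5,  "2014-01-29"))]
# map_reduce(records, map_fn, reduce_fn) -> [('a.com', 30), ('b.com', 5)]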

MapReduce
EXECUTION DETAILS



File System

- Input and final output are stored on a distributed file system
  - Scheduler tries to schedule map tasks close to the physical storage location of the input data
- Intermediate results are stored on the local FS of map and reduce workers
  - Output is often the input to another MapReduce task

Partition Function

- Inputs to map tasks are created by contiguous splits of the input file
- For reduce, we need to ensure that records with the same intermediate key end up at the same worker
- System uses a default partition function
  - e.g., hash(key) mod R
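
A minimal Python illustration of that default; Python's built-in hash stands in for the system's hash function (real systems use one that is stable across machines, whereas Python's string hash is salted per process):

R = 4  # number of reduce tasks

def partition(key):
    return hash(key) % R   # deterministic within a run

for key in ["apple", "banana", "apple"]:
    print(key, "-> reducer", partition(key))
# both "apple" records print the same reducer number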



Coordination

- Master data structure
  - Tracks task state: (idle, in-progress, completed)
  - Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the location and sizes of its intermediate files
  - Master pushes this info to reducers
- Master pings workers periodically to detect failures

Failures

- What is the worst that can happen?
- Master failure
  - MapReduce task is aborted and the client is notified
- Worker failure
  - Detected by master through periodic pings
  - Handled via re-execution
    - Redo completed or in-progress map tasks (why completed tasks?)
    - Redo in-progress reduce tasks
  - Map/Reduce tasks are committed through the master



Backup tasks

- Straggler: a worker that takes unusually long to finish a task
  - Possible causes include bad disks, network issues, overloaded machines
- Near the end of the map/reduce phase, the master spawns backup copies of the remaining tasks
  - Uses workers that have already completed their tasks
  - The first copy that finishes wins

How many Map and Reduce jobs?

- M map tasks, R reduce tasks
- Rule of thumb:
  - Make M and R much larger than the number of nodes in the cluster
  - Improves dynamic load balancing and speeds up recovery from worker failure
  - Best to assign splits to nearby workers
- Usually R is smaller than M
  - Why?



MapReduce
DISCUSSION

MapReduce Advantages

- Easy to use
- General enough for expressing many practical problems
- Hides parallelization and fault-recovery details
- Scales well, beyond 1000s of machines and TBs of data



MapReduce Disadvantages

- One-input, two-phase (operator) data flow is rigid and hard to adapt
- Procedural programming model requires (often repetitive) code for even the simplest operations (e.g., projection)
- Opaque nature of the map and reduce functions impedes optimization
- Not a one-size-fits-all solution
- Need to program in Java
  - Too many lines of code even for simple tasks

MapReduce Ecosystem
IMPLEMENTATIONS &
PROGRAMMING LANGUAGES



Implementations

- Google
  - Not available outside Google
  - Uses GFS
- Hadoop
  - An open-source implementation in Java
  - Uses HDFS for stable storage
  - Download: http://lucene.apache.org/hadoop/
- Amazon EMR
  - Using Hadoop on EC2 and S3
  - http://aws.amazon.com/elasticmapreduce/
- Hyracks
  - Open-source framework for parallel data flows by UC Irvine
  - Not based on MapReduce
  - Provides more operators

Programming Languages

- Pig & Pig Latin
  - Layer on top of Hadoop
  - Pig: system
  - Pig Latin: language (hybrid between a declarative query language such as SQL and a low-level procedural language such as C++)
- Hive & HiveQL
  - Data warehouse system for Hadoop
  - Hive: system
  - HiveQL: language (fully declarative, SQL-like, supports custom MapReduce scripts)



Cloud Computing

- Ability to rent computing by the hour
  - Additional services, e.g., persistent storage
- Amazon's Elastic Compute Cloud (EC2)
  - Amazon generously provided us with an AWS Educational Grant for this class!
- Google's App Engine

MapReduce & Data Mining

- Apache Mahout: http://mahout.apache.org/
  - Scalable Machine Learning library
  - Not all implementations are parallelized, though



Exercise 2: Graph reversal

- Given a directed graph as an adjacency list:
  src1: dest11, dest12, ...
  src2: dest21, dest22, ...
- Construct the graph in which all the links are reversed
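
One possible solution sketch, again reusing the map_reduce driver from the word-count sketch: map emits each edge reversed, and the group/reduce step reassembles the adjacency lists of the reversed graph:

def map_fn(src, dests):
    for dest in dests:
        yield (dest, src)              # edge src->dest becomes dest->src

def reduce_fn(node, sources):
    return (node, sorted(sources))     # reversed adjacency list

graph = [("src1", ["a", "b"]), ("src2", ["b", "c"])]
# map_reduce(graph, map_fn, reduce_fn)
# -> [('a', ['src1']), ('b', ['src1', 'src2']), ('c', ['src2'])]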

MapReduce
READING



Reading

- Ch. 2.1, 2.2 (Ullman's book)
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
  http://research.google.com/archive/mapreduce.html
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System"
  http://research.google.com/archive/gfs.html
- "Mahout in Action", Ch. 1
  http://www.manning.com/owen/MiA_SampleCh01.pdf

To Do

- Install WEKA, R, and Mahout
- Enroll in Piazza (via Canvas)
- HW#1 later :)

