
Lecture Notes:

Introduction to Map Reduce:


Why is Map Reduce so cool/scalable?
Consider if different stacks work at different rates
Naturally, you'll want the stacks to finish at the same time, so you'd give the slower stacks less to do
That's basically what MapReduce does
Two kinds of workers:
Those that take input data items and produce output items for the stacks
Those that take the stacks and aggregate the results to produce outputs on a per-stack basis
We call these map and reduce
Programming Model:
Modeled after Lisp
"map" applies a function to all items in a collection
map takes in some input list and applies a function to all elements
"reduce" applies a function to a set of items w/ a common key
think "fold" in OCaml, which combines a list
Example:
Count how many times each word appears in the Library of Congress
First thing to define is the map
Let "k" be the title of the book and "v" be the entire contents
of it
map(k, v)
{
S = v.split(" ")
for each (word in S)
emit(word, 1) //pushes to reduce: each "1" is added to a list being compiled for "word"
}
Next we define reduce
//NOTE HERE that key is a word and values is the list/collection of all the "1"'s grouped together
reduce(key, values)
{
emit(key, values.size) //imagine the key was "horse", values could be {1, 1, 1, 1}, so values.size tells us the # of occurrences
}
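A runnable in-memory version of this word count in Java (a sketch of the idea, not a real MapReduce job; the book contents are made up). The grouping map plays the role of the shuffle:

import java.util.*;

public class WordCount {
    public static void main(String[] args) {
        Map<String, String> books = Map.of(
                "book1", "the horse and the cart",
                "book2", "the horse ran");

        // Map phase: emit (word, 1) for every word; computeIfAbsent
        // groups the 1's by word, simulating the shuffle.
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String contents : books.values())
            for (String word : contents.split(" "))
                shuffled.computeIfAbsent(word, w -> new ArrayList<>()).add(1);

        // Reduce phase: for each word, the size of its list is the count.
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue().size());
    }
}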
What's going on under the hood?
There's a bunch of Mappers; every time one emits, the k-v pair is sent to the proper Reducer
Overall process is Input Data -> Map -> "The Shuffle" -> Reduce -> Output Data
You don't really get to touch anything in between
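How a pair finds "the proper Reducer" is usually hash partitioning (this is what Hadoop's default partitioner does; the snippet below is my simplified sketch):

public class Partitioner {
    // Deterministic: the same key always lands on the same reducer,
    // no matter which mapper emitted it. "& Integer.MAX_VALUE" clears
    // the sign bit so the result of % is never negative.
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(reducerFor("horse", 4)); // same answer on every mapper
    }
}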
Wait so why is Map Reduce so scalable?
B/c the Map and Reduce stages are both completely parallelizable (each call depends only on its own input, not on any other piece of data)
Example 2:
We want to search the Internet for everywhere that contains the word "syzygy"
First let's define the map
Key = URL, Value = webpage as a string
map(k, v)
{
String[] words = v.split(" ")
for each (w : words)
if(w == "syzygy")
emit(w, key)
}
Then we define reduce:
Key = word
Value = set of URLs
reduce(k, v)
{
for each(u : v)
emit(u, "found")
}
But here the issue is that we aren't really using reduce. We kinda just want map
KEY TAKEAWAY: YOU DON'T ALWAYS HAVE TO USE MAP-REDUCE
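In a framework this usually means running a map-only job (Hadoop, for instance, lets you set the number of reduce tasks to zero). In plain Java the whole "job" collapses to a filter; a minimal sketch with made-up page contents:

import java.util.List;
import java.util.stream.Collectors;

public class MapOnly {
    public static void main(String[] args) {
        // Hypothetical pages standing in for the (URL, contents) input
        List<String> pages = List.of("orbital syzygy observed", "nothing to see here");

        // The map-only version of Example 2: keep pages containing the word
        List<String> hits = pages.stream()
                .filter(p -> p.contains("syzygy"))
                .collect(Collectors.toList());

        System.out.println(hits); // [orbital syzygy observed]
    }
}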
Example 3:
Inverted Index
Map: key = URL, value = String
map(k, v)
{
String[] words = v.split(" ")
for each (w : words)
emit(w, k)
}
Reduce: key = String, value = set of URLs
reduce(k, v)
{
emit(k,v)
}
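A tiny worked trace (illustrative input of my own) shows why the identity-looking reduce is still useful: the shuffle does the grouping.

Input: ("url1", "apple banana"), ("url2", "banana")
Map emits: (apple, url1), (banana, url1), (banana, url2)
After the shuffle: apple -> {url1}, banana -> {url1, url2}
Reduce emits: (apple, {url1}), (banana, {url1, url2})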
Quorum Replication:

Briefly, (R+W>N) ensures that any read quorum and any write quorum will overlap in at least one node. Without this condition, it could happen that, say, out of N=5 servers one client writes to A, B, and C (W=3), and then another client reads from servers D and E (R=2). The second client wouldn't see the data that the first client has written, so this approach doesn't provide strong consistency.
Re: conflicting writes, you'll want 2*W>N, so that any two write quorums will overlap in at least one node. If two clients write concurrently, they'll at least discover the conflict, and one or both of them can then retry.
The larger your R or W, the longer the corresponding reads or writes will take, since more nodes have to be contacted, and the chances that at least one straggler is in the set will increase. So a small R is good for fast reads, and a small W is good for fast writes. If you also want strong consistency, you can't have both, so fast reads will need to be paid for with slow writes, or vice versa.
To prevent data loss, you'll want as many copies of your most recent data as possible, so you'll want a large W (in absolute terms, which requires a large N as well).
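A small sanity check of these two conditions in Java (my own sketch; the configurations are illustrative):

public class QuorumCheck {
    static void check(int n, int r, int w) {
        boolean readSeesWrite = r + w > n;        // read and write quorums overlap
        boolean conflictDetectable = 2 * w > n;   // any two write quorums overlap
        System.out.printf("N=%d R=%d W=%d  strong reads: %b  conflict detection: %b%n",
                n, r, w, readSeesWrite, conflictDetectable);
    }

    public static void main(String[] args) {
        check(3, 2, 2); // classic majority quorum: both conditions hold
        check(5, 2, 3); // R+W=5 is NOT > 5: a read can miss the latest write
        check(5, 1, 5); // fast reads paid for with slow writes; still consistent
    }
}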
