
Distributed Computing

Varun Thacker
Linux Users Group Manipal

April 8, 2010


Outline I

1 Introduction
    LUG Manipal
    Points To Remember
2 Distributed Computing
    Distributed Computing
    Technologies to be covered
    Idea
    Data
    Why Distributed Computing is Hard
    Why Distributed Computing is Important
    Three Common Distributed Architectures
3 Distributed File System
    GFS
    What a Distributed File System Does
    Google File System Architecture
    GFS Architecture: Chunks

Outline II

    GFS Architecture: Master
    GFS: Life of a Read
    GFS: Life of a Write
    GFS: Master Failure
4 MapReduce
    MapReduce
    Do We Need It?
    Bad News!
    MapReduce
    Map Reduce Paradigm
    MapReduce Paradigm
    Working
    Under the hood: Scheduling
    Robustness

Outline III

5 Hadoop
    What is Hadoop
    Who uses Hadoop?
    Mapper
    Combiners
    Reducer
    Some Terminology
    Job Distribution
6 Contact Information
    Attribution
    Copying

Who are we?

Linux Users Group Manipal

Life, Universe and FOSS!!
Believers in knowledge sharing
The most technologically focused group in the University
LUG Manipal is a non-profit group that runs entirely on voluntary work!!
http://lugmanipal.org


Points To Remember!!!

If you have problem(s), don't hesitate to ask.
The slides are based on documentation, so the discussions are really important; the slides are for later reference!!
Please don't treat the sessions as classes (classes are boring!!).
The speaker is just like any person sitting next to you.
Documentation is really important.
Google is your friend.
If you have questions after this workshop, mail me or come to LUG Manipal's forums: http://forums.lugmanipal.org


Distributed Computing


Technologies to be covered

Distributed computing refers to the use of distributed systems to solve computational problems.
A distributed system consists of multiple computers that communicate through a network.
MapReduce is a framework that implements the idea of distributed computing.
GFS is the distributed file system on which distributed programs at Google store and process data. Its free implementation is HDFS.
Hadoop is an open-source framework, written in Java, that implements the MapReduce technology.


Idea

While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.
One-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
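
A back-of-the-envelope check of those figures (treating 1 TB as roughly 10^6 MB):

    \frac{10^{6}\ \mathrm{MB}}{100\ \mathrm{MB/s}} = 10^{4}\ \mathrm{s} \approx 2.8\ \mathrm{hours},
    \qquad
    \frac{10^{4}\ \mathrm{s}}{100\ \mathrm{drives}} = 100\ \mathrm{s} < 2\ \mathrm{minutes}.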


Data

We live in the data age. An IDC estimate put the size of the digital universe at 0.18 zettabytes in 2006.
By 2011 there will be a tenfold growth to 1.8 zettabytes.
One zettabyte is one million petabytes, or one billion terabytes.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
The Large Hadron Collider near Geneva produces about 15 petabytes of data per year.


Why Distributed Computing is Hard

Computers crash.
Network links crash.
Talking is slow (even Ethernet has about 300 microseconds of latency, during which time your 2 GHz PC can execute 600,000 cycles).
Bandwidth is finite.
Internet scale: the computers and the network are heterogeneous, untrustworthy, and subject to change at any time.


Why Distributed Computing is Important

Can be more reliable.
Can be faster.
Can be cheaper (a $30 million Cray versus 100 $1,000 PCs).


Three Common Distributed Architectures

Hope: have N computers do separate pieces of work. Speed-up < N.
  Probability of failure = 1 - (1 - p)^N ≈ Np (p = probability of an individual crash).
Replication: have N computers do the same thing. Speed-up < 1.
  Probability of failure = p^N.
Master-servant: have 1 computer hand out pieces of work to N - 1 servants, and re-hand out pieces of work if servants fail. Speed-up < N - 1. Probability of failure ≈ p.
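
A quick check of the failure-probability claims above: for independent crashes with small individual probability p, a binomial expansion gives

    1 - (1 - p)^{N} \;=\; Np - \binom{N}{2}p^{2} + \dots \;\approx\; Np \qquad (Np \ll 1),

while replication fails only if all N copies fail, with probability p^N. For example, with p = 0.001 and N = 100, splitting the work fails with probability about 1 - 0.999^{100} ≈ 0.095 ≈ Np, whereas replication fails with probability 10^{-300}.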


GFS


What a Distributed File System Does

1. Usual file system stuff: create, read, move & find files.
2. Allow distributed access to files.
3. Files are themselves stored in a distributed way.
If you just do #1 and #2, you are a network file system.
To do #3, it's a good idea to also provide fault tolerance.


GFS Architecture

(Figure: GFS architecture diagram.)

GFS Architecture: Chunks

Files are divided into 64 MB chunks (the last chunk of a file may be smaller).
Each chunk is identified by a unique 64-bit id.
Chunks are stored as regular files on local disks.
By default, each chunk is stored thrice, preferably on more than one rack.
To protect data integrity, each 64 KB block gets a 32-bit checksum that is checked on all reads.
When idle, a chunkserver scans inactive chunks for corruption.
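
To make the chunk and checksum geometry concrete, here is a small illustrative calculation (not GFS code; the class name is made up) using the 64 MB chunk size and 64 KB checksum blocks described above:

    // Illustrative only: maps a byte offset in a GFS-style file to the
    // 64 MB chunk and the 64 KB checksum block that would hold it.
    public class ChunkMath {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;    // 64 MB chunks
        static final long CHECKSUM_BLOCK = 64L * 1024;        // 64 KB checksummed blocks

        public static void main(String[] args) {
            long fileOffset = 200_000_000L;                    // the "200 millionth byte"
            long chunkIndex = fileOffset / CHUNK_SIZE;         // which chunk holds this byte
            long offsetInChunk = fileOffset % CHUNK_SIZE;      // where inside that chunk
            long checksumBlock = offsetInChunk / CHECKSUM_BLOCK; // which 64 KB block gets verified
            System.out.printf("chunk %d, offset %d, checksum block %d%n",
                    chunkIndex, offsetInChunk, checksumBlock);
        }
    }

With the 200 millionth byte as the offset (the example used in the read walkthrough below), this prints chunk 2 (zero-based).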


GFS Architecture: Master

Stores all metadata (namespace, access control).
Stores the (file → chunks) and (chunk → locations) mappings.
Clients get chunk locations for a file from the master, and then talk directly to the chunkservers for the data.
Advantage of a single master: simplicity.
Disadvantages of a single master:
  Metadata operations are bottlenecked.
  The maximum number of files is limited by the master's memory.


GFS: Life of a Read

The client program asks for 1 GB of file A starting at the 200 millionth byte.
The client GFS library asks the master for chunks 3, ..., 16387 of file A.
The master responds with all of the locations of chunks 2, ..., 20000 of file A.
The client caches all of these locations (with their cache time-outs).
The client reads chunk 2 from the closest location.
The client reads chunk 3 from the closest location.
...


GFS: Life of a Write

The client gets the locations of the chunk replicas as before.
For each chunk, the client sends the write data to the nearest replica.
This replica sends the data on to the nearest replica that has not yet received it.
When all of the replicas have received the data, it is safe for them to actually write it.
Tricky details:
  The master hands out a short-term (about 1 minute) lease for a particular replica to be the primary one.
  This primary replica assigns a serial number to each mutation so that every replica performs the mutations in the same order.


GFS: Master Failure

The master stores its state via periodic checkpoints and a mutation log.
Both are replicated.
Master election and notification are implemented using an external lock server.
The new master restores its state from the checkpoint and the log.


MapReduce


Do We Need It?

Yes: otherwise some problems are too big.
Example: 20+ billion web pages × 20 KB = 400+ terabytes.
One computer can read 30-35 MB/s from disk, so it would take about four months to read the web.
The same problem with 1000 machines: less than 3 hours.
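
Roughly where those numbers come from, assuming about 35 MB/s per disk:

    \frac{4\times10^{14}\ \mathrm{B}}{3.5\times10^{7}\ \mathrm{B/s}} \approx 1.1\times10^{7}\ \mathrm{s} \approx 4\ \mathrm{months},
    \qquad
    \frac{1.1\times10^{7}\ \mathrm{s}}{1000\ \mathrm{machines}} \approx 1.1\times10^{4}\ \mathrm{s} \approx 3\ \mathrm{hours}.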


Bad News!

Bad news I: you now have to deal with
  communication and coordination
  recovering from machine failure (all the time!)
  debugging
  optimization
  locality
Bad news II: repeat for every problem you want to solve.
Good news I and II: MapReduce and Hadoop!


MapReduce

A simple programming model that applies to many large-scale computing problems.
Hide the messy details in the MapReduce runtime library:
  automatic parallelization
  load balancing
  network and disk transfer optimization
  handling of machine failures
  robustness
Therefore we can write application-level programs and let MapReduce insulate us from many concerns.


Map Reduce Paradigm

Read a lot of data.
Map: extract something you care about from each record.
Shuffle and Sort.
Reduce: aggregate, summarize, filter, or transform.
Write the results.


MapReduce Paradigm

Basic data type: the key-value pair (k, v).
For example, key = URL, value = HTML of the web page.
The programmer specifies two primary methods:
  Map: (k, v) → <(k1, v1), (k2, v2), (k3, v3), ..., (kn, vn)>
  Reduce: (k, <v1, v2, ..., vn>) → <(k, v1), (k, v2), ..., (k, vn)>
All values v with the same key k are reduced together.
(Remember the invisible Shuffle and Sort step.)
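
As a concrete instance of those signatures, here is word count sketched in plain Java (a toy model, not the Hadoop API; the in-memory TreeMap simply stands in for the framework's shuffle and sort):

    import java.util.*;

    // Word count expressed in the (k, v) style above.
    // map:    (document name, document text) -> list of (word, 1)
    // reduce: (word, [1, 1, ...])            -> (word, total count)
    public class WordCountModel {

        static List<Map.Entry<String, Integer>> map(String docName, String text) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
            return out;
        }

        static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c;
            return new AbstractMap.SimpleEntry<>(word, sum);
        }

        public static void main(String[] args) {
            // "Shuffle and sort": group all values emitted for the same key.
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : map("doc1", "the quick fox and the lazy dog")) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                System.out.println(reduce(e.getKey(), e.getValue()));
            }
        }
    }

Running it prints each distinct word with its total count, for example the=2.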


Working

(Figure slides: diagrams of the MapReduce execution flow.)


Under the hood: Scheduling

One master, many workers.
Input data is split into M map tasks (typically 64 MB in size).
The reduce phase is partitioned into R reduce tasks (R = the number of output files).
Tasks are assigned to workers dynamically:
  The master assigns each map task to a free worker.
  It considers the locality of the data to a worker when assigning a task.
  The worker reads the task input (often from its local disk!).
  The worker produces R local files containing intermediate (k, v) pairs.
  The master assigns each reduce task to a free worker.
  The worker reads intermediate (k, v) pairs from the map workers.
  The worker sorts them and applies the user's Reduce op to produce the output.
The user may specify a Partition function: which intermediate keys go to which reducer (see the sketch below).
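
The Partition step in the last bullet is usually just a hash of the key modulo R; a minimal sketch (Hadoop's default HashPartitioner uses the same rule):

    // Minimal sketch of the default partitioning rule: an intermediate key
    // is sent to reducer (hash(key) mod R).
    public class SimplePartitioner {
        static int partitionFor(String key, int numReduceTasks) {
            // Mask off the sign bit so the result is always in [0, R).
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            int R = 4;
            for (String key : new String[] {"apple", "banana", "cherry"}) {
                System.out.println(key + " -> reducer " + partitionFor(key, R));
            }
        }
    }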

Robustness

One master, many workers.
Detect failure via periodic heartbeats.
Re-execute completed and in-progress map tasks.
Re-execute in-progress reduce tasks.
The master assigns each map task to a free worker.
Master failure:
  State is checkpointed to a replicated file system.
  A new master recovers and continues.
Very robust: Google once lost 1,600 of 1,800 machines, but the job finished fine.


Hadoop


What is Hadoop

Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license.
Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
A Map/Reduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner.
The map output is then fed as input to the reduce tasks.
The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.


Who uses Hadoop?

Adobe
AOL
Baidu - the leading Chinese-language search engine
Cloudera, Inc. - provides commercial support and professional training for Hadoop
Facebook
Google
IBM
Twitter
Yahoo!
The New York Times, Last.fm, Hulu, LinkedIn


Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs.
The Hadoop Map/Reduce framework spawns one map task for each InputSplit generated by the InputFormat.
Output pairs do not need to be of the same types as the input pairs.
Mapper implementations are passed the JobConf for the job.
The framework then calls the map method for each key/value pair.
Applications can use the Reporter to report progress.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format.
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
Users can optionally specify a combiner to perform local aggregation of the intermediate outputs.
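
As a concrete example, here is the map side of the classic word-count program written against the old org.apache.hadoop.mapred API, which uses the JobConf, OutputCollector and Reporter types mentioned above (the class name is ours):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // One map task is spawned per InputSplit; map() is called once per record
    // (one line, for text input) and emits an intermediate (word, 1) pair per token.
    public class WordCountMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);   // intermediate key/value pair
            }
        }
    }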

Combiners

When the map operation outputs its pairs, they are already available in memory.
If a combiner is used, the map key-value pairs are not immediately written to the output.
They are collected in lists, one list per key value.
When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation.
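
A toy, self-contained illustration of that buffering-and-flushing behaviour (plain Java, not the Hadoop API):

    import java.util.*;

    // Toy illustration of map-side combining: the map output buffer collects
    // values per key and is "flushed" through a local reduce before anything
    // is written out for the shuffle.
    public class CombinerSketch {
        static Map<String, List<Integer>> buffer = new HashMap<>();

        static void emit(String key, int value) {
            buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }

        static Map<String, Integer> flushWithCombiner() {
            Map<String, Integer> combined = new HashMap<>();
            for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
                int sum = 0;                     // combiner = local sum, same logic as reduce
                for (int v : e.getValue()) sum += v;
                combined.put(e.getKey(), sum);
            }
            buffer.clear();
            return combined;
        }

        public static void main(String[] args) {
            for (String w : "to be or not to be".split(" ")) emit(w, 1);
            System.out.println(flushWithCombiner());   // e.g. {not=1, to=2, or=1, be=2}
        }
    }

In Hadoop the same effect is obtained by registering a combiner class on the JobConf, very often the reducer class itself, as in the driver sketch under Job Distribution below.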


Reducer

Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reducer implementations are passed the JobConf for the job.
The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs.
The reducer has 3 primary phases:
  Shuffle: The input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
  Sort: The framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage.
  Reduce: In this phase the reduce method is called for each <key, (list of values)> pair in the grouped inputs.
The generated output is a new, smaller set of values.
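
The matching reduce side of the word-count example in the same old API (again, the class name is ours):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Called once per <key, (list of values)> pair of the grouped, sorted map
    // output; for word count it simply sums the counts seen for each word.
    public class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }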

Some Terminology

Job: a full program - an execution of a Mapper and Reducer across a data set.
Task: an execution of a Mapper or a Reducer on a slice of data.
Task Attempt: a particular instance of an attempt to execute a task on a machine.


Job Distribution

MapReduce programs are contained in a Java jar file plus an XML file containing serialized program configuration options.
Running a MapReduce job places these files into HDFS and notifies the TaskTrackers where to retrieve the relevant program code.
Data distribution: implicit in the design of MapReduce!
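
A minimal driver tying the earlier word-count sketches together (old mapred API; the jar and class names are ours, not from the slides):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Driver that packages the job configuration; running it from the jar ships
    // the jar plus the serialized configuration to the cluster as described above.
    public class WordCountJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountJob.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);           // types of the final (k, v) output
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCountMap.class);      // classes sketched on earlier slides
            conf.setCombinerClass(WordCountReduce.class); // reducer reused as the combiner
            conf.setReducerClass(WordCountReduce.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS output

            JobClient.runJob(conf);                       // submit and wait for completion
        }
    }

Submitting it with hadoop jar wordcount.jar WordCountJob <in> <out> uploads the jar and the serialized configuration, and the framework hands the map and reduce tasks out to the TaskTrackers as described above.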


Contact Information

Varun Thacker
varunthacker1989@gmail.com
http://varunthacker.wordpress.com

Linux Users Group Manipal
http://lugmanipal.org
http://forums.lugmanipal.org


Attribution

Google
Used under the Creative Commons Attribution-Share Alike 2.5 Generic license.


Copying

Creative Commons Attribution-Share Alike 2.5 India License
http://creativecommons.org/licenses/by-sa/2.5/in/
