
Exploring Big Data with Hadoop

Dr. A. Bazila Banu
ASSOCIATE PROFESSOR
DEPARTMENT OF CSE
Introduction

1. Introduction: Hadoop’s history and advantages
2. Architecture in detail
3. Hadoop Installation and Configuration


What is Hadoop?
• An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.

• A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
What is Big Data?
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.
Brief History of Hadoop

Designed to answer the question: “How to process big data with reasonable cost and time?”
Hadoop’s Developers

2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.

The project was funded by Yahoo.

Doug Cutting

2006: Yahoo gave the project to the Apache Software Foundation.
Hadoop in the Wild

• Hadoop is in use at most organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc.

• Some examples of scale:

o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search

o Facebook’s Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at ~½ PB/day (Nov 2012)
Goals / Requirements:

• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models

• High scalability and availability

• Use commodity (cheap!) hardware with little redundancy

• Fault-tolerance
Hadoop Framework
• Data is processed in parallel, making it possible to carry out the entire statistical analysis over large amounts of data.
• It is a framework based on Java programming.
• It is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Hadoop’s Architecture: NameNode, DataNode
Hadoop’s Architecture: MapReduce Engine
Hadoop’s Architecture
• Distributed, with some centralization

• Main nodes of the cluster are where most of the computational power and storage of the system lies

• Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store the needed blocks as close as possible

• A central control node runs the NameNode to keep track of HDFS directories & files, and the JobTracker to dispatch compute tasks to the TaskTrackers

• Written in Java; also supports Python and Ruby


Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.

• The server holding the NameNode instance is quite crucial, as there is only one.

• Keeps a transaction log for file deletes/adds, etc. It does not use transactions for whole blocks or file streams, only metadata.

• Handles creation of additional replica blocks when necessary after a DataNode failure.
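As an illustrative aside (not part of the original slides), the per-file records the NameNode serves can be inspected through the Java FileSystem API; the path "/user" and the class name below are assumptions made for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists path, size and replication factor for entries under /user --
// exactly the kind of metadata the NameNode keeps in memory.
public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/user"))) { // "/user" is just an example path
            System.out.printf("%s  size=%d  replication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}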
Hadoop’s Architecture
DataNode:

• Stores the actual data in HDFS

• Can run on any underlying filesystem (ext3/4, NTFS, etc.)

• Notifies the NameNode of what blocks it has

• The NameNode replicates blocks 2x in the local rack, 1x elsewhere
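Illustrative sketch only (not from the original slides): the replication behaviour is driven by the dfs.replication property, which can be set cluster-wide or, as below, adjusted per file from Java; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // default replication factor for files created with this conf

        FileSystem fs = FileSystem.get(conf);
        // Ask the NameNode to keep only 2 replicas of an existing file (hypothetical path).
        fs.setReplication(new Path("/user/data/sample.txt"), (short) 2);
        fs.close();
    }
}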


Hadoop’s Architecture
MapReduce Engine:
• MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node, to the extent possible.

• A MapReduce job consists of several phases:
  Map -> Sort -> Shuffle -> Reduce

• Every MapReduce job consists of three portions (a minimal driver sketch follows below):
  I.   The driver code – code that runs on the client to configure and submit the job
  II.  The Mapper
  III. The Reducer
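A minimal sketch of the driver portion, using the newer org.apache.hadoop.mapreduce API. This code is illustrative rather than the deck's own; the class names WordCountDriver, WordCountMapper and WordCountReducer are assumed here, with the mapper and reducer sketched on the following slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // sketched on the Mapper slide
        job.setReducerClass(WordCountReducer.class);   // sketched on the Reducer slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // <input dir>
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // <output dir>

        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait for the job
    }
}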
MapReduce: The Mapper

• Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to minimize network traffic.

• Multiple Mappers run in parallel, each processing a portion of the input data.

• The Mapper reads data in the form of key/value pairs and outputs zero or more key/value pairs:

  map(in_key, in_value) -> (inter_key, inter_value) list

• For text input, the key is the byte offset into the file at which the line starts, and the value is the contents of the line itself.
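A minimal WordCount-style Mapper sketch (illustrative, not the deck's own code; the class name WordCountMapper is assumed). For each input line it emits a (word, 1) pair per token.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset at which the line starts
        // value = the text of the line itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token
        }
    }
}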
MapReduce: The Reducer

• After the Map phase is over, all the intermediate values for a given intermediate key are combined into a list. This list is given to a Reducer.

• The intermediate keys and their value lists are passed to the Reducer in sorted key order. This step is known as the shuffle and sort.

• The Reducer outputs zero or more final key/value pairs. These are written to HDFS.
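A matching Reducer sketch (again illustrative): it receives each word together with the list of 1s emitted by the Mapper, sums them, and writes the final (word, total) pair to HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();                       // add up all occurrences of this word
        }
        context.write(key, new IntWritable(sum));     // final (word, total) pair
    }
}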
Demo
• Word Count
– hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
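Note: the jar name above corresponds to Hadoop 0.20.x; in more recent releases the same example is typically packaged as hadoop-mapreduce-examples.jar. The <input dir> must already exist in HDFS, and the <output dir> must not exist before the job is submitted.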
Hadoop Framework Tools
Hadoop Projects
• The project includes these modules:
• Hadoop Common: The common utilities that support the
other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
• Hadoop YARN: A framework for job scheduling and cluster
resource management.
• Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
Hadoop Projects
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters which includes support for Hadoop HDFS
• Avro: A data serialization system.
• Cassandra: A scalable multi-master database with no single points of failure.
• Chukwa: A data collection system for managing large distributed systems.
• HBase: A scalable, distributed database that supports structured data storage
for large tables.
• Hive: A data warehouse infrastructure that provides data summarization and ad
hoc querying.
• Mahout: A scalable machine learning and data mining library.
• Pig: A high-level data-flow language and execution framework for parallel
computation.
• Spark: A fast and general compute engine for Hadoop data.
• Tez: A generalized data-flow programming framework, built on Hadoop YARN.
• ZooKeeper: A high-performance coordination service for distributed
applications.
