
Exploring Big Data with Hadoop

Dr. A. Bazila Banu
ASSOCIATE PROFESSOR
DEPARTMENT OF CSE
Introduction

1. Introduction: Hadoop’s history and advantages
2. Architecture in detail
3. Hadoop Installation and Configuration


What is Hadoop?
• An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.

• A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
What is Big Data?
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.
Brief History of Hadoop

Designed to answer the question: “How to process big data with reasonable cost and time?”
Hadoop’s Developers

2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.

The project was funded by Yahoo.

Doug Cutting

2006: Yahoo gave the project to the Apache Software Foundation.
Hadoop in the Wild

• Hadoop is in use at most organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc.

• Some examples of scale:

o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search

o Facebook’s Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at ~½ PB/day (Nov 2012)
Goals / Requirements:

• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models

• High scalability and availability

• Use commodity (cheap!) hardware with little redundancy

• Fault-tolerance
Hadoop Framework
• Data is processed in parallel, making it possible to carry out the entire statistical analysis over large amounts of data.
• It is a framework based on Java programming.
• It is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Hadoop’s Architecture: NameNode, DataNode
Hadoop’s Architecture: MapReduce Engine
Hadoop’s Architecture
• Distributed, with some centralization

• Main nodes of the cluster are where most of the computational power and storage of the system lies

• Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store the needed blocks as close as possible

• A central control node runs the NameNode to keep track of HDFS directories & files, and the JobTracker to dispatch compute tasks to the TaskTrackers

• Written in Java; also supports Python and Ruby


Hadoop’s Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.

• The server holding the NameNode instance is quite crucial, as there is only one.

• Keeps a transaction log for file deletes/adds, etc. It does not use transactions for whole blocks or file streams, only metadata.

• Handles creation of additional replica blocks when necessary after a DataNode failure.
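As an illustrative aside (not part of the original slides), the per-file records the NameNode serves can be inspected through the Java FileSystem API; the path "/user" and the class name below are assumptions made for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists path, size and replication factor for entries under /user --
// exactly the kind of metadata the NameNode keeps in memory.
public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/user"))) { // "/user" is just an example path
            System.out.printf("%s  size=%d  replication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}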
Hadoop’s Architecture
DataNode:

• Stores the actual data in HDFS

• Can run on any underlying filesystem (ext3/4, NTFS, etc.)

• Notifies the NameNode of what blocks it has

• The NameNode replicates blocks 2x in the local rack, 1x elsewhere
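Illustrative sketch only (not from the original slides): the replication behaviour is driven by the dfs.replication property, which can be set cluster-wide or, as below, adjusted per file from Java; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // default replication factor for files created with this conf

        FileSystem fs = FileSystem.get(conf);
        // Ask the NameNode to keep only 2 replicas of an existing file (hypothetical path).
        fs.setReplication(new Path("/user/data/sample.txt"), (short) 2);
        fs.close();
    }
}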


Hadoop’s Architecture
MapReduce Engine:
• MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node, to the extent possible.

• A MapReduce job consists of several phases:
  Map -> Sort -> Shuffle -> Reduce

• Every MapReduce job consists of three portions (a minimal driver sketch follows below):
  I.   The driver code – code that runs on the client to configure and submit the job
  II.  The Mapper
  III. The Reducer
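A minimal sketch of the driver portion, using the newer org.apache.hadoop.mapreduce API. This code is illustrative rather than the deck's own; the class names WordCountDriver, WordCountMapper and WordCountReducer are assumed here, with the mapper and reducer sketched on the following slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // sketched on the Mapper slide
        job.setReducerClass(WordCountReducer.class);   // sketched on the Reducer slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // <input dir>
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // <output dir>

        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait for the job
    }
}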
MapReduce: The Mapper

• Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to minimize network traffic.

• Multiple Mappers run in parallel, each processing a portion of the input data.

• The Mapper reads data in the form of key/value pairs and outputs zero or more key/value pairs:

  map(in_key, in_value) -> (inter_key, inter_value) list

• For text input, the key is the byte offset into the file at which the line starts, and the value is the contents of the line itself.
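A minimal WordCount-style Mapper sketch (illustrative, not the deck's own code; the class name WordCountMapper is assumed). For each input line it emits a (word, 1) pair per token.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset at which the line starts
        // value = the text of the line itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token
        }
    }
}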
MapReduce: The Reducer

• After the Map phase is over, all the intermediate values for a given intermediate key are combined into a list. This list is given to a Reducer.

• The intermediate keys and their value lists are passed to the Reducer in sorted key order. This step is known as the shuffle and sort.

• The Reducer outputs zero or more final key/value pairs. These are written to HDFS.
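A matching Reducer sketch (again illustrative): it receives each word together with the list of 1s emitted by the Mapper, sums them, and writes the final (word, total) pair to HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();                       // add up all occurrences of this word
        }
        context.write(key, new IntWritable(sum));     // final (word, total) pair
    }
}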
Demo
• Word Count
– hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
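Note: the jar name above corresponds to Hadoop 0.20.x; in more recent releases the same example is typically packaged as hadoop-mapreduce-examples.jar. The <input dir> must already exist in HDFS, and the <output dir> must not exist before the job is submitted.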
Hadoop Framework Tools
Hadoop Projects
• The project includes these modules:
• Hadoop Common: The common utilities that support the
other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
• Hadoop YARN: A framework for job scheduling and cluster
resource management.
• Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
Hadoop Projects
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters which includes support for Hadoop HDFS
• Avro: A data serialization system.
• Cassandra: A scalable multi-master database with no single points of failure.
• Chukwa: A data collection system for managing large distributed systems.
• HBase: A scalable, distributed database that supports structured data storage
for large tables.
• Hive: A data warehouse infrastructure that provides data summarization and ad
hoc querying.
• Mahout: A scalable machine learning and data mining library.
• Pig: A high-level data-flow language and execution framework for parallel
computation.
• Spark: A fast and general compute engine for Hadoop data.
• Tez: A generalized data-flow programming framework, built on Hadoop YARN.
• ZooKeeper: A high-performance coordination service for distributed
applications.
