
C K Pithawalla College of Engineering & Technology, Surat

Hadoop
Introduction & Setup
Prepared by Nishant M Gandhi.
Certified Network Manager by Nettech. Diploma in Cyber Security (pursuing).

What is Hadoop?

"Hadoop is a framework for running applications on large clusters built of commodity hardware." (Hadoop Wiki)
Open source, written in Java. Google's MapReduce inspired Yahoo!'s Hadoop. Now part of the Apache group.

Hadoop Architecture on DELL C Series Server

Hadoop Software Stack


Hadoop Common: the common utilities.
Hadoop Distributed File System (HDFS): a distributed file system.
Hadoop MapReduce: distributed processing on compute clusters.

Other Hadoop-related projects:
Avro: a data serialization system.
Cassandra: a scalable multi-master database.
Chukwa: a data collection system for managing large distributed systems.
HBase: a scalable, distributed database that supports structured data storage for large tables.
Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: a scalable machine learning and data mining library.
Pig: a high-level data-flow language and execution framework for parallel computation.
ZooKeeper: a coordination service for distributed applications.

Who Uses Hadoop?


The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! web search query. On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application.

Who Uses Hadoop?


The data warehouse Hadoop cluster at Facebook

Here are the statistics from the HDFS cluster at Facebook:
21 PB of storage in a single HDFS cluster
2000 machines
12 TB per machine (a few machines have 24 TB each)
1200 machines with 8 cores each + 800 machines with 16 cores each
32 GB of RAM per machine
15 map-reduce tasks per machine

That's a total of more than 21 PB of configured storage capacity, larger than the previously known Yahoo! cluster of 14 PB.

Who Uses Hadoop?


Google: creator of MapReduce. Runs Hadoop for the NSF research cluster. HDFS is inspired by Google's GFS.

Who Uses Hadoop?


Other Hadoop users:
IBM, The New York Times, Twitter, Veoh, Amazon, Apple, eBay, AOL, Hewlett-Packard, Joost

Map Reduce
Programming model developed at Google.
Sort/merge based distributed computing.
Initially intended for Google's internal search/indexing application, but now used extensively by many other organizations (e.g., Yahoo!, Amazon.com, IBM).
Functional-style programming (as in LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.) A minimal example of running a MapReduce job is sketched below.
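Hadoop ships with a WordCount example job that shows the model end to end. A minimal sketch of running it from the shell, assuming the hadoop-0.20.2 release used later in these slides; the docs/, input, and output names are illustrative:

bin/hadoop dfs -put docs input                                       # copy local docs/ into HDFS
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output     # map emits (word, 1); reduce sums the counts
bin/hadoop dfs -cat output/part-00000                                # inspect the result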

Map Reduce

Hadoop Distributed File System (HDFS)


At Google, MapReduce operations run on a special file system called the Google File System (GFS), which is highly optimized for this purpose. GFS is not open source. Doug Cutting and others at Yahoo! reverse engineered GFS and called their implementation the Hadoop Distributed File System (HDFS).

Goals of HDFS
Very large distributed file system: 10K nodes, 100 million files, 10 PB.
Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from.
Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth.
Runs in user space, on heterogeneous operating systems.
(See the commands after this list for how these goals show up in practice.)
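Two stock HDFS commands make these goals visible on a running cluster; a minimal sketch, where the /user/data path is illustrative:

bin/hadoop dfsadmin -report                              # per-datanode and aggregate configured capacity
bin/hadoop fsck /user/data -files -blocks -locations     # replicas and the nodes where each block resides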

DFShell
The HDFS shell is invoked by: bin/hadoop dfs <args>

Available commands include:
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, touchz, put, rm, rmr, setrep, stat, tail, test, text
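For instance, a few of these commands in combination (all directory and file names below are illustrative):

bin/hadoop dfs -mkdir /user/hduser/books                      # create an HDFS directory
bin/hadoop dfs -copyFromLocal book.txt /user/hduser/books     # upload a local file
bin/hadoop dfs -ls /user/hduser/books                         # list the directory
bin/hadoop dfs -setrep 3 /user/hduser/books/book.txt          # change the file's replication factor
bin/hadoop dfs -cat /user/hduser/books/book.txt               # print the file to stdout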

Hadoop Single Node Setup


Step 1:
Download Hadoop from http://hadoop.apache.org/mapreduce/releases.html

Step 2:
Untar the Hadoop file:
tar xvfz hadoop-0.20.2.tar.gz

Hadoop Single Node Setup


Step 3:
Set the path to the Java compiler by editing the JAVA_HOME parameter in hadoop/conf/hadoop-env.sh
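For example, on a Linux machine with a Sun JDK 6, the line in hadoop/conf/hadoop-env.sh would look like the following (the exact path is an assumption and depends on your installation):

export JAVA_HOME=/usr/lib/jvm/java-6-sun     # adjust to wherever your JDK lives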

Hadoop Single Node Setup


Step 4:
Create an RSA key to be used by Hadoop when sshing to localhost:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
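Before moving on, it is worth verifying that passwordless login works (a quick check, not part of the original steps):

ssh localhost     # should open a shell without asking for a password
exit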

Hadoop Single Node Setup


Step 5:
Make the following changes to the configuration files under hadoop/conf

core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOP-DATASTORE</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

Hadoop Single Node Setup


mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Hadoop Single Node Setup


Step 6:
Format the Hadoop file system. From the hadoop directory, run the following:
bin/hadoop namenode -format

Using Hadoop
1) How to start Hadoop?
cd hadoop/bin
./start-all.sh

2) How to stop Hadoop?
cd hadoop/bin
./stop-all.sh

3) How to copy a file from the local machine to HDFS?
cd hadoop
bin/hadoop dfs -put local_machine_path hdfs_path

4) How to list files in HDFS?
cd hadoop
bin/hadoop dfs -ls
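After start-all.sh, a quick way to confirm that all five daemons came up is the JDK's jps tool (a common check, not from the original slides):

jps     # expect NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker in the listing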

Thank You!
