Naveen Subramani
Hadoop and the Hadoop logo are registered trademarks of the Apache Software Foundation. Read the Hadoop trademark policy for details.
Ubuntu, the Ubuntu logo and Canonical are registered trademarks of Canonical Ltd. Read Canonical's trademark policy for details.
All other trademarks mentioned in the book belong to their respective owners.
This book is aimed at making it simple for a beginner to build a Hadoop cluster. It will be updated periodically based on suggestions, ideas and corrections from readers.
Mail Feedback to: books@pinlabs.in
Released under the Creative Commons Attribution-ShareAlike 4.0 International license.
Preface
About this guide
We have been working on Hadoop for quite some time. To share our knowledge of Hadoop, I wrote this guide to help people install Hadoop easily. This guide is based on a Hadoop installation on Ubuntu 14.04 LTS.
Target Audience
Our aim has been to provide a guide for beginners who are new to Hadoop implementation. Some familiarity with big data is assumed for the users of this book.
Acknowledgement
Some of the content and definitions have been borrowed from web resources such as manuals, documentation and white papers from hadoop.apache.org. We would like to thank all the authors of these resources.
License
Attribution-ShareAlike 4.0 International.
For the full version of the license text, please refer to http://creativecommons.org/licenses/by-sa/4.0/legalcode, and to http://creativecommons.org/licenses/by-sa/4.0/ for a shorter description.
Feedback
We would really appreciate your feedback. We will enhance the book on an ongoing basis based
on your feedback. Please mail your feedback to books@pinlabs.in.
Contents

1 Hadoop Ecosystem
  1.2.1 What Is HDFS?
2.1 Supported Modes  11
2.2 Pseudo-Distributed Mode  11
  2.2.2 Installation Notes  11
    2.2.2.5 Configuring core-site.xml  13
    2.2.2.6 Configuring hdfs-site.xml  13
    2.2.2.7 Configuring mapred-site.xml  14
    2.2.2.8 Configuring yarn-site.xml  14
  2.2.3 Execution  14
  2.2.5 Debugging  15
  2.2.6 Web UI  16
3.1 Fully-Distributed Mode  17
  3.1.2 Installation Notes  17
    3.1.2.2 Configuring core-site.xml  18
    3.1.2.3 Configuring hdfs-site.xml  18
    3.1.2.4 Configuring mapred-site.xml  19
    3.1.2.5 Configuring yarn-site.xml  19
  3.1.8 Debugging  21
  3.1.9 Web UI  21
4.1 Supported Modes  23
  4.1.1 Requirements  23
  4.1.2 Standalone Mode  23
5.1 Requirements  27
5.2 Pseudo-Distributed Mode  27
6.1 Requirements  31
6.2 Fully-Distributed Mode  31
7.1 Hive Installation  37
  7.1.1 Requirements  37
  7.1.2 Installation Guide  37
8.1 Pig Installation  39
  8.1.1 Requirements  39
  8.1.2 Installation Guide  39
9.1.1 Requirements  43
9.1.2 Installation Guide  43

List of Tables
1.1 Hadoop Modules
6.1 Distributed Mode Sample Architecture  31
8.1 Pig Operators  42
Chapter 1
Hadoop Ecosystem
1.1
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters
of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each
of which may be prone to failures.
1.2
Apache Hadoop consists of two key components: a reliable, distributed file system called the Hadoop Distributed File System (HDFS) and a high-performance parallel data processing engine called Apache YARN. The most important aspect of Hadoop is its ability to move computation to the data rather than moving data to the computation. Thus HDFS and MapReduce are tightly integrated.
1.2.1 What Is HDFS?
HDFS is a distributed file system that provides high-throughput access to data. HDFS creates multiple replicas (3 by default) of each data block across the hadoop cluster to enable reliable and rapid access to the data.
The main daemons of HDFS are listed below.
The NameNode is the master of the system. It oversees and coordinates data storage (directories and files).
DataNodes are the actual slaves, deployed on all slave machines, that provide the actual storage for HDFS. They serve read and write requests from clients.
The Secondary NameNode is responsible for performing periodic checkpoints. In the event of NameNode failure, you can restart the NameNode using the latest checkpoint, but it is not a backup node for the NameNode.
1.2.2 What Is YARN?
Yet Another Resource Negotiator (YARN) is the second-generation MapReduce (MR) framework. It splits the two major responsibilities of the MapReduce JobTracker, namely resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).
The main daemons of YARN are listed below.
The ResourceManager has two main components: the Scheduler and the ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
The JobHistoryServer is a daemon that serves historical information about completed applications. Typically the JobHistory server can be co-deployed with the JobTracker, but we recommend running it as a separate daemon.
Apache HDFS (version 2.5.1; installation guides: Pseudo-Distributed, Fully-Distributed) - A distributed file system that provides high-throughput access to application data.
Apache YARN (version 2.5.1; installation guides: Pseudo-Distributed, Fully-Distributed) - A framework for job scheduling and cluster resource management.
Hadoop MapReduce (version 2.5.1; installation guides: Pseudo-Distributed, Fully-Distributed) - A YARN-based system for parallel processing of large data sets.
Apache HBase (version 0.98.8-hadoop2; installation guides: Standalone, Pseudo-Distributed, Fully-Distributed) - A scalable, distributed database that supports structured data storage for large tables.
Apache Hive (version 0.14.0; installation guide: Standalone) - A data warehouse infrastructure that provides data summarization and ad hoc querying.
Apache Pig (version 0.14.0; installation guide: Standalone) - A high-level data-flow language and execution framework for parallel computation.
Apache Spark (version 1.1.0; installation guide: Standalone) - A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Table 1.1: Hadoop Modules
1.3
Edit your .bashrc file ($ vim ~/.bashrc) and add the following entries:
export JAVA_HOME=/opt/jdk1.7.0_51
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export HADOOP_HOME="$HOME/hadoop-2.5.1"
export HIVE_HOME="$HOME/hive-0.14.0"
export HBASE_HOME="$HOME/hbase-0.98.8-hadoop2"
export PIG_HOME="$HOME/pig-0.14.0"
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
Press Esc, then type :wq to save and quit the vim editor, and reload the file:
$ source .bashrc
Chapter 2
2.1 Supported Modes
An Apache Hadoop cluster can be installed in one of three supported modes.
Local (Standalone) Mode - Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
Pseudo-Distributed Mode - each Hadoop daemon runs in a separate Java process on a single host.
Fully-Distributed Mode - a master/slave cluster setup where the daemons run on separate machines.
2.2 Pseudo-Distributed Mode
2.2.1
2.2.2 Installation Notes
2.2.2.1
Before beginning the installation, let us update the Ubuntu packages to the latest contents and get some tools for editing:
$ sudo apt-get update
$ sudo apt-get install vim
$ cd
2.2.2.2
To run Apache Hadoop, Java JDK 1.7 is required. Install OpenJDK 1.7 or Oracle JDK 1.7 on your Ubuntu machine; both routes are covered below.
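For the OpenJDK route, the install can be sketched as follows (package name assumed for Ubuntu 14.04):

```shell
# Install OpenJDK 7 from the Ubuntu repositories and confirm the version.
sudo apt-get install -y openjdk-7-jdk
java -version
```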
2.2.2.2.1
Press Esc, then type :wq to save and quit the vim editor.
$ source .bashrc
$ java -version
2.2.2.2.2
$ wget https://dl.dropboxusercontent.com/u/24798834/Hadoop/jdk-7u51-linux-x64.tar.gz
$ tar xzf jdk-7u51-linux-x64.tar.gz
$ sudo mv jdk1.7.0_51 /opt/
$ vim .bashrc
Add the JAVA_HOME and PATH entries shown in Section 1.3, then press Esc and type :wq to save and quit the vim editor.
$ source .bashrc
$ java -version
Console output :
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
2.2.2.3
2.2.2.4
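The download step is not shown at this point in the text; it matches the one used later in Section 3.1.2.1:

```shell
# Download and unpack Hadoop 2.5.1 into the home directory
# (mirror URL taken from Section 3.1.2.1).
wget http://apache.cs.utah.edu/hadoop/common/stable/hadoop-2.5.1.tar.gz
tar xzf hadoop-2.5.1.tar.gz
cd hadoop-2.5.1
```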
Ensure that JAVA_HOME is set in hadoop-env.sh and points to the Java installation you intend to use. You can set other environment variables in hadoop-env.sh to suit your requirements. Some of the default settings refer to the variable HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup scripts: HADOOP_HOME is the parent directory of the bin directory that holds the Hadoop scripts. In this instance it is $HADOOP_INSTALL/hadoop.
Configure JAVA_HOME in hadoop-env.sh by uncommenting the line export JAVA_HOME= and replacing it with one of the following.
For OpenJDK:
export JAVA_HOME=/usr
For Oracle JDK:
export JAVA_HOME=/opt/jdk1.7.0_51
2.2.2.5 Configuring core-site.xml
Edit core-site.xml (available at $HADOOP_HOME/etc/hadoop/core-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
2.2.2.6 Configuring hdfs-site.xml
Edit hdfs-site.xml (available at $HADOOP_HOME/etc/hadoop/hdfs-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/ubuntu/yarn/yarn_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/ubuntu/yarn/yarn_data/hdfs/datanode</value>
  </property>
</configuration>
Here ubuntu is the current user. dfs.replication is the data replication factor (3 by default); since this is a single-node setup it is set to 1. dfs.namenode.name.dir defines the local directory for storing NameNode data, and dfs.datanode.data.dir defines the local directory for storing DataNode data (i.e. the actual user data).
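The local directories above must exist before the NameNode is formatted; creating them can be sketched as follows (using $HOME for /home/ubuntu, since ubuntu is the current user):

```shell
# Create the NameNode and DataNode storage directories referenced in
# hdfs-site.xml. -p creates parent directories and is idempotent.
mkdir -p "$HOME/yarn/yarn_data/hdfs/namenode" "$HOME/yarn/yarn_data/hdfs/datanode"
```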
2.2.2.7 Configuring mapred-site.xml
Edit mapred-site.xml (available at $HADOOP_HOME/etc/hadoop/mapred-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
2.2.2.8 Configuring yarn-site.xml
Edit yarn-site.xml (available at $HADOOP_HOME/etc/hadoop/yarn-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
2.2.3 Execution
Now that all the configuration is done, the next step is to format the name node and start the hadoop cluster. Format the name node using the command below, run from the $HADOOP_HOME directory.
$ bin/hadoop namenode -format
2.2.3.1
Start the hadoop cluster with the command below, run from the $HADOOP_HOME directory.
$ sbin/start-all.sh
2.2.3.2
After starting the hadoop cluster you can verify the five hadoop daemons using the jps (Java process status) tool, which displays each daemon with its pid. It should list all five daemons: NameNode, DataNode, SecondaryNameNode, NodeManager and ResourceManager.
$ jps
Console output:
11495 SecondaryNameNode
11653 ResourceManager
11260 NameNode
25361 NodeManager
25217 DataNode
26101 Jps
2.2.4
The hadoop cluster is now up and running. Let us run the famous wordcount program on the cluster. We have to create a test input file for the wordcount program and upload it to HDFS under /input.
Create an input folder and test file using the commands below.
Note: Execute these commands inside the $HADOOP_HOME folder.
$ mkdir input && echo "This is word count example using hadoop 2.2.0" >> input/file
Upload the created folder to HDFS. On successful execution you will be able to see the folder contents on the HDFS web UI (http://localhost:50070) under the path /input/file.
$ bin/hadoop dfs -copyFromLocal input /input
On a successful run, the wordcount output is stored under the directory /output in HDFS and the result is available in the part-r-00000 file. A _SUCCESS file indicates a successful run of the job.
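The job submission step itself is not shown above; with the examples jar bundled with Hadoop it would look roughly like this (jar path assumed for the 2.5.1 release layout):

```shell
# Run the bundled wordcount example against /input, writing to /output.
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount /input /output
# Inspect the result once the job completes.
bin/hadoop dfs -cat /output/part-r-00000
```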
2.2.5 Debugging
If your hadoop cluster fails to list all the daemons, you can inspect the log files available in the $HADOOP_HOME/logs directory.
$ ls -al logs
2.2.6 Web UI
Web UI for Hadoop NameNode: http://localhost:50070/
Web UI for Hadoop HDFS: http://localhost:50070/explorer.html
Web UI for Hadoop JobTracker: http://localhost:50030/
Web UI for Hadoop TaskTracker: http://localhost:50060/
Note: JobTracker and TaskTracker are MR1 daemons; in this YARN-based setup the ResourceManager web UI is typically available at http://localhost:8088/ instead.
Chapter 3
Fully-Distributed Mode
YARN fully-distributed mode is a master/slave cluster setup where the daemons run on separate machines. In this book we implement a hadoop cluster on 3 machines, where 1 machine acts as the master node and the other two machines act as slave nodes. The table below describes the machine configurations.
Machine       Roles
MasterNode    NameNode, Secondary NameNode, ResourceManager
SlaveNode1    DataNode, NodeManager
SlaveNode2    DataNode, NodeManager
3.1.1
3.1.2 Installation Notes
1. Set up some tools on all 3 machines - see Section 2.2.2.1 for setting up the required tools. 2. Next, install the JDK on all 3 machines - see Section 2.2.2.2 for setting up the JDK.
3.1.2.1
Download and install the hadoop 2.5.1 package in the home directory of the ubuntu user on all machines (i.e. Master Node, Slave Node1 and Slave Node2).
$ wget http://apache.cs.utah.edu/hadoop/common/stable/hadoop-2.5.1.tar.gz
$ tar xzf hadoop-2.5.1.tar.gz
$ cd hadoop-2.5.1/etc/hadoop
Ensure that JAVA_HOME is set in hadoop-env.sh and points to the Java installation you intend to use. You can set other environment variables in hadoop-env.sh to suit your requirements. Some of the default settings refer to the variable HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup scripts: HADOOP_HOME is the parent directory of the bin directory that holds the Hadoop scripts. In this instance it is $HADOOP_INSTALL/hadoop.
Configure JAVA_HOME in hadoop-env.sh by uncommenting the line export JAVA_HOME= and replacing it with one of the following.
For OpenJDK:
export JAVA_HOME=/usr
For Oracle JDK:
export JAVA_HOME=/opt/jdk1.7.0_51
3.1.2.2 Configuring core-site.xml
Edit core-site.xml (available at $HADOOP_HOME/etc/hadoop/core-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<master hostname>:9000</value>
  </property>
</configuration>
3.1.2.3 Configuring hdfs-site.xml
Edit hdfs-site.xml (available at $HADOOP_HOME/etc/hadoop/hdfs-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/ubuntu/yarn/yarn_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/ubuntu/yarn/yarn_data/hdfs/datanode</value>
  </property>
</configuration>
Here ubuntu is the current user. dfs.replication is the data replication factor; in this multi-node setup it is left at the default of 3. dfs.namenode.name.dir defines the local directory for storing NameNode data, and dfs.datanode.data.dir defines the local directory for storing DataNode data (i.e. the actual user data).
3.1.2.4 Configuring mapred-site.xml
Edit mapred-site.xml (available at $HADOOP_HOME/etc/hadoop/mapred-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
3.1.2.5 Configuring yarn-site.xml
Edit yarn-site.xml (available at $HADOOP_HOME/etc/hadoop/yarn-site.xml) with the contents below.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value><master hostname>:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value><master hostname>:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value><master hostname>:8040</value>
  </property>
</configuration>
3.1.3
After configuring the hadoop config files, we have to add the list of slave machines to the slaves file located in the $HADOOP_HOME/etc/hadoop directory. Edit the file, remove the localhost entry and append the slave hostnames. Kindly replace them with the appropriate addresses from your data center:
<slave1 hostname>
<slave2 hostname>
3.1.4
Now we have to add routing entries for all machines to /etc/hosts, as follows. Kindly replace the placeholders with the appropriate addresses from your data center. Make this change on all machines.
<masternode private ip>  <master hostname>
<slavenode1 private ip>  <slave1 hostname>
<slavenode2 private ip>  <slave2 hostname>
3.1.5
Make sure that you are able to log in to all the slaves without a password.
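The password-less login setup can be sketched as follows (hostname placeholders as in the /etc/hosts entries above):

```shell
# On the master: generate a key pair (no passphrase) and append the public
# key to each slave's authorized_keys.
ssh-keygen -t rsa -P ""
ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@<slave1 hostname>
ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@<slave2 hostname>
```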
3.1.6
Now that all the configuration has been done, the next step is to format the name node and start the hadoop cluster. Format the name node using the command below, run from the $HADOOP_HOME directory. Execution is done on the Master Node.
$ bin/hadoop namenode -format
3.1.6.1
Start the hadoop cluster with the command below, run from the $HADOOP_HOME directory.
$ sbin/start-all.sh
3.1.6.2
After starting the hadoop cluster you can verify the hadoop daemons using the jps (Java process status) tool, which displays each daemon with its pid. It should list the appropriate daemons on each machine; check the table below to verify which daemons run on which machine.
$ jps
Machine       Roles
MasterNode    NameNode, Secondary NameNode, ResourceManager
SlaveNode1    DataNode, NodeManager
SlaveNode2    DataNode, NodeManager
3.1.7
The hadoop cluster is now up and running. Let us run the famous wordcount program on the cluster. We have to create a test input file for the wordcount program and upload it to HDFS under /input. Execute all these commands on the master machine.
Create an input folder and test file using the commands below.
Note: Execute these commands inside the $HADOOP_HOME folder.
$ mkdir input && echo "This is word count example using hadoop 2.2.0" >> input/file
Upload the created folder to HDFS. On successful execution you will be able to see the folder contents on the HDFS web UI (http://localhost:50070) under the path /input/file.
$ bin/hadoop dfs -copyFromLocal input /input
On a successful run, the wordcount output is stored under the directory /output in HDFS and the result is available in the part-r-00000 file. A _SUCCESS file indicates a successful run of the job.
3.1.8 Debugging
If your hadoop cluster fails to list all the daemons, you can inspect the log files available in the $HADOOP_HOME/logs directory.
$ ls -al logs
3.1.9 Web UI
The web UIs for Hadoop are the same as those listed in Section 2.2.6, with the master hostname in place of localhost (e.g. http://<master hostname>:50070/ for the NameNode).
Chapter 4
4.1 Supported Modes
An Apache HBase cluster can be installed in one of three supported modes.
Local (Standalone) Mode - HBase is configured to run against the local filesystem. This is not an appropriate configuration for a production instance of HBase, but it is useful for experimenting with HBase: you can insert rows into a table, perform put and scan operations against it, enable or disable it, and start and stop HBase using the hbase shell CLI.
Pseudo-Distributed Mode - each HBase daemon (HMaster, HRegionServer and ZooKeeper) runs in a separate Java process, but on a single host.
Fully-Distributed Mode - a master/slave cluster setup where the daemons run on separate machines. In a distributed configuration the cluster contains multiple nodes, each of which runs one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes and multiple RegionServer nodes. Fully-distributed mode reflects real-world scenarios.
4.1.1 Requirements
HBase requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure that the HADOOP_HOME entry is set in .bashrc.
4.1.2 Standalone Mode
Standalone mode of installation uses the local file system for storing hbase data. Standalone mode is not suitable for production; it is meant for development and testing purposes.
Loopback IP - HBase 0.94.x and earlier
Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to
127.0.1.1 and this will cause problems for you.
An example /etc/hosts file looks like this:
127.0.0.1 localhost
127.0.0.1 mydell
Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This will take you to a
mirror of HBase Releases. Click on the folder named stable and then download the binary file that ends in .tar.gz to your local
filesystem. Be sure to choose the version that corresponds with the version of Hadoop you are likely to use later. In most cases,
you should choose the file for Hadoop 2, which will be called something like hbase-0.98.3-hadoop2-bin.tar.gz. Do not download
the file ending in src.tar.gz for now.
Extract HBase Package
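The extraction step can be sketched as follows (archive name assumed from the HBase version used in this book, 0.98.8-hadoop2):

```shell
# Unpack the downloaded HBase release and enter its directory.
tar xzf hbase-0.98.8-hadoop2-bin.tar.gz
cd hbase-0.98.8-hadoop2
```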
$ vim conf/hbase-env.sh
Uncomment the JAVA_HOME entry in conf/hbase-env.sh and point it to your JDK location, e.g. /opt/jdk1.7.0_51:
export JAVA_HOME=/opt/jdk1.7.0_51
Edit conf/hbase-site.xml
Edit conf/hbase-site.xml and add entries for the ZooKeeper data directory and the HBase data directory, replacing the contents with the following.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/ubuntu/yarn/hbase_data/data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/ubuntu/yarn/hbase_data/zookeeper</value>
  </property>
</configuration>
Start HBase
Start HBase by running the shell script bin/start-hbase.sh. After it has started, the jps command should show the HMaster daemon responsible for hbase in the list of java processes.
$ bin/start-hbase.sh
After installing HBase it is time to get started with the HBase shell. Let us fire up the HBase shell using the bin/hbase shell command.
$ bin/hbase shell
Create a table
Use the create command to create a table. You must specify the table name and a column family as arguments to the create command. The command below creates a table called employeedb with the column family finance.
hbase> create 'employeedb', 'finance'
0 row(s) in 1.2200 seconds
List tables
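The command itself is elided here; in the HBase shell, listing tables is simply:

```
hbase> list
```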
Delete a table
To delete a table in hbase you must first disable it; only then can the table be deleted. Use the disable command to disable a table, the enable command to enable it, and the drop command to drop it.
hbase> disable 'employeedb'
0 row(s) in 1.6270 seconds
hbase> drop 'employeedb'
0 row(s) in 0.2900 seconds
Stopping HBase
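The stop command is elided here; HBase ships a companion script to start-hbase.sh:

```shell
# Stop all HBase daemons started by start-hbase.sh.
bin/stop-hbase.sh
```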
Chapter 5
5.1 Requirements
HBase requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure that the HADOOP_HOME entry is set in .bashrc.
5.2 Pseudo-Distributed Mode
Pseudo-distributed mode of installation uses the HDFS file system for storing hbase data. Pseudo-distributed mode is not suitable for production; it is meant for development and testing purposes on a single machine.
Loopback IP - HBase 0.94.x and earlier
Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to
127.0.1.1 and this will cause problems for you.
An example /etc/hosts file looks like this:
127.0.0.1 localhost
127.0.0.1 mydell
Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This will take you to a
mirror of HBase Releases. Click on the folder named stable and then download the binary file that ends in .tar.gz to your local
filesystem. Be sure to choose the version that corresponds with the version of Hadoop you are likely to use later. In most cases,
you should choose the file for Hadoop 2, which will be called something like hbase-0.98.3-hadoop2-bin.tar.gz. Do not download
the file ending in src.tar.gz for now.
Extract HBase Package
$ vim conf/hbase-env.sh
Uncomment the JAVA_HOME entry in conf/hbase-env.sh and point it to your JDK location, e.g. /opt/jdk1.7.0_51:
export JAVA_HOME=/opt/jdk1.7.0_51
Edit conf/hbase-site.xml
Edit conf/hbase-site.xml and add entries for the ZooKeeper data directory and the HBase data directory, replacing the contents with the following. In this case we use HDFS for HBase storage, and we are going to run the region server and ZooKeeper as separate daemons.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/ubuntu/yarn/hbase_data/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
Start HBase
Start HBase by running the shell script bin/start-hbase.sh. After it has started, the jps command should show the java processes responsible for hbase (HMaster, HRegionServer, HQuorumPeer).
$ bin/start-hbase.sh
$ jps
5605 HMaster
5826 Jps
5003 JobTracker
5545 HQuorumPeer
4756 DataNode
5728 HRegionServer
4546 NameNode
5157 TaskTracker
4907 SecondaryNameNode
After installing HBase it is time to get started with the HBase shell. Let us fire up the HBase shell using the bin/hbase shell command.
$ bin/hbase shell
Create a table
Use the create command to create a table. You must specify the table name and a column family as arguments to the create command. The command below creates a table called employeedb with the column family finance.
hbase> create 'employeedb', 'finance'
0 row(s) in 1.2200 seconds
List tables
Delete a table
To delete a table in hbase you must first disable it; only then can the table be deleted. Use the disable command to disable a table, the enable command to enable it, and the drop command to drop it.
hbase> disable 'employeedb'
0 row(s) in 1.6270 seconds
hbase> drop 'employeedb'
0 row(s) in 0.2900 seconds
Stopping HBase
Chapter 6
6.1 Requirements
HBase requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure that the HADOOP_HOME entry is set in .bashrc.
6.2 Fully-Distributed Mode
In fully-distributed mode the cluster contains multiple nodes, each of which runs one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes and multiple RegionServer nodes. It is well suited for real-world scenarios.
Distributed Mode Sample Architecture

Node Name           Roles
node1.sample.com    Master, ZooKeeper
node2.sample.com    Backup Master, ZooKeeper, RegionServer
node3.sample.com    ZooKeeper, RegionServer

Table 6.1: Distributed Mode Sample Architecture
This guide assumes that all nodes are configured on the same network and have full access to one another, i.e. that no firewall rules have been defined on any node.
Setting up password-less ssh
Node1 must be able to log in to node2 and node3 without a password. For that we are going to set up password-less SSH login from node1 to each of the others.
On Node1 generate a key pair
Assume that all HBase services run as the user named ubuntu. Generate the SSH key pair using the following commands:
$ sudo apt-get install ssh
$ ssh-keygen -t rsa -P ""
Note: The generated public key will be found at /home/ubuntu/.ssh/id_rsa.pub.
Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to
127.0.1.1 and this will cause problems for you.
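A common way to copy the public key to the other nodes is ssh-copy-id (hostnames from Table 6.1; the ubuntu user is assumed):

```
$ ssh-copy-id ubuntu@node2.sample.com
$ ssh-copy-id ubuntu@node3.sample.com
$ ssh ubuntu@node2.sample.com hostname    # should succeed without a password prompt
```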
Make sure that you are able to log in to all the slaves without a password.
Since node2 will run a backup Master, repeat the procedure above, substituting node2 everywhere you see node1. Be sure not to
overwrite your existing .ssh/authorized_keys files, but concatenate the new key onto the existing file using the >> operator rather
than the > operator.
Prepare Node1
Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This will take you to a
mirror of HBase Releases. Click on the folder named stable and then download the binary file that ends in .tar.gz to your local
filesystem. Be sure to choose the version that corresponds with the version of Hadoop you are likely to use later. In most cases,
you should choose the file for Hadoop 2, which will be called something like hbase-0.98.3-hadoop2-bin.tar.gz. Do not download
the file ending in src.tar.gz for now.
$ vim conf/hbase-env.sh
Uncomment the JAVA_HOME entry in conf/hbase-env.sh and point it to your JDK location, e.g. /opt/jdk1.7.0_51:
export JAVA_HOME=/opt/jdk1.7.0_51
Edit conf/regionservers
Edit conf/regionservers and remove the line containing localhost. Add lines with the hostnames or IP addresses for node2 and node3. Even if you wanted to run a RegionServer on node1, you should refer to it by the hostname the other servers would use to communicate with it; in this case, that would be node1.sample.com. This enables you to distribute the same configuration to each node of your cluster without any hostname conflicts. Save the file.
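For the sample architecture in Table 6.1, conf/regionservers would contain:

```
node2.sample.com
node3.sample.com
```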
Edit conf/hbase-site.xml
Edit conf/hbase-site.xml and add entries for the ZooKeeper data directory and the HBase data directory; replace the contents of the file with those below. In this case we use HDFS for HBase storage, and we are going to run the RegionServers and ZooKeeper as separate daemons.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://node1.sample.com:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/ubuntu/yarn/hbase_data/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node1.sample.com,node2.sample.com,node3.sample.com</value>
</property>
</configuration>
Edit or create the file conf/backup-masters and add a new line to it with the hostname for node2. In this demonstration, the hostname is node2.sample.com.
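A quick way to create the file from the shell (the hostname follows Table 6.1; run this from the HBase installation directory):

```shell
# Append the backup Master's hostname to conf/backup-masters.
# conf/ is HBase's configuration directory.
mkdir -p conf
echo "node2.sample.com" >> conf/backup-masters
cat conf/backup-masters
```

Using >> rather than > preserves any entries already in the file, as with authorized_keys earlier.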
Note:
Everywhere in your configuration that you have referred to node1 as localhost, change the reference to point to the hostname that
the other nodes will use to refer to node1. In these examples, the hostname is node1.sample.com.
Prepare node2 and node3
Download and unpack HBase on node2 and node3, just as you did for the standalone and pseudo-distributed installations.
Copy the configuration files from node1 to node2 and node3. Each node of your cluster needs the same configuration information: copy the contents of the conf/ directory on node1 to the conf/ directory on node2 and node3.
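One way to do this, assuming the ubuntu user and the same HBase installation path on every node (the path below is an assumption based on the downloaded tarball name), is scp:

```
$ scp conf/* ubuntu@node2.sample.com:/home/ubuntu/hbase-0.98.3-hadoop2/conf/
$ scp conf/* ubuntu@node3.sample.com:/home/ubuntu/hbase-0.98.3-hadoop2/conf/
```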
Start HBase Cluster
Important: Run bin/start-hbase.sh on node1 only; the script starts the daemons on the other nodes over SSH. ZooKeeper starts first, followed by the Master, then the RegionServers, and finally the backup Master. Verify the processes on each node with jps.

On node1:
$ jps
5605 HMaster
5826 Jps
5545 HQuorumPeer

On node2:
$ jps
5605 HMaster
5826 Jps
5545 HQuorumPeer
5930 HRegionServer

On node3:
$ jps
5826 Jps
5545 HQuorumPeer
5930 HRegionServer
In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for each RegionServer. Once your installation is working properly, you can access the Master UI at http://node1.sample.com:16010/ and the backup Master UI at http://node2.sample.com:16010/ using a web browser. For debugging, refer to the logs directory.
After installing HBase, it is time to get started with the HBase shell. Fire it up using the bin/hbase shell command.
$ bin/hbase shell
Create a table
Use the create command to create a table. You must specify the table name and a column family as arguments to the create command. The command below creates a table called employeedb with the column family finance.
hbase> create 'employeedb', 'finance'
0 row(s) in 1.2200 seconds
List tables
Delete a table
To delete a table in HBase you must first disable it; only then can the table be deleted. Use the disable command to disable a table, the enable command to re-enable it, and the drop command to drop it.
hbase> disable 'employeedb'
0 row(s) in 1.6270 seconds
hbase> drop 'employeedb'
0 row(s) in 0.2900 seconds
Stopping HBase
Chapter 7
Hive Installation
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage.
7.1.1
Requirements
Hive requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation.
7.1.2
Installation Guide
Download the Hive package from the hive.apache.org site and extract it using the following commands. In this installation we use the default Derby database as the metastore.
export JAVA_HOME=/opt/jdk1.7.0_51
export HADOOP_HOME=/home/ubuntu/hadoop-2.5.1
export HIVE_HOME=/home/ubuntu/hive-0.14.0
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
Press Esc, then type :wq to save and quit the vim editor.
Note:
It is assumed that Hadoop is installed in the home directory of the ubuntu user (i.e. /home/ubuntu/hadoop-2.5.1) and Hive is installed in the home directory of the ubuntu user (i.e. /home/ubuntu/hive-0.14.0).
$ source .bashrc
$ java -version
Console output :
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
$ $HADOOP_HOME/sbin/start-all.sh
$ jps
After installing Hive, it is time to get started with the Hive shell. Fire it up using the bin/hive command.
$ $HIVE_HOME/bin/hive
Try out some basic commands listed below; for detailed documentation, refer to the Apache documentation at hive.apache.org.
hive> CREATE DATABASE my_hive_db;
hive> DESCRIBE DATABASE my_hive_db;
hive> USE my_hive_db;
hive> DROP DATABASE my_hive_db;
hive> exit;
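A slightly fuller sketch with a hypothetical table (the database name, table name, columns, and delimiter below are all assumptions for illustration):

```
hive> CREATE DATABASE demo_db;
hive> USE demo_db;
hive> CREATE TABLE employee (name STRING, country STRING, amt DOUBLE)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> SELECT country, SUM(amt) FROM employee GROUP BY country;
hive> DROP TABLE employee;
hive> DROP DATABASE demo_db;
```

The SELECT statement is compiled into a MapReduce job and run against the cluster.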
Chapter 8
Pig Installation
Apache Pig is a platform for analyzing large data sets. Pig can run in a distributed fashion on a cluster, and consists of a high-level language called Pig Latin for expressing data analysis programs. Pig is similar in spirit to Hive's data-warehouse approach: using Pig Latin we can analyze, filter, and extract data sets. Internally, Pig converts Pig Latin statements into MapReduce jobs, which execute against HDFS to retrieve data sets.
8.1.1
Requirements
Pig requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure the HADOOP_HOME entry is set in .bashrc.
8.1.2
Installation Guide
Download Pig package from the pig.apache.org site and extract the packages using the following commands.
Extract Pig Package
$ $HADOOP_HOME/sbin/start-all.sh
$ jps
Execution Modes
Local Mode: To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
MapReduce Mode: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig or pig -x mapreduce).
After installing Pig, it is time to get started with the Pig shell. Fire it up using the bin/pig command.
$ $PIG_HOME/bin/pig
Try out some basic commands listed below; for detailed documentation, refer to the Apache documentation at pig.apache.org.
Invoke the Grunt shell by typing the pig command (in local or mapreduce mode). Then enter Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator displays results on your terminal screen; the STORE operator stores results in HDFS.
Note:
To use the commands below, download the employee dataset from this url and upload it to HDFS as /dataset/employee.csv, and download manager.csv from this url and upload it to HDFS as /dataset/manager.csv.
A typical analysis, expressed in SQL form: select country, sum(amt) as tot_amt from employee group by country order by tot_amt
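The SQL query above can be sketched in Pig Latin as follows; the column names and types are assumptions, since the layout of employee.csv is not specified:

```
grunt> employee = LOAD '/dataset/employee.csv' USING PigStorage(',')
>>        AS (name:chararray, country:chararray, amt:double);
grunt> grouped  = GROUP employee BY country;
grunt> totals   = FOREACH grouped GENERATE group AS country, SUM(employee.amt) AS tot_amt;
grunt> sorted   = ORDER totals BY tot_amt;
grunt> DUMP sorted;
```

Each statement builds a relation from the previous one; nothing executes until DUMP (or STORE) triggers the MapReduce job.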
Category              Operator              Description
Loading and storing   LOAD                  Loads data from the filesystem or other storage into a pig relation
                      STORE                 Saves a pig relation to other storage, e.g. HDFS
                      DUMP                  Prints a pig relation to the console
Filtering             FILTER                Removes unwanted rows from a pig relation
                      DISTINCT              Removes duplicate rows from a pig relation
                      FOREACH...GENERATE    Adds or removes fields from a pig relation
                      STREAM                Transforms a pig relation using an external program
                      SAMPLE                Selects a random sample of a pig relation
Grouping and joining  JOIN                  Joins two or more relations
                      COGROUP               Groups the data in two or more pig relations
                      GROUP                 Groups the data in a single pig relation
                      CROSS                 Creates the cross-product of two or more pig relations
Sort, combine, split  ORDER                 Sorts a pig relation by one or more fields
                      LIMIT                 Limits the size of a pig relation to a maximum number of tuples
                      UNION                 Combines two or more pig relations into one
                      SPLIT                 Splits a pig relation into two or more pig relations
Chapter 9
Spark Installation
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and
an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL
for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
9.1.1
Requirements
Apache Spark requires that a JDK and Hadoop be installed. See the JDK installation section for Oracle JDK or OpenJDK installation, and the Hadoop installation section for Hadoop installation. Ensure the HADOOP_HOME entry is set in .bashrc.
9.1.2
Installation Guide
Download the Spark package for Hadoop 1 from the spark.apache.org site and extract it using the following commands. Choose the appropriate Hadoop version. Note: this guide follows setting up Apache Spark with a Hadoop 1 version.
Extract Spark Package
$ $HADOOP_HOME/bin/start-all.sh
$ jps
Execution Modes
For the Web UI: once the master is running, its status page is available at http://<master-host>:8080 by default, and each worker serves one on port 8081.
To launch a Spark standalone cluster with the launch scripts, you need to create a file called conf/slaves in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less SSH (using a private key).
Once you've set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, available in SPARK_HOME/sbin:
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-all.sh - Starts both a master and a number of slaves as described above.
sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.
Note: these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
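A sketch of the conf/slaves setup; the hostnames are carried over from the earlier HBase example and are assumptions, and the commands are run from the Spark installation directory on the master:

```shell
# List one worker hostname per line in conf/slaves.
mkdir -p conf
printf '%s\n' node2.sample.com node3.sample.com > conf/slaves
cat conf/slaves
# Then launch the cluster (requires password-less SSH to each worker):
# ./sbin/start-all.sh
```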