Notion Press
Old No. 38, New No. 6
McNichols Road, Chetpet
Chennai - 600 031
First Published by Notion Press 2017
Copyright © Prashant Nair 2017
All Rights Reserved.
ISBN 978-1-947752-07-8
This book has been published with all reasonable efforts taken to make the
material error-free after the consent of the author. No part of this book shall be
used or reproduced in any manner whatsoever without written permission from the
author, except in the case of brief quotations embodied in critical articles and
reviews.
The Author of this book is solely responsible and liable for its content including
but not limited to the views, representations, descriptions, statements,
information, opinions and references [“Content”]. The Content of this book shall
not constitute or be construed or deemed to reflect the opinion or expression of
the Publisher or Editor. Neither the Publisher nor the Editor endorses or approves
the Content of this book or guarantees the reliability, accuracy or completeness of
the Content published herein, and they do not make any representations or
warranties of any kind, express or implied, including but not limited to the
implied warranties of merchantability and fitness for a particular purpose. The
Publisher and Editor shall not be liable for any errors or omissions, whether such
errors or omissions result from negligence, accident, or any other cause, or for
claims for loss or damages of any kind, including without limitation, indirect or
consequential loss or damage arising out of use, inability to use, or the
reliability, accuracy or sufficiency of the information contained in this book.
Dedication
To my parents, my teachers, my students and my wife!
Contents
Preface
What This Book Covers
Lab Exercises Covered in This Book
Chapter 1, Introducing Bigdata and Hadoop, introduces you to the world of
Bigdata and explores the roles and responsibilities of a Hadoop administrator.
Chapter 2, Apache Hadoop Installation and Deployment, takes a deep dive right from
building hadoop-2.8.0 to installing and configuring Hadoop in standalone,
pseudo-distributed, and distributed modes.
Chapter 3, Demystifying HDFS, talks in detail about HDFS storage, how it
operates, and teaches how to access the HDFS layer using the CLI and the NFS
gateway. It also covers some of the common administration tasks associated with
HDFS.
Chapter 4, Understanding YARN and Schedulers, helps the reader understand
YARN internals and best practices for setting up YARN in a production cluster.
It also talks about implementing schedulers like the Capacity and Fair Schedulers.
Chapter 5, HDFS Federation and Upgrade, helps the reader understand the
multi-tenancy concerns of the HDFS architecture, which are addressed using
Federation. It also talks about how to implement HDFS Federation in the cluster,
how to perform an HDFS cluster upgrade from Gen1 to Gen2, and how to perform a
rolling upgrade.
Chapter 6, Apache Zookeeper Admin Basics, deals with understanding and
implementing Zookeeper in standalone and leader-follower mode. We also discuss
how to use the Zookeeper CLI to view the contents of the Zookeeper filesystem.
Chapter 7, High Availability in Apache Hadoop, deals with understanding how
to overcome the single point of failure of the NameNode system. We will be
implementing High Availability using QJM. We will also build and implement
HA on a federated cluster.
Chapter 8, Apache Hive Admin Basics, talks about how Hive works, installing
Hive with a MySQL metastore, using HiveServer2 with the Beeline client, and
securing the same.
Chapter 9, Apache HBase Admin Basics, deals with understanding the HBase
architecture and installing and configuring HBase in single node, single master,
and multi-master setups. After that we will learn some basic HBase shell
commands and admin commands. Lastly, we will learn how to integrate HBase
with Hive and how to perform bulk data uploads in HBase.
Chapter 10, Data Acquisition using Apache Sqoop, talks about how to transfer
data between an RDBMS and Hadoop. We will be learning CLI commands with
some variations to deal with Sqoop's import and export operations.
Chapter 11, Apache Oozie, deals with how to create and schedule workflows in
Oozie. We will be building, installing, and configuring Oozie. Once done, we
will learn how to create a simple workflow.
Chapter 12, Installing Apache Pig, Spark and Flume, deals with installation
and configuration of Apache Pig, Apache Spark and Apache Flume.
Lab Exercises Covered in This Book
Chapter No. | Lab No. | Description
2 | 1 | Building Apache Hadoop 2.8.0 from Scratch
2 | 2 | Setting up Apache Hadoop 2.8.0 in Standalone Mode (CLI-Minicluster Mode)
2 | 3 | Setting up Apache Hadoop 2.8.0 in Pseudo-Distributed Mode (Single Node Cluster)
2 | 4 | Setting up Apache Hadoop 2.8.0 in Distributed Mode (Multinode Cluster)
3 | 5 | Working with HDFS Filesystem Shell Commands
3 | 6 | Setting up Replication Factor of an Existing Cluster
3 | 7 | Dynamically Setting up Replication Factor During Specific File Upload
3 | 8 | Setting up Block Size in an Existing Hadoop Cluster
3 | 9 | Adding Nodes in an Existing Hadoop Cluster Without Cluster Downtime
3 | 10 | Decommissioning Datanode in an Existing Hadoop Cluster Without Data Loss and Downtime in the Cluster
3 | 11 | Whitelisting Datanodes in a Hadoop Cluster
3 | 12 | Working with Safemode (Maintenance Mode) in Hadoop
3 | 13 | Checkpointing Metadata Manually
3 | 14 | Setting up NFS Gateway to Access HDFS
3 | 15 | Setting up Datanode Heartbeat Interval
3 | 16 | Setting up File Quota in HDFS
3 | 17 | Removing File Quota in HDFS
3 | 18 | Setting up Space Quota in HDFS
3 | 19 | Removing Space Quota in HDFS
3 | 20 | Configuring Trash Interval in HDFS and Recovering Data from Trash
4 | 21 | Creating Multiple Users and Groups in Ubuntu System
4 | 22 | Setting up Capacity Scheduler in YARN
4 | 23 | Setting up Fair Scheduler in YARN
5 | 24 | Setting up HDFS Federation in a 4 Node Cluster
5 | 25 | Implementing ViewFS in an Existing 4 Node Federated Cluster
5 | 26 | Performing Hadoop Upgrade from Gen1 (1.2.1) to Gen2 (2.7.3)
6 | 27 | Setting up Zookeeper in Standalone Mode
6 | 28 | Setting up Zookeeper in Leader-Follower Mode
6 | 29 | Running Basic Commands in Zookeeper CLI
7 | 30 | Installing and Configuring a 4-Node HDFS HA-Enabled Fresh Cluster
7 | 31 | Configuring YARN ResourceManager HA in the 4 Node HDFS HA-Enabled Cluster
7 | 32 | Configuring HDFS and YARN ResourceManager HA in an Existing Non-HA Enabled Cluster Without Any Data Loss
7 | 33 | Building a Federated HA-Enabled Cluster
7 | 34 | Performing Rolling Upgrade from Hadoop-2.7.3 to Hadoop-2.8.0 in an Existing 4 Node HDFS and YARN RM HA-Enabled Cluster
8 | 35 | Setting up Apache Hive with MySQL Database as a Metastore Server
8 | 36 | Connecting Beeline Client to HiveServer2
8 | 37 | Configuring HiveServer2 to Secure Beeline Client Access
What is Bigdata?
IBM's Definition of Bigdata
Types of Bigdata
Typical Bigdata Project Phases
Introducing Hadoop
Features of Hadoop
Role of Hadoop Administrator in Bigdata Industry
Creating your lab setup for hands-on exercises
What is Bigdata?
Whenever I start my training on Bigdata Hadoop Development or Hadoop
Administration, the first question I ask my participants is, “WHAT IS
BIGDATA?” It may sound crazy because the participants are there to learn exactly
that. However, the reason I ask this question is to understand their perception
of the Bigdata paradigm. I also want you to take a minute and think about what
Bigdata is from your viewpoint.
If you ask me, Bigdata is a term, not a technology or a tool. It's all about the
ability of a software, framework, or architecture to handle data (handle refers
to LOAD, PROCESS, and STORE/FORWARD data). So during any execution process, if an
application or infrastructure fails because of the data itself, I can say I am
facing a Bigdata problem. Think of it this way: there is a text file of size 3GB
named file1.txt and you want to open the file. If you try opening it using
Notepad, you will observe that Notepad will not respond. However, if you try to
open the same file using WordPad, you will see some latency introduced, but
WordPad will be able to load the data in RAM. I know this example sounds lame,
but try to see the hidden message. In this example, Notepad faced a Bigdata
problem. Notepad was not able to open the file even though the system had enough
disk space, RAM, and OS support. But when I opened the same data using WordPad, I
was successful because for WordPad it is just normal data. Thus, any data can be
Bigdata.
IBM's Definition of Bigdata
According to IBM, data that is generated every day from sources such as sensors,
social media, GPS, and so on is Bigdata. Bigdata spans three dimensions, viz. the
3 V's - Volume, Velocity, and Variety.
Ideally, Bigdata is not just about size. It is about gaining insight. That's why
IBM also introduced the fourth V, i.e. Veracity or Value, which is all about
getting some value or insights out of the data.
Types of Bigdata
Any Bigdata that is meant for processing can be categorized as follows:
Structured data is organized data consisting of elements that are addressable
during processing and analysis. For example, data stored in a database can be
considered structured data.
Semi-structured data is also organized data, but it does not follow any formal
structure. It does, however, contain markers or tags to enforce hierarchies of
fields and records within the data. For example, XML and JSON data are considered
semi-structured data.
Unstructured data is the actual raw data. It does not have a formal structure or
metadata. Usually this kind of data can be converted into structured data, but
that is a tedious and time-consuming process. Some examples of unstructured data
include text files, images, videos, PDFs, and so on.
The data view phase is concerned with working with cleansed and logically
structured data. This phase is characterized by the ability to perform real-time
queries to get the desired output and broadcast the same in a web application.
For example, the output generated in the previous phase contains all Mumbai
data. The same can be queried based on certain conditions and broadcasted to a
web application. Please note that in this phase, we deal with structured data. We
generally use NoSQL components or SQL components to achieve this phase.
The visualization and analytics phase gathers intelligence over the data and its
subsequent usage. As an example, we can consider data representation in the form
of a two-dimensional graph, or predictions, recommendations, and so on.
Introducing Hadoop
Apache Hadoop is a Java-based framework that is meant for distributed batch
processing and distributed storage for extremely large amounts of data. It is a
highly scalable and reliable framework across a cluster of commodity hardware.
Hadoop can handle any kind of data, be it structured, semi-structured, or
unstructured data. However, Hadoop was designed to handle unstructured data.
Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely
used text search library. Hadoop has its origins in Apache Nutch, an open source
web search engine, which is itself a part of the Lucene project.
The core of Apache Hadoop consists of the storage part of Hadoop, officially
termed HDFS, i.e. the Hadoop Distributed File System. The processing part of
Hadoop is based on the MR programming model, i.e. the MapReduce model. The Apache
Hadoop MapReduce and HDFS components were inspired by the Google papers on
MapReduce and the Google File System.
Apache Hadoop has two generations: Generation1 and Generation2. In Generation1,
there exist HDFS and MR. In Generation2, there exist HDFS and YARN. YARN stands
for Yet Another Resource Negotiator and is meant for cluster management and
processing tool management. The default auxiliary processing service in
Generation2 Hadoop is MRv2. In this book, you will learn about each of these
components in detail.
Features of Hadoop
Let us see some key features of Apache Hadoop.
Minimum Requirements:
8GB RAM.
150GB free hard disk space.
Intel Core i3 processor or later supporting Intel VT-x and VT-d.
Windows 7 or later/Ubuntu 12.04 or later/Mac OSX or later.
Virtualization Software like VMware Workstation or Oracle
VirtualBox.
Recommended Requirements:
16GB RAM.
250GB free hard disk space.
Intel Core i5 processor or later supporting Intel VT-x and VT-d.
Windows 7 or later/Ubuntu 12.04 or later/Mac OSX or later.
Virtualization Software like VMware Workstation or Oracle
VirtualBox.
We will be using Virtual Machines for our practice. However, feel free to use
multiple machines connected over a network if that is applicable to you.
I will be using an Ubuntu 14.04 LTS Server OS Virtual Machine for my hands-on
exercises. You can download these VMs using the link
https://bigdataclassmumbai.com/hadoop-book
Following are the things you will get at the above link:
Summary
Step 2: Set up password-less SSH. It is required for the Hadoop daemons since
they get initialized using SSH. This can be done using the following steps:
Problem Statement:
We have 4 nodes and we need to setup a Multinode Cluster as shown in the
above network diagram.
Solution
Understand your setup and fill in the table given below. Ensure that in Step 1
(under hosts file configuration) you add the IP address of your machine.
Node Name | Desired Hostname | IP Address for Example Purpose | Your Machine/VM's IP Address
Node1 | node1.mylabs.com | 192.168.1.1 |
Node2 | node2.mylabs.com | 192.168.1.2 |
Node3 | node3.mylabs.com | 192.168.1.3 |
Node4 | node4.mylabs.com | 192.168.1.4 |
Step 1: Setup network Hostname and host file configuration on all 4 machines.
In a real production setup, we use a dedicated DNS server for Hostnames.
However, for our lab setup, we will setup local host files as resolvers. Hadoop
ideally recommends working on Hostname rather than IP addresses for host
resolution.
Perform the following steps in each machine.
sudo vi /etc/hostname
# Replace the existing hostname with the desired hostname as mentioned in the
# previous step. For example, for node1 it will be:
node1.mylabs.com

sudo vi /etc/hosts
# Comment the 127.0.1.1 line and add the following lines in all 4 machines.
# This file holds the information of all machines which need to be resolved.
# Each machine will have entries of all 4 machines that are participating in
# the cluster installation.
192.168.1.1 node1.mylabs.com
192.168.1.2 node2.mylabs.com
192.168.1.3 node3.mylabs.com
192.168.1.4 node4.mylabs.com
Once the configuration is done, you will need to restart the machines for the
hostname changes to take effect. To restart your machine, you can type the
following command:
sudo init 6
Step 2: Set up password-less SSH between your NameNode system and the other
systems. In our example setup, we need a password-less SSH configuration between
node1.mylabs.com and the other nodes participating in the cluster. This is done
so that node1 can contact the other nodes to invoke the services. The reason for
this kind of configuration is that the NameNode system is the single point of
contact for users and administrators.
Perform the following commands on node1:
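The commands are not reproduced at this point in the text; a minimal sketch,
assuming the hadoop user and the default key path (mirroring the ssh-copy-id
usage shown later in this book):

ssh-keygen -t rsa -P "" -f /home/hadoop/.ssh/id_rsa
# Push the public key to every other node participating in the cluster
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node2.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node3.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node4.mylabs.com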
What is HDFS?
HDFS Architecture.
HDFS Write Operation.
HDFS Read Operation.
Common FileSystem Shell Commands.
Replication in HDFS.
Setting up Block Size in HDFS.
Adding Nodes in clusters.
Decommissioning Nodes in clusters.
Whitelisting DataNodes.
Safemode in Hadoop.
Implementing NFS Gateway in HDFS.
Configuring DataNode heartbeat interval.
Setting up Name Quotas and Space Quota in HDFS.
Enabling Recycle Bin for Hadoop HDFS (Trash).
What is HDFS?
HDFS stands for Hadoop Distributed File System. It is the storage part of
Hadoop that allows Bigdata files to be stored across the cluster in a distributed
and reliable manner. HDFS allows the distributed application to access the data
for distributed processing and analysis quickly and reliably. HDFS is a non-
POSIX compliant FileSystem since it is an append-only FileSystem and does not
honor POSIX durability semantics.
Let us try and understand what HDFS is all about. What does it do? HDFS has
been made to handle large files. HDFS can handle small files, but it is really
designed to handle large files in the most optimal manner, such that a file can
be broken down, distributed, and stored across an entire cluster. Bigdata can be
data from sensors, satellite data, user activity tracking, webserver logs and
many more. HDFS stores the data by breaking it into smaller blocks. These blocks
are the logical units of data that are stored in HDFS. The default block size is
64MB in Generation1 Hadoop and 128MB in Generation2 Hadoop. The benefit of
splitting large files into blocks is that multiple systems in the cluster
maintain the blocks, so multiple systems participate during processing,
introducing parallel processing in the cluster. This makes your operation faster
as compared to operations running on a single machine.
HDFS internally stores the data in the form of blocks that are stored across
several nodes. It can store multiple backup copies of individual blocks to
maintain high availability of data stored in HDFS and can also tolerate failures
(when one of the computers in the cluster goes down/stops functioning). The key
factor of HDFS is that you can achieve the same results using commodity
hardware. There is no need to work on specialized hardware.
Now that we know HDFS at a high level, let us explore the corresponding
architecture.
HDFS Architecture
The following architecture depicts a typical HDFS setup in a 5 node cluster:
a. NameNode
b. DataNode
c. SecondaryNameNode
The cluster shown above has a total HDFS space of 1500GB. This is because only
the systems that host the DataNode service are considered by HDFS for its
storage operations. Let us now learn the role of each daemon in detail.
The NameNode daemon is the master daemon for HDFS storage. The NameNode is the
single point of contact for distributed applications and clients. It is the
NameNode that maintains the index information of the files that are stored in
HDFS. When we say index information, we refer to namespace information and
block information. The NameNode also maintains the mapping between files and
blocks: a file is a logical entity stored in HDFS in the form of a set of blocks
(where blocks are the actual physical entities maintained by the DataNodes). The
NameNode maintains all the index information in two critical files called
fsimage and edits. The edits file is responsible for maintaining the index
information of the ongoing session, held in main memory and snapshotted to
persistent storage, whereas fsimage is meant to maintain the information of the
entire blockpool in persistent storage and is loaded when the NameNode daemon
starts.
The DataNode daemon is the slave daemon for HDFS storage. The DataNode daemon is
responsible for performing read/write operations on data that is stored in HDFS.
It is the DataNode daemon that maintains the blocks. It also performs the
replication asynchronously whenever applicable. The default replication factor
of HDFS is 3, which means that 3 copies of each block will be stored in the HDFS
layer where applicable. It is the DataNode that interacts with the application
or client when it comes to PULL or PUSH operations on blocks.
The SecondaryNameNode daemon is also called the Checkpoint Node in the Hadoop
community. As that name suggests, this daemon is used to maintain a checkpoint
of the HDFS metadata. Let us try to understand the need for the
SecondaryNameNode before getting into further detail. As discussed above, the
NameNode maintains the HDFS metadata in two files, namely fsimage and edits,
wherein edits maintains the changes that happen in the FileSystem during an
ongoing session. Now, the edits logs that are generated will be applied to
fsimage only when we restart the NameNode service or manually flush them. But in
reality, restarting the NameNode in a production cluster is very rare. This
results in a high volume of data in the edits logs that becomes challenging to
manage, and if a crash occurs, the entire edits data can be lost, resulting in
data loss and orphaned blocks. To overcome this issue, the SecondaryNameNode
comes in handy. It takes over the responsibility of merging edits with fsimage.
The whole purpose of the SecondaryNameNode is to perform a checkpoint with the
NameNode and to ensure that the entire snapshot of HDFS is safely maintained.
However, this is not a replacement or failover solution. In Generation1, we rely
on the SecondaryNameNode daemon for recovery of HDFS using manual techniques;
this is, however, not considered a feasible solution for NameNode failures. Thus
in Generation2, Hadoop introduced the concept of a standby NameNode, which we
will see later.
Since we have now understood the daemons and their roles, let us understand how
HDFS performs a write operation with the help of the following diagram:
As shown above, we have a client machine that wants to upload a file named
file1.txt from this machine to HDFS. The size of the file is 120MB. Let us take
the scenario shown below:
Total HDFS space: 1500GB
Available HDFS space: 1500GB
Block size set for the HDFS cluster: 64MB
Replication factor: 2
Let us understand this step-by-step process:
Step 1:
Client will initiate the process of uploading data using the HDFS Client. The
HDFS client will logically figure out how many blocks should be created for the
input file. In our case, the HDFS client will break the file into two blocks (based
on the block size of 64MB). For each block, the HDFS client will contact
NameNode for the metadata and DataNode information where the data is to be
stored. The NameNode will receive the request to create a file and will initially
check for the availability of space to store the file, considering the
replication factor. It then checks whether the file already exists and checks
any quotas, if applicable. In this case, the file size is 120MB and the
replication factor is 2. Thus, the required space to perform this operation is
240MB. If this space is available, only then will the NameNode initiate the
second step; else it will throw an IOException.
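To make the arithmetic concrete: with a 64MB block size, the 120MB file yields
ceil(120/64) = 2 blocks (one 64MB block and one 56MB block), and with a
replication factor of 2 the cluster needs 120MB x 2 = 240MB of free space to
accept the upload.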
Step 2:
The NameNode, based on the information received from the client, will start
generating the metadata and block information, which includes the block names
and the DataNodes on which each block and its replicas will be stored. In our
example, there will exist two blocks, namely B1 and B2, stored on DN1 and DN2
for B1 and on DN2 and DN3 for B2, considering the replication factor of the
cluster is 2. Once the metadata is generated, it is given to the client as a
response.
Step 3:
The client, based on the metadata received, will initiate the block transfer to
the destined DataNode. The client will also ensure that the DataNode initiates
the replication process once the block is copied to it. In our case, the client
machine will initiate the transfer of block B1 to datanode DN1 and will also
inform DN1 to initiate the replication process to DN2, as shown below:
This process, called the DataNode pipeline, takes the accountability for
replication. Once DN2 receives the data, a FINISH flag will be sent to DN1.
From DN1, the FINISH flag will be sent to both the NameNode and the client. The
same process happens for block B2.
Once the FINISH for each block is received by the client, the client will then
send the FINISH flag to the NameNode. At the same time, the DataNodes will send
their block reports to the NameNode, confirming that all blocks are placed as
per the specification. Let us now understand the read operation in HDFS.
Lab 6 Setting up Replication Factor of an Existing Cluster
To continue with this demo, please create a 5 node cluster as explained in the
previous chapter.
Step 1: Open hdfs-site.xml in node1.mylabs.com and add the following property
within the configuration tag.
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Step 2: Copy the modified hdfs-site.xml file in the remaining nodes.
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Step 3: Try uploading a file.
hdfs dfs -mkdir /testdata
hdfs dfs -copyFromLocal /home/hadoop/sample /testdata/sample
hdfs dfs -stat %r /testdata/sample
You will see that the file sample is replicated on 2 DataNodes. You can also
check the same in the WebUI.
Please note, with this method we have set the replication factor for all future
uploads. All existing files will still hold the replication factor that was set
at upload time. Let's take an example to change the replication factor of an
existing file.
hdfs dfs -mkdir /data1
hdfs dfs -put /home/hadoop/sample /data1/sample
hdfs dfs -stat %r /data1/sample
The above command will upload the file with RF=2. Let us assume you want to
change the RF to 1. You can do it using the -setrep command as shown below:
hdfs dfs -setrep 1 /data1/sample
hdfs dfs -stat %r /data1/sample
Lab 7 Dynamically Setting up Replication Factor during Specific File Upload
You can also set the replication factor during the upload. This can be done
using the following command:
hdfs dfs -D dfs.replication=1 -copyFromLocal /home/hadoop/sample /testdata/sample2
Lab 8 Setting up Block Size in an Existing Hadoop Cluster
In this scenario, we will set the block size to 64MB. We need to set up
hdfs-site.xml with the following property:
<property>
<name>dfs.blocksize</name>
<value>64m</value>
</property>
After setting it up, try uploading a file. You will observe that the block size
set for this file is 64MB. You can use the value suffixes k (kilo), m (mega),
g (giga), t (tera), p (peta), or e (exa) to specify sizes. If you don't specify
any suffix, the value will be considered to be in bytes.
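As a quick sanity check, you can also override the block size for a single
upload and verify it with -stat (a sketch; the sample paths are carried over
from the earlier labs):

hdfs dfs -D dfs.blocksize=64m -copyFromLocal /home/hadoop/sample /testdata/sample64
# %o prints the block size of the file in bytes
hdfs dfs -stat %o /testdata/sample64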
Lab 9 Adding Nodes in an Existing Hadoop Cluster without Cluster Downtime
In this section, we will see how to add a 5th node to the existing 4 node Hadoop
cluster created as per the previous chapter. This cluster is running live and
active. We will add a new node named node5.mylabs.com.
Once the configuration is done, you will need to restart the machine for the
hostname changes to take effect. To restart your machine, you can type the
following command:
sudo init 6
You also need to make an entry for the new system in the hosts file of
node1.mylabs.com, node2.mylabs.com, node3.mylabs.com, and node4.mylabs.com.
Please note that you do not need to restart, since we are only modifying the
hosts file and not touching the hostname file. The intent of this step is to
allow node5 to be resolved by all machines participating in the cluster.
Step 2: Set up password-less SSH between your NameNode system and the
node5.mylabs.com node. In our example setup, we need to set up a password-less
configuration between node1.mylabs.com and node5.mylabs.com. This is done so
that node1 can contact node5 to invoke the hadoop services. The reason for this
kind of configuration is that the NameNode system is the single point of contact
for all users and administrators.
Perform the following commands on node1:
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node5.mylabs.com
Step 3: Install Hadoop 2.8.0 in Standalone mode on node5.mylabs.com. Refer to
Setting up Hadoop-2.8.0 in Standalone Mode (CLI MiniCluster) in the previous
chapter in case you require additional help.
Step 4: Copy core-site.xml, mapred-site.xml, hdfs-site.xml, and yarn-site.xml
from node1.mylabs.com to node5.mylabs.com.
On node1.mylabs.com perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
Step 5: Set up the slave configuration in the cluster. We need to perform this
step only on Node1. This configuration informs the cluster which nodes will host
the slave services (DataNode and NodeManager). Usually, this configuration is
placed only on the NameNode machine. You just need to append node5.mylabs.com on
the last line of the file.
vi /home/hadoop/hadoop2/etc/hadoop/slaves
# Append node5.mylabs.com so that the file lists all slave nodes as shown
node3.mylabs.com
node4.mylabs.com
node5.mylabs.com
Step 6: Start DataNode and NodeManager service in node5.mylabs.com
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager
Verify whether the services are running or not using the 'jps' command. Also get
the cluster report to ensure node5 is now successfully a part of the cluster,
using the following command:
hdfs dfsadmin -report
Step 7: Since we have scaled the cluster, we will need to balance the load of
the cluster. This can be achieved using the following command on
node1.mylabs.com:
start-balancer.sh
Decommissioning the Datanodes in Existing Cluster
Decommissioning is a process by which we gracefully remove a DataNode from a
running cluster without affecting any storage and processing activities
triggered by Hadoop or similar applications. The responsibility of the specific
DataNode that is planned to be decommissioned will be assigned to other
DataNodes, and the NameNode will keep track of this and update the same in the
metadata.
Lab 10 Decommissioning Datanode in Existing Hadoop Cluster Without Data Loss
and Downtime in the Cluster
In this section, we will see how to decommission the 4th node in the existing 5
node Hadoop cluster.
Step 1: Create a file in the NameNode system i.e. node1.mylabs.com that holds
the hostname of the node which is to be decommissioned. You can set any name
to this file. In this case, let us create a file named ‘remove.txt’.
vi /home/hadoop/remove.txt
node4.mylabs.com
Step 2: Open hdfs-site.xml and add the following property within the
configuration tag.
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/remove.txt</value>
</property>
Step 3: Refresh the cluster to apply the changes in real time.
hdfs dfsadmin -refreshNodes
You will observe that the changes will be applied and node4.mylabs.com will
change its state from NORMAL to DECOMMISSIONING IN PROGRESS. Once
decommissioning is complete, the state will change to DECOMMISSIONED. You can
track the status either using the NameNode WebUI or by using the following
command:
hdfs dfsadmin -report
Lab 11 Whitelisting Datanodes in a Hadoop Cluster
Step 1: Create a file in the NameNode system, i.e. node1.mylabs.com, that holds
the set of hostnames of the nodes that are to be whitelisted. You can give any
name to this file. In this case, let us create a file named 'allow.txt'.
vi /home/hadoop/allow.txt
node3.mylabs.com
node4.mylabs.com
node5.mylabs.com
Step 2: Open hdfs-site.xml and add the following property:
<property>
<name>dfs.hosts</name>
<value>/home/hadoop/allow.txt</value>
</property>
Step 3: Refresh the cluster to apply the changes in real time.
hdfs dfsadmin -refreshNodes
Understanding Safemode in Hadoop
Safemode indicates that HDFS is in read-only mode. It is ideally meant for
transitioning the cluster from production to maintenance mode. In this mode,
HDFS does not allow any write-related or processing operations; however, reads
are possible. When we start the Hadoop services, the HDFS service goes into
safemode to perform the following activities:
However, a user can bring the Hadoop cluster into safemode for the following
reasons, using the commands shown after this list:
a. Check-pointing metadata.
b. Removing orphaned blocks.
c. Removing corrupted blocks and their metadata.
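These are the standard dfsadmin safemode subcommands; the last line is the
standard way to checkpoint metadata manually (it requires safemode first):

hdfs dfsadmin -safemode get      # check the current safemode state
hdfs dfsadmin -safemode enter    # put HDFS into safemode (read-only)
hdfs dfsadmin -safemode leave    # return HDFS to normal operation
hdfs dfsadmin -saveNamespace     # checkpoint metadata manually (while in safemode)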
When it comes to the NFS gateway in our Hadoop cluster, we can make our edge
server, the NameNode server, or any DataNode an NFS gateway. However, in
production, we prefer to make our edge server the NFS gateway. Let us see how
to configure it.
Step 1: Stop all Hadoop services.
stop-all.sh
Step 2: Update/Add core-site.xml with the following parameters. If it already
exists, don’t make any changes.
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>hadoopvm</value>
</property>
Please note that in the above property names, hadoop is the username that has
access to HDFS operations, as highlighted below:
hadoop.proxyuser.hadoop.groups
hadoop.proxyuser.hadoop.hosts
Step 3: Update/Add hdfs-site.xml with the following parameters. If they already
exist, don't make any changes.
<property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
<property>
<name>dfs.nfs.rtmax</name>
<value>1048576</value>
</property>
<property>
<name>dfs.nfs.wtmax</name>
<value>65536</value>
</property>
<property>
<name>dfs.nfs.exports.allowed.hosts</name>
<value>* rw</value>
</property>
Step 4: Start portmap as the root user and verify the same.
sudo -u root hadoop2/sbin/hadoop-daemon.sh start portmap
sudo jps
Step 5: Start NFS3 Gateway protocol as a normal user.
hadoop-daemon.sh start nfs3
Step 6: Create a mount directory and mount HDFS on that directory.
sudo mkdir /mnt/hdfs-cluster
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync hadoopvm:/ /mnt/hdfs-cluster/
You have completed the process. Now let us test the setup.
Step 1: Traverse the mounted directory and observe the contents.
ls /mnt/hdfs-cluster
You will see that all directory structures of your current HDFS will be
reflected here. This shows that your setup is successful.
Step 2: Let's create a folder named prashantnfs in the / (root) location of HDFS
using NFS.
cd /mnt/hdfs-cluster
mkdir prashantnfs
Observe HDFS WebUI. You will see that the folder will be reflected in HDFS.
Step 3: Now let us copy one file and delete the same.
cp /home/hadoop/sample .
hadoop fs -ls /prashantnfs (You will see that the file is copied in HDFS)
rm -rf sample
hadoop fs -ls /prashantnfs (You will see that the file is deleted in HDFS)
Configuring DataNode Heartbeat Interval
Let us first understand the idea behind the heartbeat interval. A DataNode
notifies the NameNode about its presence and its state through a heartbeat. The
default heartbeat interval is 3 seconds. However, the same is configurable. If a
DataNode fails to send the heartbeat within the 3-second frame, it does not mean
that the NameNode will declare the DataNode DEAD. Instead, the NameNode takes
the decision based on the heartbeat re-check interval (default value: 5
minutes); a DataNode is marked DEAD only after 2 x the re-check interval plus
10 x the heartbeat interval, i.e. 10 minutes and 30 seconds with the defaults.
In real production, we may sometimes need to change these configurations as per
the request of our Architect. The reasons can be many, some of which are
application timeouts, limited network bandwidth, and so on.
Lab 15 Setting up Datanode Heartbeat Interval
Let us learn how to set a heartbeat interval of 5 seconds and a heartbeat
re-check interval of 15 minutes.
Step 1: Append the following properties to the hdfs-site.xml of the NameNode
system, i.e. node1.mylabs.com:
<property>
<name>dfs.heartbeat.interval</name>
<value>5</value>
</property>
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<value>900000</value>
</property>
In the above configuration, dfs.heartbeat.interval accepts values in seconds,
whereas dfs.namenode.heartbeat.recheck-interval accepts values in milliseconds.
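Applying the dead-node formula from the previous section to these lab values:
2 x 15 minutes + 10 x 5 seconds = 30 minutes and 50 seconds before an
unresponsive DataNode is marked DEAD.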
Step 2: Copy hdfs-site.xml to all the nodes participating in the cluster
(assuming you are working on a 5 node cluster).
On node1.mylabs.com perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Setting up Quotas
In Hadoop, we can set two kinds of quotas:
a. File Quota
b. Space Quota
Let us put the above data in a table so that it is easier to understand each
parameter.
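For reference, the command that sets a file (name) quota, which the removal
below undoes, looks like this (a sketch assuming the /prashant directory used in
this lab; the quota value of 5 is illustrative):

hadoop dfsadmin -setQuota 5 /prashant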
In case you want to remove the file quota, you can use the following command:
hadoop dfsadmin -clrQuota /prashant
To check whether the quota is removed or not, type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:
c) Setting Space Quota
Lab 18 Setting up Space Quota in HDFS
In order to limit a user with respect to space consumption, you can use a Space
Quota. Let us see an example of setting a Space Quota on the same directory.
hadoop dfsadmin -setSpaceQuota 20M /prashant
The above command will set a space quota of 20MB. Once usage exceeds it, the
user will get an exception.
To check whether the quota is set or not, type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:
Let us put the above data in a table so that it is easier to understand each
parameter.
To clear/reset the space limitation you can perform the following command:
hadoop dfsadmin -clrSpaceQuota /prashant
To check whether the quota is removed or not, type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:
none inf none inf 1 2 36 /prashant
(The columns are: file quota, remaining file quota, space quota, remaining space
quota, directory count, file count, content size, and path name.)
Enabling Recycle Bin in Hadoop HDFS (Trash Configuration)
As the name suggests, in this section we will learn how to set up the trash
configuration. This is beneficial when it comes to recovering data that has been
accidentally deleted.
Lab 20 Configuring Trash Interval in HDFS and Recovering Data from
Trash
Let us see how to set this up. The assumption here is that you are performing
this demo on a multinode cluster. However, you can do the same on any kind of
cluster configuration.
Step 1: Configure the core-site.xml of the NameNode machine. For me, it is
node1.mylabs.com. Append the following property in the configuration file:
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
Here, the value is set in minutes, so in this configuration we set the trash
interval to 10080 minutes / 60 / 24 = 7 days.
Step 2: Once the above configuration is done, you need to copy the file to the
other systems participating in the cluster.
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 3: Restart the NameNode service on node1.mylabs.com.
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode
Step 4: Now let us test whether the trash is working or not. To do so, let us
upload one temporary file and delete the same.
hdfs dfs -mkdir /deletetest
hdfs dfs -put /home/hadoop/sample /deletetest/sample
hdfs dfs -ls -R /deletetest
hdfs dfs -rm -r /deletetest/sample
INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval =
10080 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoopvm:8020/deletetest/sample' to trash at:
hdfs://hadoopvm:8020/user/hadoop/.Trash/Current
In case you want to recover the deleted file, you can do the same using the copy
command:
hdfs dfs -cp /user/hadoop/.Trash/Current/deletetest/sample /deletetest/sample
Summary
What is YARN
Need of YARN
Understanding YARN operation
Calculating and Configuring YARN parameters
Schedulers in YARN
Configuring Capacity Scheduler
Configuring Fair Scheduler
What is YARN?
YARN (Yet Another Resource Negotiator) is one of the most important components
in terms of processing scalability. In Generation1 Hadoop, the only processing
supported was MR processing, which is batch-oriented. With the introduction of
YARN, other processing techniques became possible. YARN is a general purpose,
distributed, application management framework. YARN is a cluster management
framework which supports multiple Java-based processing frameworks like Apache
Spark, Apache Tez, Apache Flink, etc. Today, YARN is considered to be a
prerequisite for an Enterprise Hadoop setup.
YARN enhances the capability of your existing Hadoop cluster by achieving:
Need of YARN
The first two questions that generally arise once people know about Hadoop
Generation1 are:
Why do we need YARN?
Why do we need a new resource management technique in a distributed computing
system?
Precisely, as the name YARN suggests, it's just another resource management
framework. In Hadoop Gen1, there exist HDFS and MR, as shown in the figure
below:
HDFS, as discussed in the previous chapter, refers to the storage. MapReduce is
the processing part of Hadoop. The services associated with MapReduce are
JobTracker and TaskTracker, where JobTracker is the master service and
TaskTracker is the slave service. Following are the operations done by
JobTracker:
JobTracker takes the processing request from the client.
JobTracker talks to the NameNode to get the metadata of the data which is to be
processed.
JobTracker figures out the best TaskTracker in the cluster for execution, based
on data locality and the slots available to execute a task of a given operation.
JobTracker monitors the TaskTracker's progress and reports back to the client
application.
If the above three checks are positive, the ResourceManager will create an
ApplicationMaster specific to that application. It will also instruct the
NodeManagers to create containers which will perform the execution operation,
supervised by that ApplicationMaster. Once the job execution is completed, the
ResourceManager will ensure that resource de-allocation and garbage collection
are performed.
Step 2: Calculate the reserved memory. Ideally, reserved memory is the RAM
needed by the system processes and other Hadoop processes.
RESERVED MEMORY = Reserved System Memory + Reserved HBase Memory (if HBase is on
the same node).
The values for system memory and HBase memory can be derived from the
table below:
Total Memory per Node | Recommended Reserved System Memory | Recommended Reserved HBase Memory
4GB | 1GB | 1GB
8GB | 2GB | 1GB
16GB | 2GB | 2GB
24GB | 4GB | 4GB
48GB | 6GB | 8GB
64GB | 8GB | 8GB
72GB | 8GB | 8GB
96GB | 12GB | 16GB
128GB | 24GB | 24GB
256GB | 32GB | 32GB
512GB | 64GB | 64GB
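As a worked example using this table: a 64GB node that also runs HBase reserves
8GB for the system plus 8GB for HBase, leaving 64 - 16 = 48GB for YARN
containers.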
For mapred-site.xml
Property Name | Property Value (Formula)
mapreduce.map.memory.mb | RAM per container
mapreduce.reduce.memory.mb | 2 * RAM per container
mapreduce.map.java.opts | 0.8 * RAM per container
mapreduce.reduce.java.opts | 0.8 * 2 * RAM per container
yarn.app.mapreduce.am.resource.mb | 2 * RAM per container
yarn.app.mapreduce.am.command-opts | 0.8 * 2 * RAM per container
For mapred-site.xml (worked example, with RAM per container = 2GB)
Property Name | Property Value (Formula)
mapreduce.map.memory.mb | RAM per container = 2GB = 2048MB
mapreduce.reduce.memory.mb | 2 * RAM per container = 2 * 2 = 4GB = 4096MB
mapreduce.map.java.opts | 0.8 * RAM per container = 0.8 * 2 = 1.6GB = 1638MB
mapreduce.reduce.java.opts | 0.8 * 2 * RAM per container = 3.2GB = 3276MB
yarn.app.mapreduce.am.resource.mb | 2 * RAM per container = 2 * 2 = 4GB = 4096MB
yarn.app.mapreduce.am.command-opts | 0.8 * 2 * RAM per container = 3.2GB = 3276MB
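Put together, the worked values above would land in mapred-site.xml roughly as
follows (a sketch; note that the java.opts properties take JVM flags, so the
megabyte figures become -Xmx values):

<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx3276m</value>
</property>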
a. FIFO
b. Capacity Scheduler
c. Fair Scheduler
The default scheduler in Hadoop Gen1 is FIFO, and in Hadoop Gen2 it is the
Capacity Scheduler. The FIFO scheduler does not require any specific
configuration. Its role is to maintain a queue of all job submissions and to run
the jobs one by one, on a first-come, first-served basis. The FIFO scheduler is
not designed for multi-tenant Hadoop environments. For multi-tenant
environments, there exist the CapacityScheduler and the FairScheduler. In this
book, we will explore the Capacity and Fair Scheduler configurations.
Before implementing schedulers, let’s first create multiple users and groups in
the system.
Groups | Users
default | user1
sales | user2
analytics | user3
Solution
Step 1: Create Groups
sudo groupadd default
sudo groupadd sales
sudo groupadd analytics
Step 2: Create users and link them with their groups.
sudo useradd user1 -G default
sudo useradd user2 -G sales
sudo useradd user3 -G analytics
Understanding and Implementing Capacity Scheduler
The Capacity Scheduler allows multiple users to securely share a large cluster
in terms of resource allocation, based on the constraints set by the
administrator. The core abstraction entity of the Capacity Scheduler is the
queue. It is the admin's responsibility to create queues as per the business
requirement to use the cluster resources.
Lab 22 Setting up Capacity Scheduler in YARN
We will implement the queues as per the diagram below:
Step 1: Ensure that the Hadoop services are live and active.
Step 2: Configure the CapacityScheduler in yarn-site.xml:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
Step 3: Configure the scheduler queues in capacity-scheduler.xml as per the
table shown below:
vi /home/hadoop/hadoop2/etc/hadoop/capacity-scheduler.xml
<configuration>
<!-- Initial Configuration -->
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>
<!-- Creating 3 queues i.e. default, sales and analytics -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,sales,analytics</value>
</property>
<!-- Setting default queue parameters -->
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
<value>*</value>
</property>
<!-- End for default queue -->
<!-- Setting sales queue parameters -->
<property>
<name>yarn.scheduler.capacity.root.sales.capacity</name>
<value>20</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.acl_administer_queue</name>
<value>*</value>
</property>
<!-- End for sales queue -->
<!-- Setting analytics queue parameters -->
<property>
<name>yarn.scheduler.capacity.root.analytics.capacity</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.acl_administer_queue</name>
<value>*</value>
</property>
<!-- End for analytics queue -->
<!-- User/Group to Queue Mapping -->
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value>g:sales:sales,g:default:default,g:analytics:analytics</value>
</property>
<!-- End for user to queue mapping -->
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>
</configuration>
Step 4: Refresh the queue settings in real time. Perform the following command
on the ResourceManager system:
yarn rmadmin -refreshQueues
Step 5: Check the list of queues configured in the system.
mapred queue -list
17/07/04 00:42:18 INFO client.RMProxy: Connecting to
ResourceManager at /0.0.0.0:8032
======================
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0,
CurrentCapacity: 0.0
======================
Queue Name : sales
Queue State : running
Scheduling Info : Capacity: 20.0, MaximumCapacity: 100.0,
CurrentCapacity: 0.0
======================
Queue Name : analytics
Queue State : running
Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0,
CurrentCapacity: 0.0
Step 6: Execute an MR application to verify whether the request is managed by
the specific queue or not.
sudo -u user1 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data1/emp /TestOP/WC1
sudo -u user2 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data1/emp /TestOP/WC2
sudo -u user3 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data1/emp /TestOP/WC3
Based on the configuration done, the request generated by user1 must be managed
by the default queue, that generated by user2 by the sales queue, and that
generated by user3 by the analytics queue.
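One way to confirm the mapping after the jobs finish is to list the applications
and read their queue column (an illustrative check; application IDs and
timestamps will differ on your cluster):

yarn application -list -appStates FINISHED

The Scheduler page of the ResourceManager WebUI shows the same per-queue usage
graphically.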
Lab 23 Setting up Fair Scheduler in YARN
Step 1: Ensure that the Hadoop services are live and active.
Step 2: Configure the FairScheduler in yarn-site.xml:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
<name>yarn.scheduler.fair.allocation.file</name>
<value>/home/hadoop/hadoop2/etc/hadoop/fair-scheduler.xml</value>
</property>
Step 3: Create fair-scheduler.xml and configure it for the users with queues and
weights.
vi /home/hadoop/hadoop2/etc/hadoop/fair-scheduler.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<allocations>
<queue name="prod">
<aclAdministerApps>hadoop</aclAdministerApps>
<aclSubmitApps>hadoop</aclSubmitApps>
<minResources>1000 mb,0vcores</minResources>
<maxResources>6000 mb,100vcores</maxResources>
<maxRunningApps>20</maxRunningApps>
<weight>1.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
</queue>
<queue name="dev">
<aclAdministerApps>user1</aclAdministerApps>
<aclSubmitApps>user1</aclSubmitApps>
<minResources>1000 mb,0vcores</minResources>
<maxResources>6000 mb,100vcores</maxResources>
<maxRunningApps>20</maxRunningApps>
<weight>1.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
</queue>
<user name="hadoop">
<maxRunningApps>5</maxRunningApps>
</user>
<user name="user1">
<maxRunningApps>1</maxRunningApps>
</user>
<user name="user2">
<maxRunningApps>1</maxRunningApps>
</user>
<userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
Step 4: Refresh the queue settings in real time. Perform the following command
on the ResourceManager system:
yarn rmadmin -refreshQueues
In case the queues are not refreshed, you can restart the ResourceManager
service; you will then observe in the WebUI, under the Scheduler tab, that the
FairScheduler has been set successfully.
Summary
Introduction
HDFS has two layers: the Namespace layer and the Block Management layer. The
Namespace layer is responsible for maintaining the directory structure and the
metadata of the data that is stored in the storage layer. The Block Management
layer is responsible for maintaining DataNode membership information and storage
operations like create, delete, modify, and fetching the locations of the blocks
associated with a file. It also supports replication and replica placement
during the HDFS write operation.
The biggest problem was that there existed only one namespace and one block
management layer, tightly coupled with each other. If either component were to
fail, the entire cluster would go down. The second predominant concern in this
architecture was that only one NameNode system could exist because of this tight
coupling. The third concern was isolation. Ideally, any large production cluster
(1000+ nodes) is expected to have a multi-tenant environment so that multiple
organizations/groups/teams can share the cluster, resulting in more ROI (Return
on Investment). However, since there exists only one NameNode, this was
unachievable.
In the above block diagram, there exist two Namespace volumes, on nn1.mylabs.com
and nn2.mylabs.com, each holding its own independent namespace and block pool
and managed independently of the other block pools available in the cluster. In
this setup, when you delete a Namespace, the associated block pool also gets
deleted and the DataNodes delete its blocks. However, this does not affect the
operations of the other Namespace volumes.
Following are the benefits of this kind of setup:
Lab 24 Setting up HDFS Federation in a 4 Node Cluster
Problem Statement 1: Create a 4 node cluster setup from scratch with the
following specification:
Solution
Understand your setup and fill in the table given below. Ensure that in Step 1
(under hosts file configuration) you add the IP address of your machine.
Node Name | Desired Hostname | IP Address for Example Purpose | Your Machine/VM's IP Address
Node1 | nn1.mylabs.com | 192.168.1.1 |
Node2 | nn2.mylabs.com | 192.168.1.2 |
Node3 | dn1.mylabs.com | 192.168.1.3 |
Node4 | dn2.mylabs.com | 192.168.1.4 |
Step 1: Setup network Hostname and host file configuration on all 4 machines.
In a real production setup, we use a dedicated DNS server for Hostnames.
However, for our lab setup, we will setup local host files as resolvers. Hadoop
ideally recommends working on Hostname rather than IP addresses for host
resolution.
Perform the following steps in each machine.
sudo vi /etc/hostname
# Replace the existing hostname with the desired hostname as mentioned in the
# previous step. For example, for node1 it will be:
nn1.mylabs.com

sudo vi /etc/hosts
# Comment the 127.0.1.1 line and add the following lines in all 4 machines.
# This file holds the information of all machines which need to be resolved.
# Each machine will have entries of all 4 machines that are participating in
# the cluster installation.
192.168.1.1 nn1.mylabs.com
192.168.1.2 nn2.mylabs.com
192.168.1.3 dn1.mylabs.com
192.168.1.4 dn2.mylabs.com
Once the configuration is done, you will need to restart the machines for the
hostname changes to take effect. To restart your machine, you can type the
following command:
sudo init 6
Step 2: Set up password-less SSH between your NameNode systems and the DataNode
systems. In our example setup, we need a password-less SSH configuration between
nn1.mylabs.com and the DataNodes participating in the cluster, and between
nn2.mylabs.com and the DataNodes participating in the cluster. This is done so
that the NameNodes can contact the other nodes to invoke the services. The
reason for this kind of configuration is that the NameNode system is the single
point of contact for all users and administrators.
Perform the following commands on nn1.mylabs.com and nn2.mylabs.com:
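The commands themselves are not reproduced here; a minimal sketch mirroring the
ssh-copy-id usage from the earlier chapters (run on both nn1.mylabs.com and
nn2.mylabs.com):

ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@dn1.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@dn2.mylabs.com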
Lab 25 Implementing ViewFS in Existing 4 Node Federated Cluster
We will be using the above created federated cluster for this lab.
Step 1: Create two folders in the root location of HDFS, named prod and test, on
nn1.mylabs.com and nn2.mylabs.com respectively, as shown below:
On nn1.mylabs.com,
hdfs dfs -mkdir hdfs://nn1.mylabs.com:9000/prod
On nn2.mylabs.com,
hdfs dfs -mkdir hdfs://nn2.mylabs.com:9000/test
Step 2: Stop the HDFS service.
stop-dfs.sh
Step 3: Setup core-site.xml which will hold the configurations for viewFS. Once
configured, replicate the same in all other nodes participating in the cluster.
On nn1.mylabs.com,
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<!-- Append the following properties within the configuration tag -->
<property>
<name>fs.default.name</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./production</name>
<value>hdfs://nn1.mylabs.com:9000/prod</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./testing</name>
<value>hdfs://nn2.mylabs.com:9000/test</value>
</property>
Save the file. Now copy this file to the other nodes.
On nn1.mylabs.com,
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@nn2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@dn1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@dn2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 4: Start the HDFS services.
start-dfs.sh
Step 5: Check whether viewFS is working:
hdfs dfs -ls /production
hdfs dfs -mkdir /production/demo1
hdfs dfs -mkdir /testing/demo2
hdfs dfs -ls /testing
You will observe that, physically, the demo1 folder is placed and maintained by nn1.mylabs.com while demo2 is maintained by nn2.mylabs.com. Using viewFS, however, both appear under one transparent, unified namespace.
The diagram below depicts the existing cluster. You can create your 4 node Gen1 cluster by referring to my blog at http://bigdataclassmumbai.com
Step 1: Ensure that there are no existing pending upgrades active on the system. Please note that your Hadoop HDFS services must be online and running before performing the following commands:
hadoop version
(The output should be hadoop-1.2.1)
hadoop dfsadmin -upgradeProgress status
Step 2: Maintain a report of the current HDFS
hadoop fsck / -files -blocks -locations > dfs-v-old-fsck-1.log
hadoop dfs -lsr / > dfs-v-old-lsr-1.log
hadoop dfsadmin -report > dfs-v-old-report-1.log
Step 3: Stop Hadoop services
stop-all.sh
Step 4: Install Hadoop 2.7.3 as per the instructions given in Setting up Hadoop in Distributed Mode (Multinode Cluster) in the Hadoop Installation and Deployment chapter, from Step 3 to Step 8. Ensure you give different port numbers in fs.default.name and the other parameters. The reason we install hadoop-2.7.3 here is to later perform a rolling upgrade from 2.7.3 to 2.8.1.
Step 5: Start the Upgrade Process
hadoop version
Ensure the version shown is 2.7.3. If you are using PuTTY or a similar terminal client, restart the session and then retry. Once you have confirmed the Hadoop version, initiate the upgrade using the following command:
hadoop-daemon.sh start namenode -upgrade
This step upgrades only the metadata index, not the data blocks. Hadoop will start the upgrade and bring the NameNode service up in Safemode.
Step 6: Perform the following command:
hadoop dfs -lsr / > dfs-v-new-lsr-0.log
Compare it with the dfs-v-old-lsr-1.log file to ensure none of the metadata is lost. Once confirmed, you can start the HDFS service so that the data block reports are in sync with the new cluster.
Step 7: Start the HDFS cluster
start-dfs.sh
Note that your NameNode service will already be running. You will observe that the DataNodes start exchanging block information with the NN. The NN will be out of Safemode within about 30 seconds, subject to network bandwidth. Perform the relevant checks:
hdfs dfsadmin -report > dfs-v-new-report-1.log
hdfs dfs -lsr / > dfs-v-new-lsr-1.log
Compare these with the old files.
Step 8: Once assured that all your files are available and consistent with the previous data structure, finalize the upgrade using:
hdfs dfsadmin -finalizeUpgrade
Summary
What is Zookeeper?
Zookeeper in terms of Hadoop and its Ecosystem
How the Zookeeper Ensemble initializes and works
Setting up Zookeeper in Standalone mode
Setting up Zookeeper in Leader-Follower mode
Reading and Writing in Zookeeper
a. Configuration Management
b. Naming Service
c. Synchronization
d. Group Services
This chapter explains the basics of Zookeeper and how it works at a basic level. We will also see the uses of Zookeeper in the Hadoop HA setup and the HBase setup.
Let us now start with a basic understanding of Zookeeper design in a nutshell.
Zookeeper Design
The following diagram shows the working of a typical Zookeeper cluster:
This setup contains a 3 node Zookeeper cluster. Zookeeper can be configured to work in two modes: standalone mode and leader-follower (ensemble) mode.
When we start the services on all Zookeeper servers, the servers will hold an election among themselves to figure out which Zookeeper server in the Ensemble will become the leader. The leader will perform the write operations and will also broadcast the writes to the other Zookeeper servers. The election is done using port 3888 (by default). This port configuration is done in the server.x parameter as defined in Step 3 of Problem Statement 1. An example of the server configuration is as follows:
server.1=node1.mylabs.com:2888:3888
Where:
2888 is meant for internal broadcast communications between the leader
Zookeeper Server to multiple follower Zookeeper Servers.
3888 is the port defined for Leader Election.
When a client connects to Zookeeper, it connects through clientPort, which is by default 2181. Using this port, the client writes data into the Zookeeper in-memory store, which is saved as a snapshot and logged in the persistent storage specified in the dataDir property. If the write is done on the leader, the leader broadcasts it to all followers. If the client connects to a follower and performs a write, the follower forwards it to the leader and the leader performs the task of broadcasting the information.
In case the leader dies due to a planned or unplanned event, the other Zookeeper servers will initiate an election to choose another leader. Technically, Zookeeper data is highly available due to broadcasting and is also fault-tolerant, since another server takes charge if the current leader dies.
Step 1: You may use your MultiNode Hadoop cluster to perform this practical. Download Zookeeper on node1.mylabs.com using the following command. The tar file will be downloaded to the /home/hadoop location.
wget http://www-us.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
Step 2: Extract the tar file in the home folder in node1.mylabs.com.
tar -xvzf zookeeper-3.4.6.tar.gz
Step 3: Setup Environment Variable for Zookeeper in node1.mylabs.com.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once you save the .bashrc file, apply the environment variables using the following command:
exec bash
Step 4: Configure Zookeeper to work in Ensemble mode on node1.mylabs.com
vi /home/hadoop/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
clientPort=2181
initLimit=5
syncLimit=2
dataDir=/home/hadoop/zookeeper-3.4.6/data
dataLogDir=/home/hadoop/zookeeper-3.4.6/logs
server.1=node1.mylabs.com:2888:3888
server.2=node2.mylabs.com:2888:3888
server.3=node3.mylabs.com:2888:3888
Step 5: Create the necessary folders for Zookeeper and assign ownership.
mkdir -p /home/hadoop/zookeeper-3.4.6/data
mkdir -p /home/hadoop/zookeeper-3.4.6/logs
Here, the data folder will be used to store all snapshots of the Zookeeper in-memory store and the data of Zookeeper. The logs folder will be used to maintain transaction logs of the updates done to the in-memory store of Zookeeper.
Step 6: Create the myid file in the data folder (the value of myid will be 1 for us).
vi /home/hadoop/zookeeper-3.4.6/data/myid
1
We have configured server.1 in zoo.cfg file. Here, the myid file represents the
unique id of the Zookeeper server. The myid file is stored in the data folder of
the Zookeeper servers.
Step 7: Once you have completed all steps till Step 6, copy this entire setup to the remaining two nodes (node2.mylabs.com and node3.mylabs.com). This saves you from redoing all the steps on the other systems; it also verifies that SSH communication between the nodes is working well. (SSH communication itself has no relation to Zookeeper.) Perform the following commands on node1.mylabs.com:
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@node2.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@node3.mylabs.com:/home/hadoop/.
Step 8: Let us make necessary changes/configurations in node2.mylabs.com and
node3.mylabs.com
node2.mylabs.com
a. Setup the .bashrc file with Zookeeper environment variables.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once the changes are saved, apply the environment variables using the following command:
exec bash
b. Change the myid file present in the data folder of Zookeeper.
vi /home/hadoop/zookeeper-3.4.6/data/myid
2
node3.mylabs.com
a. Setup the .bashrc file with Zookeeper environment variables.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once the changes are saved, apply the environment variables using the following command:
exec bash
b. Change the myid file present in the data folder of Zookeeper.
vi /home/hadoop/zookeeper-3.4.6/data/myid
3
Step 9: Start the Zookeeper service on all 3 nodes manually using the following command:
zkServer.sh start
Once started, check whether the service has started. To do so, type:
jps
If you see the QuorumPeerMain service in the list, the Zookeeper service is running. To check the states of the Zookeeper servers, perform the following command:
zkServer.sh status
If it shows the following, your configuration is successful. One system will show the mode as leader while the other two systems will show the mode as follower.
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: leader
Working with Zookeeper CLI
Lab 29 Running basic commands in Zookeeper CLI
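The lab's command listing is not reproduced in the source; a minimal zkCli.sh session, assuming the ensemble configured above is running, looks like this:
zkCli.sh -server node1.mylabs.com:2181
#Inside the CLI: list, create, read, update, and delete a znode
ls /
create /demo "hello"
get /demo
set /demo "updated"
delete /demo
quit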
Introduction
When it comes to an Apache Hadoop cluster, the NameNode system is a single point of failure. If the NameNode or ResourceManager services fail, the entire cluster goes down; neither storage nor processing requests will be accepted by the cluster. Thus, to make the cluster highly available, Apache Hadoop introduced NameNode HA (HDFS HA) and ResourceManager HA, where NameNode/HDFS HA protects the storage services and ResourceManager HA protects the processing and cluster management services from known and unknown disasters.
In production, the most preferred method is the use of the Quorum Journal
Manager. However, in this book, we will explore both methods.
There are two ZKFC processes in the cluster, each residing on individual
NameNodes. ZKFC is a Zookeeper client process that uses Zookeeper to
maintain the session information and state information of the NameNode
(Active). ZKFC also initiates state transitions and fencing when performing
failovers. To learn more about ZKFC, its design and working, refer to the documentation link below:
https://issues.apache.org/jira/secure/attachment/12519914/zkfc-design.pdf
Always remember that fencing gets invoked whenever the Active NameNode is not reachable in terms of network heartbeat. To monitor this and to invoke fencing, a watchdog service, i.e. ZKFC, is used. It is the ZKFC that triggers fencing so that no split-brain scenarios arise.
In Hadoop HDFS HA, we can implement two types of fencing: sshfence and shell-based fencing.
HDFS HA Parameters
Let us understand the ideal parameters responsible for setting up HDFS HA. The following parameters must always be set in hdfs-site.xml, with example property values.
The following parameters must be set in core-site.xml with example property
values:
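The book's parameter tables are not reproduced here; the sketch below shows the usual HA properties with illustrative values, assuming a nameservice named mycluster and the node names used in this chapter. In hdfs-site.xml:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>node1.mylabs.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>node2.mylabs.com:8020</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.mylabs.com:8485;node2.mylabs.com:8485;node3.mylabs.com:8485/mycluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
And in core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>node1.mylabs.com:2181,node2.mylabs.com:2181,node3.mylabs.com:2181</value>
</property>
(When sshfence is used, dfs.ha.fencing.ssh.private-key-files must also point to the hadoop user's private key.)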
Solution
Understand your initial setup and fill in the table given below. Ensure that in
Step 1 (under hosts file configuration) you add the IP address of your respective
machine.
Node Name | Desired Hostname | IP Address for Example Purpose | Your Machine/VM's IP Address
Node1 | node1.mylabs.com | 192.168.1.1 |
Node2 | node2.mylabs.com | 192.168.1.2 |
Node3 | node3.mylabs.com | 192.168.1.3 |
Node4 | node4.mylabs.com | 192.168.1.4 |
Step 1: Setup network Hostname and host file configuration on all 4 machines.
In a real production setup, a dedicated DNS server is used for hostnames. For our lab setup, however, we will use local hosts files as resolvers. Hadoop recommends using hostnames rather than IP addresses for host resolution.
Perform the following steps in each machine:
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname as mentioned in the previous step.
#For example, for Node1 it will be node1.mylabs.com
sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the host entries in all 4 machines.
#This file holds the information of all machines which need to be resolved.
#Each machine will have entries of all 4 machines that are participating in the cluster installation.
Once the configuration is done, you will need to restart the machines for the hostname changes to take effect. To restart a machine, you can type the following command:
sudo init 6
Step 2: Setup password-less SSH between the NameNode system and the other systems. In our example setup, we need password-less SSH between node1.mylabs.com and the other nodes participating in this cluster. We do this so that node1 can contact the other nodes to invoke all of the services. The reason for this kind of configuration is that the NameNode system is the single point of contact for all users and administrators.
Perform the following commands on node1:
Step 14: Stop all services from node1.mylabs.com. Don't worry if you see any warning or error messages; these only indicate whether a service actually required a stop command or not.
stop-all.sh
Step 15: Start the HDFS service on the cluster from node1.mylabs.com. Before starting, ensure node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com have the Zookeeper service running in Leader-Follower configuration. Then perform the following command:
start-dfs.sh
Following is the expected service layout for each node:
Solution
Ensure that you have setup the HDFS HA cluster as per the solution in
Problem Statement 1. Ensure that all services are live and active. Now perform
the following steps: Step 1: On node1.mylabs.com, configure yarn-site.xml with
the following parameters:
vi /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-
services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cognitocluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>node1.mylabs.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node2.mylabs.com</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>node1.mylabs.com:2181,node2.mylabs.com:2181,node3.mylabs.com:2181</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>node1.mylabs.com:9026</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node2.mylabs.com:9026</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
</configuration>
Now copy the same file to node2.mylabs.com, node3.mylabs.com, and node4.mylabs.com:
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
Step 2: Setup mapred-site.xml. This configuration is responsible for maintaining the MapReduce job configurations. We will configure MR jobs to run on top of YARN. This file is not available by default; however, a template file is provided. Let us now configure mapred-site.xml:
cp /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml.template /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
vi /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 3: Start ResourceManager manually in node1.mylabs.com and
node2.mylabs.com.
yarn-daemon.sh start resourcemanager
Step 4: To check the service state of each ResourceManager (active/standby), type the following commands:
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
Testing ResourceManager Failover
We can check whether ResourceManager Automatic Failover is functional by killing the process on node1.mylabs.com, assuming that node1.mylabs.com holds the Active ResourceManager. You can get the process id using the jps command. To kill the process, type:
kill -9 <process_id>
e.g. kill -9 5878
After killing the process, type the following command to check the state. If the
state is transitioned from standby to active, then it means your failover
mechanism is working and the configuration is successful.
yarn rmadmin -getServiceState rm2
Implementing HDFS and YARN HA in an Existing
non-HA Cluster
Lab 32 Configuring HDFS and YARN ResourceManager HA in an Existing Non-HA Enabled Cluster without Any Data Loss
Problem Statement 3 – Enabling HDFS and YARN HA on an existing 4 node non-HA cluster.
Step 10: Stop all services from node1.mylabs.com. Don't worry if you see any warning or error messages; these only indicate whether a service actually required a stop command or not.
stop-all.sh
Step 11: Start the HDFS service on the cluster from node1.mylabs.com. Before starting, ensure that node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com have the Zookeeper service running in Leader-Follower configuration. Then perform the following command:
start-dfs.sh
Step 12: Start ResourceManager manually in node1.mylabs.com and
node2.mylabs.com.
yarn-daemon.sh start resourcemanager
Step 13: Start the NodeManagers from node1.mylabs.com using the following command:
yarn-daemons.sh start nodemanager
Following is the expected service layout for each node.
In this section, we will configure a federated HA-enabled cluster. For our hands-
on exercise we will create a 5 node cluster with the following specification.
Team | Node Name | HostName | Services Running
DBA | Node1 | dba1.mylabs.com | NameNode(active), DFSZKFailoverController, QuorumPeerMain, JournalNode
DBA | Node2 | dba2.mylabs.com | NameNode(standby), DFSZKFailoverController, QuorumPeerMain, JournalNode
Analytics | Node3 | analytics1.mylabs.com | NameNode(active), DFSZKFailoverController, QuorumPeerMain, JournalNode
Analytics | Node4 | analytics2.mylabs.com | NameNode(standby), DFSZKFailoverController
Shared HDFS | Node5 | dn1.mylabs.com | DataNode, JournalNode
Scenario: We need to create a federated HA-enabled cluster such that the same cluster can be shared by two project teams, viz.
a. DBA Team
b. Analytics Team
The intention of this cluster is to re-use the storage for multiple teams with separation of metadata. We also need to ensure the single point of failure is removed by introducing Standby NameNodes, resulting in an HA-enabled environment.
Solution
A typical Federated HA-enabled cluster looks something like this in a production setup.
In the figure above, we have two sets of Namenodes, one for the DBA Team and the second for the Analytics Team. As shown above, each team has one active and one standby Namenode, which ensures the single point of failure is removed. However, for our setup we will create something similar to the one shown below:
Let's start. Understand your setup and fill in the table given below. Ensure that in Step 1 (under hosts file configuration) you add the IP address of your machine.
Node Name | Desired Hostname | IP Address for Example Purpose | Your Machine/VM's IP Address
Node1 | dba1.mylabs.com | 192.168.1.1 |
Node2 | dba2.mylabs.com | 192.168.1.2 |
Node3 | analytics1.mylabs.com | 192.168.1.3 |
Node4 | analytics2.mylabs.com | 192.168.1.4 |
Node5 | dn1.mylabs.com | 192.168.1.5 |
Step 1: Setup the network hostname and hosts file configuration on all 5 machines. In a real production setup, a dedicated DNS server is used for hostnames. For our lab setup, however, we will use local hosts files as resolvers. Hadoop recommends using hostnames rather than IP addresses for host resolution.
Perform the following steps in each machine.
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname as mentioned in the previous step.
#For example, for Node1 it will be dba1.mylabs.com
sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the host entries in all 5 machines.
#This file holds the information of all machines which need to be resolved.
#Each machine will have entries of all 5 machines that are participating in the cluster installation.
Once the configuration is done, you will need to restart the machines for the hostname changes to take effect. To restart a machine, you can type the following command:
sudo init 6
Step 2: Setup password-less SSH between your NameNode system and the other systems. In our example setup, we need password-less SSH between dba1.mylabs.com (Node1) and the other nodes participating in the cluster. This is done so that Node1 can contact the other nodes to invoke the services. The reason for this kind of configuration is that the NameNode system is the single point of contact for the users and administrators.
Perform the following commands on node1:
Step 15: You can check the WebUI of each Namenode to identify the active and standby Namenode of each department.
http://dba1.mylabs.com:50070
http://dba2.mylabs.com:50070
http://analytics1.mylabs.com:50070
http://analytics2.mylabs.com:50070
Performing HDFS Rolling Upgrade from Hadoop-2.7.3 to
Hadoop-2.8.1
Lab 34 Performing Rolling Upgrade from Hadoop-2.7.3 to Hadoop-2.8.1 in an Existing 4 node HDFS and YARN RM HA-Enabled Cluster
Problem Statement: Upgrade your existing hadoop-2.7.3 cluster into a hadoop-2.8.1 cluster with the following specification.
Node Hostname | Existing Cluster (hadoop-2.7.3) | After Upgrade (hadoop-2.8.1)
node1.mylabs.com | NameNode(active), DFSZKFailoverController, QuorumPeerMain, JournalNode | NameNode(active), DFSZKFailoverController, QuorumPeerMain, JournalNode
node2.mylabs.com | NameNode(standby), DFSZKFailoverController, QuorumPeerMain, JournalNode | NameNode(standby), DFSZKFailoverController, QuorumPeerMain, JournalNode
node3.mylabs.com | DataNode, QuorumPeerMain, JournalNode | DataNode, QuorumPeerMain, JournalNode
node4.mylabs.com | DataNode | DataNode
Step 1: Perform the following command on node1.mylabs.com and node2.mylabs.com (the systems holding the active and standby namenodes):
hdfs dfsadmin -rollingUpgrade prepare
This command creates an fsimage for rollback and prepares the Namenode systems for rolling upgrade. You can verify the status of the rolling upgrade either in the WebUI of the namenode or using the following command:
hdfs dfsadmin -rollingUpgrade query
Step 2: Once the Namenode systems are enabled for rolling upgrade, stop all hdfs-associated services on the standby Namenode and perform the Hadoop upgrade. In this example, I am assuming node2.mylabs.com holds the standby namenode service, so the following commands will be performed on node2.mylabs.com:
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop zkfc
hadoop-daemon.sh stop journalnode
Now install hadoop-2.8.1 in node2.mylabs.com as explained below:
Once you complete the above steps, start the datanode and journalnode services in node3.mylabs.com:
hadoop version
(Ensure it shows 2.8.1)
hadoop-daemon.sh start journalnode
hadoop-daemon.sh start datanode
Step 8: Now upgrade the second datanode. Perform the following command in node4.mylabs.com:
hdfs dfsadmin -shutdownDatanode node4.mylabs.com:50020 upgrade
Now install hadoop-2.8.1 in node4.mylabs.com as explained below:
Once you complete the above steps, start the datanode service in node4.mylabs.com:
hadoop version
(Ensure it shows 2.8.1)
hadoop-daemon.sh start datanode
Step 9: Now that we have upgraded each machine successfully and all services and HDFS operations are live and active, you can finalize the upgrade. To do so, perform the following command in node1.mylabs.com:
hdfs dfsadmin -rollingUpgrade finalize
You have now successfully performed a rolling upgrade!
Summary
a. Metastore Server
b. Storage Layer
c. Processing Engine
Solution: Even though we write the steps for a single node cluster, the same can be performed on a Multinode cluster by installing Hive either on the NameNode system or on the Edge server of your cluster.
Installation Steps
Step 1: Ensure Hadoop is installed and configured.
Step 2: Install MySQL in Ubuntu.
sudo apt-get install mysql-server
During the installation process, you may be asked to set the password for the root user. For usability, let us assume that the password set is '123456', since we will be using it in hive-site.xml later.
Step 3: Download and extract the Apache Hive binaries
wget http://www-eu.apache.org/dist/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
tar -xvzf apache-hive-1.2.2-bin.tar.gz
Step 4: Rename the extracted folder (for convenience purposes only).
mv apache-hive-1.2.2-bin hive
Step 5: Inform the system about Hive by setting all the necessary environment variables
vi /home/hadoop/.bashrc
export HIVE_PREFIX=/home/hadoop/hive
export PATH=$PATH:$HIVE_PREFIX/bin
export PATH=$PATH:$HIVE_PREFIX/conf
Now save it!
Step 6: Update the bash session and change the ownership of the folders.
exec bash
Step 7: Configure hive-env.sh.
vi /home/hadoop/hive/conf/hive-env.sh
export HADOOP_HOME=/home/hadoop/hadoop2
export HIVE_CONF_DIR=/home/hadoop/hive/conf
Step 8: Configure hive-site.xml.
vi /home/hadoop/hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoopvm:3306/hivemetadb?createDatabaseIfNotExist=true</value>
<!-- Here hadoopvm is the hostname of the system where MySQL is installed -->
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<!-- Here root is the username of the MySQL Server -->
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<!-- Here 123456 is the password of the MySQL Server -->
</property>
</configuration>
Step 9: Configure the Metastore server to accept remote connections from any
host. This step is one of the most important steps, especially when configuring
Hive in a MultiNode cluster or in an Edge server.
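The exact commands are not shown in the source; a typical way to allow remote metastore connections, assuming MySQL 5.x on Ubuntu and the credentials configured above, is:
#Edit the bind-address line in MySQL's my.cnf (location varies by version) so MySQL listens on all interfaces
bind-address = 0.0.0.0
sudo service mysql restart
#Grant the metastore user access from any host
mysql -u root -p123456
mysql> GRANT ALL PRIVILEGES ON hivemetadb.* TO 'root'@'%' IDENTIFIED BY '123456';
mysql> FLUSH PRIVILEGES;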
HiveServer2 internally creates a Driver for each session, which then converses with the Metastore to fetch the metadata. HiveServer2 allows you to interact with Hive using JDBC, ODBC, Beeline, and so on. Let us now use HiveServer2 in our setup.
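The intermediate steps of the session are not reproduced in the source; starting HiveServer2 and connecting with Beeline typically looks like this (hadoopvm is our example hostname):
hiveserver2 &
beeline
beeline> !connect jdbc:hive2://hadoopvm:10000
0: jdbc:hive2://hadoopvm:10000> show databases;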
2 rows selected (1.448 seconds)
So, you now know how to interact with Hive using HiveServer2 and Beeline. If you wish to exit Beeline, the command to type is '!q'. Refer to the JIRA https://issues.apache.org/jira/browse/HIVE-4557
Kerberos
Property Name | Property Value
hive.server2.authentication | KERBEROS
hive.server2.authentication.kerberos.principal | hive/_HOST@YOUR_REALM.COM
hive.server2.authentication.kerberos.keytab | /home/hadoop/hive.keytab
LDAP
Property Name | Property Value | Explanation
hive.server2.authentication | LDAP | Sets LDAP as your authentication mechanism
hive.server2.authentication.ldap.url | LDAP_URL | Sets the LDAP server URL
hive.server2.authentication.ldap.Domain | ActiveDirectory_Domain_address | Used if you use Active Directory as your LDAP authentication mechanism
hive.server2.authentication.ldap.baseDN | OpenLDAP_baseDN | Used when you implement LDAP using OpenLDAP
Solution
Step 1: Stop HiveServer2. To do so, kill the process:
kill -9 <process_id_of_hiveserver2>
e.g. kill -9 3659
Step 2: Setup hive-site.xml to enable PAM. Append the following properties to the existing hive-site.xml within the <configuration> tag.
vi /home/hadoop/hive/conf/hive-site.xml
<property>
<name>hive.server2.authentication</name>
<value>PAM</value>
</property>
<property>
<name>hive.server2.authentication.pam.services</name>
<value>login,sshd</value>
</property>
Step 3: Configure JPAM. This is one of the most important steps; failure to do this will result in errors during authentication. Download the JPAM 64-bit package. If your machine is 32-bit, use the 32-bit JPAM package.
Step 4: Extract the package JPam-Linux_amd64-1.1.tgz
tar -xvf JPam-Linux_amd64-1.1.tgz
Step 5: Copy libjpam.so into the lib folder of Hadoop. Technically, we must copy it to java.library.path. You can find this path with the following command:
ps -ef | grep 10096
hadoop 10096 4466 13 21:42 pts/2 00:00:07 /usr/lib/jvm/java-7-oracle//bin/java -Xmx256m -Djava.library.path=/home/hadoop/hadoop2/lib -Djava.net.preferIPv4Stack=true - …
where 10096 is the process id of HiveServer2.
In the above output, we see that the location of java.library.path is /home/hadoop/hadoop2/lib. Copy libjpam.so to that location.
Step 6: Start HiveServer2 service
hiveserver2 &
Step 7: Open Beeline and try connecting to it without entering credentials.
!connect jdbc:hive2://hadoopvm:10000
Connecting to jdbc:hive2://hadoopvm:10000
Enter username for jdbc:hive2://hadoopvm:10000:
Enter password for jdbc:hive2://hadoopvm:10000:
17/06/29 21:48:02 INFO jdbc.Utils: Supplied authorities: hadoopvm:10000
17/06/29 21:48:02 INFO jdbc.Utils: Resolved authority:
hadoopvm:10000
17/06/29 21:48:02 INFO jdbc.HiveConnection: Will try to open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
17/06/29 21:48:06 INFO jdbc.HiveConnection: Could not open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
17/06/29 21:48:06 INFO jdbc.HiveConnection: Transport Used for JDBC connection: null
Error: Could not open client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000: Peer indicated failure: Error validating the login (state=08S01,code=0)
If you see the above error, “Error validating the login”, it means that your setup is successful and the connection is now secured. Let us now try entering the OS username and its password, which in our case are ‘hadoop’ and ‘123456’.
!connect jdbc:hive2://hadoopvm:10000
Connecting to jdbc:hive2://hadoopvm:10000
Enter username for jdbc:hive2://hadoopvm:10000: hadoop
Enter password for jdbc:hive2://hadoopvm:10000: ******
17/06/29 21:48:15 INFO jdbc.Utils: Supplied authorities:
hadoopvm:10000
17/06/29 21:48:15 INFO jdbc.Utils: Resolved authority:
hadoopvm:10000
17/06/29 21:48:15 INFO jdbc.HiveConnection: Will try to open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
Connected to: Apache Hive (version 1.2.2)
Driver: Spark Project Core (version 1.6.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://hadoopvm:10000>
You will see that this time, the authentication is successful and you can now
access Beeline to perform HQL operations.
What is HBase.
HBase v/s HDFS.
HBase v/s RDBMS.
HBase Architecture, Daemons and Components.
Installing SingleNode HBase cluster.
Installing Single-Master MultiNode HBase cluster.
Installing Multiple-Master MultiNode HBase cluster.
Introducing HBase Shell.
Common HBase Admin Commands.
Hive-HBase Integration.
Bulk Loading data in HBase using Hive.
In an ideal HBase model, all the data is stored in columns; that is why HBase is known as a columnar database. In the previous command, we created a table and a column-family. The column-family holds all dynamic columns. Also, in HBase the row key is an essential entity: ideally, whichever field contains unique data can act as the row key (think of a Primary Key field). Let us see how to insert this record in HBase.
hbase(main):002:0> put
‘bigdataclassmumbai’,1,’employee:fname’,’Prashant’
0 row(s) in 0.1410 seconds
hbase(main):003:0> put
‘bigdataclassmumbai’,1,’employee:lname’,’Nair’
0 row(s) in 0.0050 seconds
hbase(main):004:0> put
‘bigdataclassmumbai’,1,’employee:age’,’30’
0 row(s) in 0.0040 seconds
hbase(main):005:0> put
‘bigdataclassmumbai’,1,’employee:role’,’Engineer’
0 row(s) in 0.0090 seconds
To list the contents of the table:
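The command itself is not shown in the source; in the HBase shell, listing is done with the scan command (prompt numbering is illustrative):
hbase(main):006:0> scan 'bigdataclassmumbai'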
To check the status of the cluster:
hbase(main):008:0> status
2 servers, 0 dead, 1.5000 average load
Common HBase Admin Commands
Lab 43 Performing Hive-HBase Integration for Data Interaction
Step 1: Ensure Hive and HBase are installed. We write this step keeping in mind a Single Node cluster; however, the steps are the same in a MultiNode cluster.
Step 2: Ensure the Hadoop and HBase services are live and active.
Step 3: Start the HBase shell and create a table in HBase which we will integrate with Hive.
hbase shell
Version 1.3.1, r930b9a55528fe45d8edce7af42fef2d35e77677a, Thu
Apr 6 19:36:54 PDT 2017
hbase(main):001:0> create ‘hr_hbase’,’employee’
0 row(s) in 1.4930 seconds
=> Hbase::Table - hr_hbase
Step 4: Let us insert some data in the employee column family. For this scenario,
we will assume empid will be the row key.
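The actual put statements are omitted in the source; a representative pair with hypothetical values, followed by a scan, might be:
hbase(main):010:0> put 'hr_hbase','1','employee:ename','Prashant'
hbase(main):011:0> put 'hr_hbase','2','employee:ename','Nair'
hbase(main):012:0> scan 'hr_hbase'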
2 row(s) in 0.0360 seconds
hbase(main):013:0> exit
Step 5: Start Hive. Create an external table and link the same with the HBase
table.
hive
hive> show databases;
OK
bigdataclassmumbai
default
Time taken: 0.02 seconds, Fetched: 2 row(s)
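The CREATE statement itself is not reproduced in the source; a typical Hive-HBase mapping using the HBaseStorageHandler, assuming hypothetical columns empid and ename, looks like this:
hive> CREATE EXTERNAL TABLE hr_hive(empid int, ename string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,employee:ename")
TBLPROPERTIES ("hbase.table.name" = "hr_hbase");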
OK
Time taken: 0.759 seconds
Step 6: Since we have already entered data in HBase, let us check to confirm
whether the same data is reflected in Hive.
hive> show tables;
OK
hr_hive
Time taken: 0.037 seconds, Fetched: 1 row(s)
hive> select * from hr_hive;
OK
Step 2: Now insert the data from the staging table into the actual table that is linked with HBase, as sketched below.
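The INSERT statement is not shown in the source; assuming a staging table named hr_staging with a matching schema, it would look like:
hive> INSERT INTO TABLE hr_hive SELECT * FROM hr_staging;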
To check the insertion, you can run select * from hr_hive. You can also verify the same in the HBase shell.
Summary
Apache Sqoop
Sqoop Operations
List Databases
List Tables
Codegen
Eval
Sqoop Import from DB to HDFS
Sqoop Import from DB to Hive
Sqoop Import from DB to HBase
Sqoop Export from HDFS to DB
Sqoop Jobs
RDBMS to HDFS
RDBMS to Hive, and
RDBMS to HBase.
Step 1: Download the Apache Sqoop tar file and place it in the home folder of the hadoop user (/home/hadoop).
wget http://redrockdigimark.com/apachemirror/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Step 2: Extract the tar file and rename the extracted folder to sqoop.
tar -xvzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop
Step 3: Add the below entries in the .bashrc file
vi /home/hadoop/.bashrc
#Sqoop Configuration
export SQOOP_PREFIX=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_PREFIX/bin
export HADOOP_HOME=/home/hadoop/hadoop2
export HADOOP_COMMON_HOME=/home/hadoop/hadoop2
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop2
export HIVE_HOME=/home/hadoop/hive
export HBASE_HOME=/home/hadoop/hbase
Step 4: Copy the MySQL jar file into the lib folder of Sqoop
cp /home/hadoop/mysql-connector-java-5.1.42-bin.jar /home/hadoop/sqoop/lib/.
The Sqoop installation is now complete.
Note: Here I am assuming you have installed MySQL on your system. In case you haven't, perform the following command to install MySQL:
sudo apt-get install mysql-server
To start MySQL,
mysql -u root -p123456
Refer to https://bigdataclassmumbai/hadoop-book for more details on setting up MySQL and loading example databases for Sqoop practice.
a. Select Query Eval: Let’s evaluate emp table with a simple select query
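The command itself is omitted in the source; a minimal eval against the emp table, using the connection details from the later labs, would be:
sqoop eval --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --query "SELECT * FROM emp LIMIT 5"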
1. Sqoop parses the query provided to ensure that there are no syntax errors and all required parameters are satisfied. If the required parameters are not passed, it may either fail or take the default values.
2. After parsing, Sqoop performs code generation. Code generation produces MR code (a DAO class packaged as a JAR) which can later be submitted to YARN to initialize the processing. Sqoop internally uses MR for its operation.
3. Once the code is generated (technically the JAR file), the JAR is then
submitted to YARN to perform processing. Once processing is done,
the mapper output will then be stored in the HDFS output location
defined in the Sqoop command.
4. If you are performing Hive or HBase Transfer, the above step will be
considered a staging output which will then be transferred to Hive or
HBase depending on the type of Import.
Problem statement: Import all records from emp table whose salary is greater
than 200000
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp --target-dir /sqoop/emp_subset_2L --where "esal > 200000"
Import Using SQL Query
Lab 55 Importing a Table Using SQL Query in Apache Sqoop
If you want to import data from a table using SQL query, you can use --query
parameter.
Problem statement: Import all records from emp table whose salary is greater
than 90000 and who live in Mumbai location.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --target-dir /sqoop/output90000Mumbai --query "select * from emp where esal > 90000 and location = 'Mumbai' and \$CONDITIONS" -m 1
Where:
--query accepts the query to be executed; the output of the query is emitted to the specified output location.
Please note that whenever you use the --query parameter, it is mandatory to include the where condition with \$CONDITIONS, as shown in the Sqoop query above. For the next example, we will use --query again to show how to use it even when you don't need a 'where' clause in the query.
Problem Statement: Import first two records in HDFS location /sqoop/final with
delimiter as ‘|’ and ensure that there exists only one mapper output.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --target-dir /sqoop/final --query "select * from emp where \$CONDITIONS LIMIT 2" -m 1 --fields-terminated-by '|'
Where:
-m 1 ensures that there is only one mapper output.
--fields-terminated-by ensures the delimiter is '|'.
Fig 2 – Apache Sqoop Hive Import Operation Flow
As shown above, Import to Hive is a three-step process: Sqoop first imports the data into a temporary HDFS staging location, then generates the Hive table definition, and finally loads the staged data into the Hive warehouse.
Problem Statement: Import data from the Database to Apache Hive directly, ensuring a new table named 'employee' gets created inside the database named hr in Apache Hive, with /sqoop/hivetemp2 as the HDFS temporary location/staging area. Assume the hr database already exists in Apache Hive.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook
--username root --password 123456 --table emp --hive-import --
create-hive-table --hive-table hr.employee --target-dir
/sqoop/hivetemp2
Where:
--hive-import ensures the import is a Hive import.
--create-hive-table ensures a table called 'employee' inside the hr database is created automatically during the import process. The name of this table is taken from the --hive-table parameter.
--hive-table specifies the name of the table to be maintained/used during the Hive import.
Importing Data from Database to Apache HBase
In this section, we will see how to transfer data from the Database to HBase. In the case of HBase, the operation is similar to Hive Import, with the only difference being that you must create a table and a column-family in HBase before performing the Import operation. Also, it is the responsibility of the user to determine which field in the table will be the row key. Usually the row key is the primary key of the table. However, any field in the database can act as a row key; the only criterion is that it should hold unique values.
Lab 59 Importing Data from Database to HBase
Solution
Step 1: Start the HBase shell
hbase shell
Step 2: Create a table named hrdept with column family employee
hbase(main):001:0> create ‘hrdept’,’employee’
0 row(s) in 1.5500 seconds
=> Hbase::Table - hrdept
hbase(main):002:0> exit
Step 3: Initiate the Sqoop transfer using the following command:
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp --hbase-table hrdept --column-family employee --hbase-row-key eid -m 1
Where:
--hbase-table is used to define the name of the HBase table.
--column-family is used to specify the name of the column family in which
the data is to be transferred.
--hbase-row-key is used to specify which field in the table should be
considered as the row key for HBase.
Step 4: Check the output in HBase
hbase(main):001:0> scan ‘hrdept’
Note: Sqoop provides a parameter called --hbase-create-table which can create the HBase table and column family for you. However, this works only when your Sqoop version is compatible with the HBase version. See the below link for more details: https://issues.apache.org/jira/browse/SQOOP-2759
Problem Statement: Transfer the data from HDFS to the Database using the following specifications:
HDFS file location – /data/sample
Database Name – test
Table Name – hobby
Table Fields – eid(int), ehobby(varchar(30))
Solution
Step 1: Open MySQL and create the relevant database and table
mysql -u root -p123456
mysql> create database test;
Query OK, 1 row affected (0.01 sec)
mysql> use test;
Database changed
mysql> create table hobby(eid int, ehobby varchar(30));
Query OK, 0 rows affected (0.01 sec)
Step 2: Initiate the export operation.
sqoop export --connect jdbc:mysql://localhost:3306/test --
username root --password 123456 --table hobby --export-dir
/data/sample --input-fields-terminated-by ‘,’ -m 1
Step 3: Check the result in MySQL
mysql> select * from test.hobby;
a. Import Operation
b. Export Operation
Let us see an example to understand how a Sqoop Import job works.
Lab 61 Creating and Working with Sqoop Job
Step 1: Let us first check if there are any existing saved Sqoop jobs
sqoop job --list
Step 2: Let us create a new Sqoop job to transfer data from the Database to HDFS with the following specifications:
Database Name: hadoopbook
Table Name: emp
HDFS location: /sqoop/job1output
sqoop job --create ImportJob1 -- import --connect
jdbc:mysql://hadoopvm:3306/hadoopbook --username root --
password 123456 --table emp --target-dir /sqoop/job1output -m 1
Note: There is a space between -- and import in the command above. Please ensure that you include the space to avoid errors.
Step 3: Running ImportJob1
sqoop job --list
Available jobs:
ImportJob1
sqoop job --exec ImportJob1
The above command may ask you to enter the Database password. Enter it and the command will execute the created job.
Let us see some more parameters of Sqoop jobs; two common ones are sketched below.
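For example (both are standard Sqoop flags):
sqoop job --show ImportJob1
#Displays the saved definition of ImportJob1
sqoop job --delete ImportJob1
#Deletes the saved job ImportJob1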
Summary
workflow.xml.
coordinator.xml.
bundle.xml.
The Apache Oozie Client is responsible for querying the Oozie Server. The Apache Oozie client is a CLI tool that the admin/user uses to interact with the Oozie server. A user submits an Oozie application via the Apache Oozie client to the Oozie server.
The Apache Oozie server stores information in the database and retrieves all information necessary to run an Apache Oozie application. It then runs the application on the Hadoop cluster. The Apache Oozie server is the core processing layer of Apache Oozie, responsible for managing Apache Oozie job scheduling and the execution of any Oozie applications that the user/admin submits. The Apache Oozie server runs in a web container like Apache Tomcat. The Apache Oozie server is a stateless server; however, it holds the state and job-related information of the Apache Oozie application in the database. The Apache Oozie server provides a REST API and a Java client so that clients can be written in any language.
Technically, the Apache Oozie Server acts like a client application for the Hadoop cluster. It reads the XML files from HDFS and runs the jobs on top of the Hadoop Cluster. It can run MapReduce jobs, Hive, Pig, Sqoop, DistCp, Spark, and many more.
The database in the Apache Oozie Architecture is used to store all job-related information like the state of the job, the reference of the job, the binaries or JARs the job needs to execute, and so on. Apache Oozie supports Derby, MySQL, Oracle, MSSQL, HSQL, and PostgreSQL databases. The default database is Derby, which is built into the Apache Oozie setup. However, in production we prefer to use other databases.
We will be working on Apache Oozie 4.2.0. Let us now start, step by step:
Part 1: Prepare the System for Building Oozie.
Step 1: Building Oozie requires Apache Maven, which can be downloaded from http://maven.apache.org/download.cgi. Once downloaded, untar it.
wget http://www-eu.apache.org/dist/maven/maven-3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz
tar -xvzf apache-maven-3.5.0-bin.tar.gz
Step 2: Inform the system about Apache Maven. This can be done by setting up the environment variables in the .bashrc file.
vi /home/hadoop/.bashrc
export PATH=$PATH:/home/hadoop/apache-maven-3.5.0/bin
Now save it!
Step 3: Apply the changes.
exec bash
Step 4: Download Apache Oozie and extract the same
wget http://archive.apache.org/dist/oozie/4.2.0/oozie-4.2.0.tar.gz
tar -xvzf oozie-4.2.0.tar.gz
Step 5: Open pom.xml and perform the following:
cd oozie-4.2.0
vi pom.xml
Find the below strings and perform the changes:
a. Edit Java Version. Search for <targetJavaVersion> tag in the file and
change the version of Java to 1.7 assuming you have installed JDK7. In
case you are using JDK8, you could change the below line to 1.8.
Ideally, it will be line 48 of your pom.xml file (if you are using the same
version as the one I am using).
<targetJavaVersion>1.7</targetJavaVersion>
b. Search maven-javadoc-plugin. You may see a tag <artifactId>maven-
javadoc-plugin</artifactId>. Ideally, it will be line 1491 of your
pom.xml (if you are using the same version as the one I am using). Add
the following lines inside the <configuration> tag. Please note that the
<additionalparam> tag is applicable only when you work in JDK8. In
case you are using JDK7, please don’t add the <additionalparam> tag
line.
<javadocExecutable>/usr/lib/jvm/java-7-
oracle/bin/javadoc</javadocExecutable>
<additionalparam>-Xdoclint:none</additionalparam>
c. Search hadoop.version and activate the profile of hadoop-2(true) and the
rest false. In my case, it is line 79 of my pom.xml file that I use for
changing the Hadoop version. Ensure you enter the same Hadoop
version that you have installed. While writing this book, I used Hadoop-
2.7.3 to test Apache Oozie. In case you are using Hadoop-2.8.0, make
the changes accordingly.
<hadoop.version>2.7.3</hadoop.version>
<hadoop.majorversion>2</hadoop.majorversion>
To activate hadoop2 profile, search hadoop-2. In my case, it is line 1788
of my pom.xml. Change the following tags as given below:
<activeByDefault>true</activeByDefault>
<hadoop.version>2.7.3</hadoop.version>
d. Search for Codehaus repository url. We will need to replace the link
with the one given below. In my case, it is line 145 of my pom.xml.
<url>https://repository-
master.mulesoft.org/nexus/content/groups/public/</url>
Now save pom.xml after all the 4 changes are done as explained above.
Let us now test whether Apache Oozie is capable of running a workflow. To do so, we will use the examples provided in the Apache Oozie source code folder. You can check all the examples in the location given below:
ls oozie-4.2.0/examples/target/oozie-examples-4.2.0-examples/examples/apps/
Let us try executing a MapReduce Example (map-reduce)
Step 1: Navigate to the map-reduce folder
cd oozie-4.2.0/examples/target/oozie-examples-4.2.0-examples/examples/apps/map-reduce
Step 2: Edit the job.properties file with the relevant configurations.
vi job.properties
Edit the lines as shown below:
nameNode=hdfs://hadoopvm:8020
jobTracker=hadoopvm:8032
#Add the below line. It's not there by default in the properties file
oozie.system.libpath=true
Now save it!
Step 3: Upload the examples folder to the home folder of HDFS
cd ../../../
hdfs dfs -put examples examples
This will upload your examples to the /user/hadoop/examples HDFS location.
Step 4: Running an Apache Oozie application. Assume your present working directory is /home/hadoop.
cd oozie
bin/oozie job -oozie http://hadoopvm:11000/oozie -config /home/hadoop/oozie-4.2.0/examples/target/oozie-examples-4.2.0-examples/examples/apps/map-reduce/job.properties -run
job: 0000000-170626143054944-oozie-hado-W
You will get a job id similar to the one shown above. You can now check the status of the job. Please note that the argument after -info is the job id picked from the previous command's output.
oozie job -oozie http://hadoopvm:11000/oozie -info 0000000-170626143054944-oozie-hado-W
In case there are any errors, you can check the log of the job using:
oozie job -oozie http://hadoopvm:11000/oozie -log 0000000-170626143054944-oozie-hado-W
Now we can say that our Apache Oozie setup is done successfully.
Creating a MapReduce Workflow
Workflow is the fundamental building block of Apache Oozie that contains a set
of actions. An action does the actual processing in the workflow. An action can
be a Hadoop MR job, Sqoop job, Hive scripts, Pig scripts, Shell scripts, Spark
Jobs, and so on.
Let us now learn how to create a MapReduce workflow. Ideally, as an
administrator, you will be given a JAR file which contains an MR code to be
triggered.
Lab 64 Creating and Running a Mapreduce Workflow
We can now see how to create an Apache Oozie application with a single
workflow.
Step 1: Create a folder that will hold your Apache Oozie application. Assume your present working directory is /home/hadoop.
mkdir MROozie
mkdir MROozie/input
mkdir MROozie/lib
As mentioned above, the input folder will hold the input file which is to be processed, and the lib folder will hold the MR JAR file which is to be executed.
Step 2: Copy MR job JAR file provided in the Apache Oozie example. This is
the same JAR file that we tried in our YARN chapter.
cp -r /home/hadoop/oozie-4.2.0/examples/target/oozie-examples-
4.2.0-examples/examples/apps/map-reduce/lib/oozie-examples-
4.2.0.jar /home/hadoop/MROozie/lib/.
Step 3: Create a sample input file inside input folder of MROozie.
vi /home/hadoop/MROozie/input/sample.txt
Welcome to Oozie
Welcome to Oozie
Step 4: Create a dedicated directory named oozie where we will maintain all our
applications which will be orchestrated by Apache Oozie.
hdfs dfs -mkdir oozie
hdfs dfs -mkdir oozie/mapreduce
Step 5: Let us understand and create the job.properties file.
Ideally, this file resides in the local file system and is used while launching the Apache Oozie application through the Apache Oozie client. You can consider this file the entry point of an Apache Oozie application, similar to main() in a Java or C++ program. It contains all the necessary information like input, output, and other arguments and parameters for Apache Oozie to invoke the workflow.
vi /home/hadoop/MROozie/job.properties
nameNode=hdfs://hadoopvm:8020
jobTracker=hadoopvm:8032
queueName=default
oozieRoot=oozie
oozie.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${oozieRoot}/mapreduce
Following is the explanation for each configuration set in this file:
Property Name | Explanation | Value
nameNode | Sets the NameNode URL of the Hadoop cluster. This helps Apache Oozie interact with HDFS for storage-related operations and access applications residing in HDFS. | hdfs://hadoopvm:8020
jobTracker | Specifies where we submit the MR job/application. In our case, we specify the default YARN ResourceManager port (8032). | hadoopvm:8032
queueName | Specifies the queue to be applied. | default
oozieRoot | Specifies the root folder name where the application will reside. In our case, we created the oozie folder in the home folder of HDFS. | oozie
oozie.system.libpath | Tells Oozie to look for JARs and essential libraries in the sharelib path. | true
oozie.wf.application.path | Specifies the location of workflow.xml in HDFS, which is the actual workflow. | ${nameNode}/user/${user.name}/${oozieRoot}/mapreduce
Step 6: Let us now write the Apache Oozie application XML code, i.e. the workflow.xml file. This file is the actual operation workflow; a minimal sketch follows.
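The book's workflow.xml is not reproduced here; the sketch below uses illustrative paths and the sample mapper/reducer classes shipped in the oozie-examples JAR copied in Step 2; adapt it to your own job:
vi /home/hadoop/MROozie/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="mr-demo-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/hadoop/oozie/mapreduce/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/hadoop/oozie/mapreduce/output</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>MR job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>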
Step 7: Copy the created application into HDFS
hdfs dfs -put /home/hadoop/MROozie/* /user/hadoop/oozie/mapreduce
Step 8: Run the Apache Oozie Application
oozie job -oozie http://hadoopvm:11000/oozie -config /home/hadoop/MROozie/job.properties -run
Common Commands in Apache Oozie
Let us explore some of the commonly used Apache Oozie commands
Summary
Apache Pig.
Apache Spark.
Apache Flume.
LOCAL
MAPREDUCE
TEZ_LOCAL
TEZ
SPARK_LOCAL (new inclusion in Pig-0.17.0)
SPARK (new inclusion in Pig-0.17.0)
Step 1: Download the Apache Flume tar file.
wget http://redrockdigimark.com/apachemirror/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz
Step 2: Extract the tar file
tar -xvzf apache-flume-1.7.0-bin.tar.gz
Step 3: Rename the extracted folder.
mv apache-flume-1.7.0-bin flume
Step 4: Setup the environment variables in the system.
vi /home/hadoop/.bashrc
#Add the following lines at the start of the file
export FLUME_HOME=/home/hadoop/flume
export PATH=$PATH:$FLUME_HOME/bin
Step 5: Update the environment variables.
exec bash
With this, the installation process is complete. Now, let's try an example to check whether Flume is working.
Step 1: Create a configuration file named myncgrab.conf in the /home/hadoop location:
ncgrab.sources=getfromnetcat
ncgrab.sinks=writetohdfs
ncgrab.channels=gotoram
ncgrab.sources.getfromnetcat.type=netcat
ncgrab.sources.getfromnetcat.bind=192.168.247.137
ncgrab.sources.getfromnetcat.port=9999
ncgrab.channels.gotoram.type=memory
ncgrab.sinks.writetohdfs.type=hdfs
ncgrab.sinks.writetohdfs.hdfs.path=hdfs://hadoopvm:8020/user/hadoop/flum
ncgrab.sinks.writetohdfs.hdfs.writeFormat=Text
ncgrab.sinks.writetohdfs.hdfs.fileType=DataStream
ncgrab.sources.getfromnetcat.channels=gotoram
ncgrab.sinks.writetohdfs.channel=gotoram
Step 2: Ensure that the Hadoop services are live and active, since the sink here is HDFS.
Step 3: Open two terminals, one to initialize the netcat listener and the other to run the Flume agent. For the netcat listener, type the following command:
nc -l 9999
Step 4: Initiate the Flume agent.
flume-ng agent -n ncgrab -f /home/hadoop/myncgrab.conf
Step 5: Now start emitting data from the netcat listener console. Type some words and press enter. You will see that all the inputs are stored in the HDFS location specified in the Flume configuration.
Summary