
Notion Press
Old No. 38, New No. 6
McNichols Road, Chetpet
Chennai - 600 031
First Published by Notion Press 2017
Copyright © Prashant Nair 2017
All Rights Reserved.
ISBN 978-1-947752-07-8
This book has been published with all reasonable efforts taken to make the
material error-free after the consent of the author. No part of this book shall be
used, reproduced in any manner whatsoever without written permission from the
author, except in the case of brief quotations embodied in critical articles and
reviews.
The Author of this book is solely responsible and liable for its content including
but not limited to the views, representations, descriptions, statements,
information, opinions and references [“Content”]. The Content of this book shall
not constitute or be construed or deemed to reflect the opinion or expression of
the Publisher or Editor. Neither the Publisher nor Editor endorse or approve the
Content of this book or guarantee the reliability, accuracy or completeness of the
Content published herein and do not make any representations or warranties of
any kind, express or implied, including but not limited to the implied warranties
of merchantability, fitness for a particular purpose. The Publisher and Editor
shall not be liable whatsoever for any errors, omissions, whether such errors or
omissions result from negligence, accident, or any other cause or claims for loss
or damages of any kind, including without limitation, indirect or consequential
loss or damage arising out of use, inability to use, or about the reliability,
accuracy or sufficiency of the information contained in this book.
Dedication
To my parents, my teachers, my students and my wife!
Contents

Preface
What This Book Covers
Lab Exercises Covered in This Book

1. Introducing Bigdata & Hadoop


2. Apache Hadoop Installation and Deployment
3. Demystifying HDFS
4. Understanding YARN and Schedulers
5. HDFS Federation and Upgrade
6. Apache Zookeeper Admin Basics
7. High Availability in Apache Hadoop
8. Apache Hive Admin Basics
9. Apache HBase Admin Basics
10. Data Acquisition using Apache Sqoop
11. Apache Oozie
12. Introducing Pig, Spark and Flume
Preface

Bigdata is a booming term in the market. Wherever we go, we see Bigdata; whenever we do a job search, we see Bigdata openings. Bigdata is a term that describes the capability of your software/hardware/framework/architecture to handle the data. If your existing architecture fails to load, process or store the data, we can say that you are facing a Bigdata problem. To solve Bigdata problems in a cost-efficient manner, solution architects started adopting Hadoop. There are many reasons why Hadoop is famous: it is cost-efficient, it is best suited for batch processing, its capability is proportional to your hardware setup, it integrates multiple processing tools and techniques, and many more.
This book is focused on basic to intermediate concepts and hands-on exercises on Hadoop and its ecosystem components with respect to administering and managing a typical Hadoop cluster. This book is designed for Linux admins, Windows admins, technical managers, DBAs, and technical and solution architects, to name a few. The book begins with an introduction to Hadoop, followed by installing, configuring and managing the cluster. We will also cover the common administration tasks usually done by Hadoop engineers, as well as Hadoop HDFS High Availability using QJM, Zookeeper in depth, job schedulers, workflow management using Oozie and many more.
So, whether you want to deep-dive into Hadoop or want to understand the bits and pieces of Hadoop, this book is for you.
What This Book Covers

Chapter 1, Introducing Bigdata and Hadoop, introduces you to the world of Bigdata and explores the roles and responsibilities of a Hadoop administrator.
Chapter 2, Apache Hadoop Installation and Deployment, deep dives right from building hadoop-2.8.0 to installing and configuring Hadoop in standalone, pseudo-distributed and distributed modes.
Chapter 3, Demystifying HDFS, talks in detail about HDFS storage and how it operates, and teaches how to access the HDFS layer using the CLI and the NFS gateway. It also covers some of the common administration tasks associated with HDFS.
Chapter 4, Understanding YARN and Schedulers, helps the reader understand YARN internals and best practices while setting up YARN in a production cluster. It also talks about implementing schedulers like the Capacity and Fair Schedulers.
Chapter 5, HDFS Federation and Upgrade, helps the reader understand the multi-tenancy concerns of the HDFS architecture, which are addressed using Federation. It also talks about how to implement HDFS Federation in the cluster, how to perform an HDFS cluster upgrade from Gen1 to Gen2, and how to perform a rolling upgrade.
Chapter 6, Apache Zookeeper Admin Basics, deals with understanding and implementing Zookeeper in standalone and leader-follower modes. We also discuss how to use the Zookeeper CLI to see the contents of the Zookeeper filesystem.
Chapter 7, High Availability in Apache Hadoop, deals with understanding how to overcome the single point of failure of the Namenode system. We will be implementing High Availability using QJM. We will also build and implement HA on a federated cluster.
Chapter 8, Apache Hive Admin Basics, talks about how Hive works, installing Hive with a MySQL metastore, using Hiveserver2 with the Beeline client, and securing the same.
Chapter 9, Apache HBase Admin Basics, deals with understanding the HBase architecture, and installing and configuring HBase in single node, single master, and multi-master setups. After that, we will learn some basic HBase shell commands and admin commands. Lastly, we will learn how to integrate HBase with Hive and how to perform bulk uploading of data in HBase.
Chapter 10, Data Acquisition using Apache Sqoop, talks about how to transfer data between RDBMS and Hadoop. We will be learning CLI commands with some variations to deal with Sqoop's import and export operations.
Chapter 11, Apache Oozie, deals with how to create and schedule workflows in Oozie. We will be building, installing and configuring Oozie. Once done, we will learn how to create a simple workflow.
Chapter 12, Installing Apache Pig, Spark and Flume, deals with the installation and configuration of Apache Pig, Apache Spark and Apache Flume.
Lab Exercises Covered in This Book

Chapter | Lab | Description
2 | 1 | Building Apache Hadoop 2.8.0 from Scratch
2 | 2 | Setting up Apache Hadoop 2.8.0 in Standalone Mode (CLI-Minicluster Mode)
2 | 3 | Setting up Apache Hadoop 2.8.0 in Pseudo-Distributed Mode (Single Node Cluster)
2 | 4 | Setting up Apache Hadoop 2.8.0 in Distributed Mode (Multinode Cluster)
3 | 5 | Working with HDFS Filesystem Shell Commands
3 | 6 | Setting up Replication Factor of an Existing Cluster
3 | 7 | Dynamically Setting up Replication Factor During Specific File Upload
3 | 8 | Setting up Block Size in Existing Hadoop Cluster
3 | 9 | Adding Nodes in an Existing Hadoop Cluster Without Cluster Downtime
3 | 10 | Decommissioning Datanode in Existing Hadoop Cluster Without Data Loss and Downtime in the Cluster
3 | 11 | Whitelisting Datanodes in a Hadoop Cluster
3 | 12 | Working with Safemode (Maintenance Mode) in Hadoop
3 | 13 | Checkpointing Metadata Manually
3 | 14 | Setting up NFS Gateway to Access HDFS
3 | 15 | Setting up Datanode Heartbeat Interval
3 | 16 | Setting up File Quota in HDFS
3 | 17 | Removing File Quota in HDFS
3 | 18 | Setting up Space Quota in HDFS
3 | 19 | Removing Space Quota in HDFS
3 | 20 | Configuring Trash Interval in HDFS and Recovering Data from Trash
4 | 21 | Creating Multiple Users and Groups in Ubuntu System
4 | 22 | Setting up Capacity Scheduler in YARN
4 | 23 | Setting up Fair Scheduler in YARN
5 | 24 | Setting up HDFS Federation in a 4 Node Cluster
5 | 25 | Implementing ViewFS in Existing 4 Node Federated Cluster
5 | 26 | Performing Hadoop Upgrade from Gen1 (1.2.1) to Gen2 (2.7.3)
6 | 27 | Setting up Zookeeper in Standalone Mode
6 | 28 | Setting up Zookeeper in Leader-Follower Mode
6 | 29 | Running Basic Commands in Zookeeper CLI
7 | 30 | Installing and Configuring 4-Node HDFS HA-Enabled Fresh Cluster
7 | 31 | Configuring YARN ResourceManager HA in the 4 Node HDFS HA-Enabled Cluster
7 | 32 | Configuring HDFS and YARN ResourceManager HA in an Existing Non-HA Enabled Cluster Without Any Data Loss
7 | 33 | Building a Federated HA-Enabled Cluster
7 | 34 | Performing Rolling Upgrade from Hadoop-2.7.3 to Hadoop-2.8.0 in an Existing 4 Node HDFS and YARN RM HA-Enabled Cluster
8 | 35 | Setting up Apache Hive with MySQL Database as a Metastore Server
8 | 36 | Connecting Beeline Client to hiveserver2
8 | 37 | Configuring hiveserver2 to Secure Beeline Client Access
8 | 38 | Configuring Hive Credential Store
9 | 39 | Installing and Configuring Apache HBase in Single Node Cluster
9 | 40 | Installing and Configuring Apache HBase Single HMaster Multinode Cluster
9 | 41 | Installing and Configuring Apache HBase Multiple HMaster for HA in a Multinode Cluster
9 | 42 | Working with HBase Shell
9 | 43 | Performing Hive-HBase Integration for Data Interaction
9 | 44 | Bulk Loading the Data in HBase Using Apache Hive
9 | 45 | Bulk Loading Delimited File Directly in HBase
10 | 46 | Installing and Configuring Apache Sqoop 1.4.6
10 | 47 | Listing the Databases in MySQL Using Apache Sqoop
10 | 48 | Listing the Tables in a Database in MySQL Using Apache Sqoop
10 | 49 | Generating DAO for a Table Using Apache Sqoop
10 | 50 | Performing Sqoop Eval for Select Query
10 | 51 | Performing Sqoop Eval for Insert Query
10 | 52 | Importing a Table from Database Having Primary Key Column Using Sqoop
10 | 53 | Importing a Table from Database Without a Primary Key Column and Specifying a Destination Location in HDFS Using Sqoop
10 | 54 | Importing a Subset of Data from a Table in Apache Sqoop
10 | 55 | Importing a Table Using SQL Query in Apache Sqoop
10 | 56 | Changing the Output Delimiter During Import
10 | 57 | Performing Incremental Import
10 | 58 | Importing Data from Database to Hive
10 | 59 | Importing Data from Database to HBase
10 | 60 | Performing Sqoop Export Operation from HDFS to Database
10 | 61 | Creating and Working with Sqoop Job
11 | 62 | Building and Installing Apache Oozie
11 | 63 | Testing Apache Oozie with an Example Mapreduce Workflow
11 | 64 | Creating and Running a Mapreduce Workflow
12 | 65 | Installing and Configuring Apache Pig
12 | 66 | Installing and Configuring Apache Spark
12 | 67 | Installing and Configuring Apache Flume and Grabbing Data from Netcat Source
Chapter 1
Introducing Bigdata & Hadoop

In this chapter, you will learn:

What is Bigdata?
IBM's definition of Bigdata
Types of Bigdata
Typical Bigdata Project Phases
Introducing Hadoop
Features of Hadoop
Role of Hadoop Administrator in Bigdata Industry
Creating your lab setup for hands-on exercise

What is Bigdata?
Whenever I start my training on Bigdata Hadoop Development or Hadoop Administration, the first question I ask my participants is, "WHAT IS BIGDATA?" It may sound crazy because the participants are there to learn exactly that. However, the reason I ask this question is to understand their mindset about the Bigdata paradigm. I also want you to take a minute and think about what Bigdata is from your viewpoint.
If you ask me, Bigdata is a term - not a technology, nor a tool. It is all about the ability of the software, or framework, or architecture to handle the data (handle refers to LOAD, PROCESS, and STORE/FORWARD data). So during any execution process, if an application or infrastructure fails because of the data, I can say I am facing a Bigdata problem. Think of it this way: there is a text file of size 3GB named file1.txt and you want to open it. If you try opening it using Notepad, you will observe that Notepad does not respond. However, if you try to open the same file using WordPad, you will see some latency, but WordPad will be able to load the data in RAM. I know this example sounds lame, but try to see the hidden message. In this example, Notepad faced a Bigdata problem. Notepad was not able to open the file even though the system had enough disk space, RAM, and OS support. But when I opened the same data using WordPad, I was successful, because for WordPad it is just normal data. Thus, any data can become Bigdata.
IBM's Definition of Bigdata
According to IBM, everyday data that comes from anywhere, such as sensors, social media, GPS and so on, is Bigdata. Bigdata spans three dimensions, viz. the 3 V's: Volume, Velocity, and Variety.

Volume: Volume ideally refers to the size of the data.
Velocity: Velocity refers to the speed at which data is transferred or updated. For example, the turnaround time required for the author of a Facebook post to figure out that his friend clicked the Like button, or the time taken to detect an anomaly, is velocity.
Variety: It refers to the type of the data. It can be either structured, semi-structured, or unstructured data.

Ideally, Bigdata is not just about size. It is about gaining insight. That is why IBM also introduced the fourth V, i.e. Veracity or Value, which is all about getting some value or insights out of the data.

Types of Bigdata
Any Bigdata that is meant for processing can be categorized as follows:
Structured data is organized data consisting of elements that are addressable during processing and analysis. For example, data stored in a database can be considered structured data.
Semi-structured data is also organized data, but it does not follow a formal structure. It does, however, contain markers or tags to enforce hierarchies of fields and records within the data. For example, XML and JSON data are considered semi-structured data.
Unstructured data is the actual raw data. It does not have a formal structure or metadata. Usually this kind of data can be converted into structured data, but it is a tedious and time-consuming process. Some examples of unstructured data include text files, images, videos, PDFs, and so on.

Typical Bigdata Project Phases for Batch Processing Paradigm


A typical Bigdata project has 5 phases. They are:

a. Data acquisition phase
b. Data pre-processing phase
c. Data transformation phase
d. Data view phase
e. Visualization/Analytics phase

as shown in the illustration below:

Let us explore each of them in brief:


The Data Acquisition phase concentrates on getting raw data from multiple
heterogeneous sources. These sources could be in the same or different formats and structures. This data can then be stored in a common persistent distributed storage. An
example would be, getting data from databases and smartphones and storing the
same in a common distributed storage in the form of files. This phase is
predominantly the data collection stage.
Data pre-processing phase concentrates on completing pre-processing tasks.
This phase ensures that the raw data is compatible with the processing layer.
This is an essential step most of the time, since it ensures your data is in a common format and can be understood by your processing engine. For example,
assume you received data from databases in the form of CSV and from a
smartphone in the form of XML. Let us assume your processing layer logic can
only read CSV. Here, the role of this phase is to convert the XML data (that is
ideally incompatible with the processing layer) into CSV to make it compatible.
In short, this phase helps in standardizing the data format and structure.
Data Transformation phase is the actual processing phase. As the name
suggests, it helps in performing data transformation operations like data
mapping, schema building, filtering, grouping, joining, summarizing, merging,
and many more. For example, let us presume your CSV file holds internet log
data of India, but you are interested only in data generated in the Mumbai
location. In this case, using the transformation techniques in this phase, you can
filter the Mumbai data and perform your activities. You can also chain multiple
operations on the data.

The Data view phase is concerned with working with cleansed and logically structured data. This phase is characterized by the ability to perform real-time queries to get the desired output and broadcast the same in a web application.
For example, the output generated in the previous phase contains all Mumbai
data. The same can be queried based on certain conditions and can be
broadcasted to a web application. Please note that in this phase, we deal with
structured data. We generally use NoSQL components or SQL components to
achieve this phase.

Visualization and Analytics phase gathers intelligence over the data and its
subsequent usage. As an example, we can consider data representation in the
form of a two-dimensional graph or predictions, recommendations, and so on.
Introducing Hadoop
Apache Hadoop is a Java-based framework that is meant for distributed batch processing and distributed storage of extremely large amounts of data. It is a highly scalable and reliable framework that runs across a cluster of commodity hardware. Hadoop can handle any kind of data, be it structured, semi-structured, or unstructured data. However, Hadoop was designed to handle unstructured data.
Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, which is itself a part of the Lucene project.
The core of Apache Hadoop consists of a storage part, officially termed HDFS, i.e. the Hadoop Distributed File System, and a processing part based on the MR programming model, i.e. the MapReduce model. The Apache Hadoop MapReduce and HDFS components were inspired by Google's papers on MapReduce and the Google File System.

Apache Hadoop has two generations: Generation1 and Generation2. Generation1 consists of HDFS and MR, while Generation2 consists of HDFS and YARN. YARN stands for Yet Another Resource Negotiator and is meant for cluster management and processing tool management. The default auxiliary processing service in Generation2 Hadoop is MRv2. In this book, you will learn about each of these components in detail.
Features of Hadoop
Let us see some key features of Apache Hadoop.

1. Hadoop supports horizontal scalability. You can easily add and decommission nodes depending on the project requirement.
2. Hadoop is fault tolerant when it comes to data storage. Hadoop ideally replicates data across multiple nodes, ensuring that the data is accessible in a failure-like scenario.
3. Hadoop is faster when it comes to batch processing. This is because of the MR programming model, which supports parallel distributed processing. In short, multiple nodes in the cluster participate in parallel in the execution of a single program.
4. Hadoop is open-source software. Thus, if you have a dedicated research team in your organization, you can re-code and customize the entire framework based on the business requirement.
5. Hadoop moves computation to the data for lower latency and bandwidth usage. Ideally, the code travels to the data when it comes to Hadoop computation (data locality).

Roles and Responsibilities of a Hadoop Administrator in the Bigdata Industry
The Hadoop administrator role is now considered one of the most in-demand jobs in the Information Technology sector. Every organization that performs analytics, sells products containing Bigdata analytics features, or uses Bigdata-ready software is now looking for skilled Linux administrators who can adapt themselves based on the application and project that the cluster is designed for. It is difficult to list all the possible responsibilities as they change dynamically depending on the product hosted in the Hadoop architecture and company requirements. However, to list a few:

Install, configure, and manage the Hadoop cluster based on the capacity planning provided by Bigdata solution architects. This is something that is also handled using DevOps practices; tools like Puppet, Chef, and Ansible are used to deploy clusters based on demand.
Upgrading Hadoop cluster (Realtime).
Read and interpret the error logs and derive solutions from the
concerned departments.
Scaling of nodes based on demand-supply business metrics.
Implementing Security standards in the cluster.
Fine tuning the cluster at OS and Hadoop framework level.
Incident Management and Reporting of the cluster.
Backup, Recovery, and DR Management.
Attending meetings during customer feedback, analyzing their
problems, and finding fixes for the same.
Interacting and maintaining relations with the commercial subscription
providers of Hadoop (like Cloudera, Hortonworks etc) and staying
active in JIRA of Apache Hadoop.
Giving feedback and suggestions to research team for the creation of
hot fixes (if organization has a research and development team
dedicated for Hadoop).
Ensuring compatibility of the cluster with the Hadoop ecosystem components so that there are no communication errors.

Creating a Lab Setup for Hands-on Exercises

This book is designed with hands-on exercises in mind, to help you get a grip on the Hadoop framework and its ecosystem components. You are expected to have a desktop or laptop with the following specifications:
Minimum Requirements:

8GB RAM.
150GB free hard disk space.
Intel Core i3 processor or later supporting Intel VT-x and VT-d.
Windows 7 or later/Ubuntu 12.04 or later/Mac OSX or later.
Virtualization Software like VMware Workstation or Oracle
VirtualBox.

Recommended Requirements:

16GB RAM.
250GB free hard disk space.
Intel Core i5 processor or later supporting Intel VT-x and VT-d.
Windows 7 or later/Ubuntu 12.04 or later/Mac OSX or later.
Virtualization Software like VMware Workstation or Oracle
VirtualBox.

We will be using Virtual Machines for our practice. However, feel free to use
multiple machines connected in the network if applicable to you.
I will be using Ubuntu 14.04 LTS Server OS Virtual machine for my hands-on
exercise. You can download these VMs using the link
https://bigdataclassmumbai.com/hadoop-book
Following are the things you will get in the above link:

1. Video - Creating an Ubuntu 64-bit VM and installing Ubuntu 14.04 Server using VMware Workstation.
2. Video - Creating a Development VM for building tools using VMware Workstation.
3. Download link for the readymade VMs.
4. Other updates related to the book.

Summary

Bigdata is a term - not a technology, nor a tool. Bigdata emphasizes the ability of the software, or framework, or architecture to handle the data.
A typical Bigdata project consists of 5 phases. These are Data acquisition, Pre-processing, Transformation, Data view, and Visualization/Analytics.
Apache Hadoop is a Java-based framework that is meant for distributed batch processing and distributed storage of extremely large amounts of data.
Hadoop is faster when it comes to batch processing.
Chapter 2
Apache Hadoop Installation and Deployment

In this chapter, you will learn how to:

Build Hadoop from Scratch.


Set up Hadoop in Standalone Mode.
Set up Hadoop in Pseudo-distributed mode.
Set up Hadoop in Distributed mode.
Understand common configuration properties and their usage.

This chapter contains plenty of useful hands-on exercises. It is recommended


that you download all necessary support files and VMs before performing them
in your system. Let us first learn how to build Hadoop from Scratch.

Building Hadoop from Scratch


You may question the viability of building Hadoop from scratch, and what benefits you might gain from it. When it comes to learning Hadoop, it is best to build Hadoop in your environment before installing the same. The reason is that the Apache Hadoop website provides 32-bit binaries, which are not considered feasible for use in a Bigdata proof of concept (POC). The intent behind building Apache Hadoop is to make it compatible with 64-bit machines and OS. This ensures Hadoop uses the resources effectively (especially your RAM), since in a 32-bit distribution the maximum amount of RAM usable, irrespective of the hardware provided, is 4GB.
Let us proceed with the building process.
Lab 1 Building Apache Hadoop 2.8.0 from Scratch

Step 1: Log in as root.


sudo su
Step 2: Install all dependencies required for building Hadoop
add-apt-repository ppa:george-edison55/cmake-3.x
apt-get update
apt-get install build-essential
apt-get install software-properties-common
apt-get install cmake
apt-get install subversion git
apt-get install zlib1g-dev
apt-get install libssl-dev
apt-get install ant
Step 3: Download and extract Apache Maven (I used apache-maven-3.5.0.tar.gz)
wget http://mirror.fibergrid.in/apache/maven/maven-
3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz
In case the above link fails, you can get the latest link from http://maven.apache.org/download.cgi
tar -xvzf apache-maven-3.5.0-bin.tar.gz
Step 4: Setup the Maven environment and update the same.
vi /etc/profile.d/maven.sh
#Add the following lines in the file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/
export M3_HOME=/home/hadoop/apache-maven-3.5.0
export PATH=$PATH:$JAVA_HOME/bin:$M3_HOME/bin
Step 5: Update the script in OS
source /etc/profile.d/maven.sh
Step 6: Install OpenJDK7 (You can also use Oracle JDK)
apt-get install openjdk-7-jdk
Step 7: Download and Build Protocol Buffers from Google Developers website
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-
2.5.0.tar.gz
tar -xzvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure
make
make install
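If you want to sanity-check the Protocol Buffers installation before moving on (an optional check, not part of the original steps), refresh the shared-library cache and print the compiler version; it should report libprotoc 2.5.0:
ldconfig
protoc --version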
Step 8: Download the Hadoop source tar (I am using hadoop-2.8.0), extract the same in the home folder and start the building process.
export LD_LIBRARY_PATH=/usr/local/lib
wget http://mirror.fibergrid.in/apache/hadoop/common/hadoop-
2.8.0/hadoop-2.8.0-src.tar.gz
In case the above link fails, you can get the latest link from https://archive.apache.org/dist/hadoop/common/
tar -xvzf hadoop-2.8.0-src.tar.gz
cd hadoop-2.8.0-src
mvn package -Pdist,native -DskipTests -Dtar
Once the build is successful, you can find your Hadoop binaries at /home/hadoop/hadoop-2.8.0-src/hadoop-dist/target/hadoop-2.8.0.tar.gz. Ensure that you copy the tar file to the home folder of the machine, as you will use this file for the installation and configuration of Hadoop.
cp -r /home/hadoop/hadoop-2.8.0-src/hadoop-dist/target/hadoop-2.8.0.tar.gz /home/hadoop/.
The following are the things you get when you build Hadoop on a 64-bit machine:

1. The native-library warning message disappears, since Hadoop now has the correct 64-bit libraries.
2. It can use more than 4GB RAM per node.
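As an optional check once this build has been installed (from Lab 2 onwards), Hadoop ships a small utility that reports whether the native 64-bit libraries are being picked up. A minimal sketch, assuming the hadoop command is already on your PATH:
hadoop checknative -a
If the build worked, entries such as hadoop and zlib should be reported as true against the libraries under hadoop2/lib/native.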

Ideal Hadoop Setups


Apache Hadoop can be installed in 3 modes, viz.

1. Standalone Mode (CLI MiniCluster).
2. Pseudo-Distributed Mode (Single Node Cluster).
3. Distributed Mode (Multi Node Cluster).

Let us learn all these configurations step-by-step.

Setting up Hadoop-2.8.0 in Standalone Mode (CLI MiniCluster)


In this section, we will learn how to set up Hadoop in Standalone mode. In Standalone mode, all Hadoop operations run in a single JVM. In Hadoop Standalone mode, your local file system is used as the storage, and a single JVM performs all MR-related operations. Let us see how to set up the CLI MiniCluster.

Lab 2 Setting up Apache Hadoop 2.8.0 in Standalone Mode (CLI MiniCluster Mode)
Step 1: Ensure package lists are updated.
sudo apt-get update
Step 2: Install Java 7. We are going to use OpenJDK 7; however, feel free to use Oracle JDK 7.
sudo apt-get install openjdk-7-jdk
java -version
Step 3: Install SSH.
sudo apt-get install openssh-server
Step 4: Extract the Hadoop binary tar file that you have built and copied in the
home folder.
tar -xvzf hadoop-2.8.0.tar.gz
Step 5: Rename the extracted folder. This is done for convenience.
mv hadoop-2.8.0 hadoop2
Step 6: Setup Environment Variables to identify Hadoop executables,
configurations, and dependencies. You will need to edit the .bashrc file that is
available in the home folder.
vi .bashrc
#Add the below lines at the start of the file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export HADOOP_INSTALL=/home/hadoop/hadoop2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export PATH=$PATH:$HADOOP_CONF_DIR/bin
export PATH=$PATH:$YARN_CONF_DIR/sbin
In case you are using Oracle Java, ensure your JAVA_HOME path is /usr/lib/jvm/java-7-oracle.
Step 7: Refresh and apply the environment variables.
exec bash
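Optionally, you can confirm that the environment variables have been picked up before proceeding (a quick sanity check, not part of the original steps):
echo $HADOOP_INSTALL
which hadoop
hadoop version
The echo should print /home/hadoop/hadoop2, which hadoop should resolve to a path under it, and hadoop version should report 2.8.0.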
Step 8: Inform Hadoop where Java is.
vi /home/hadoop/hadoop2/libexec/hadoop-config.sh
#Add the following line at the start of the file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
Step 9: Setup Hadoop Environment Variables. This is mostly used by HDFS shell scripts that are present in the sbin location of the Hadoop framework.
vi /home/hadoop/hadoop2/etc/hadoop/hadoop-env.sh
#Add the following lines at the start of the file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/
export HADOOP_INSTALL=/home/hadoop/hadoop2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Step 10: Setup YARN variables. This is mostly used by YARN shell scripts
present in the sbin location of the Hadoop framework.
vi /home/hadoop/hadoop2/etc/hadoop/yarn-env.sh
#Add the following lines at the start of the file
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64/
export HADOOP_HOME=/home/hadoop/hadoop2
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
After setting up all environment variables, Hadoop has now been installed in
Standalone mode. Let us now test our installation. To do so, create two PuTTY
sessions. One session is for monitoring JVMs using jps command, and the other
session can be used to execute a MapReduce application in Standalone mode.
Step 11: We will use the example jar file provided by the framework. This is available in the /home/hadoop/hadoop2/share/hadoop/mapreduce folder. Now create two PuTTY sessions. In the first session, type the following command as shown in the snapshot image below:
watch -n 1 jps
The result is that the jps command will run every one second, which enables monitoring of any newly created JVMs. In the second session, type the following command to execute a WordCount program:
hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /home/hadoop/hadoop2/README.txt /home/hadoop/OutputWC
The syntax is as follows:
hadoop jar <jar_file> <prog_name> <input> <output>
where,
jar_file – The location of MR JAR which is to be executed in Hadoop Cluster.
prog_name – The name of class which holds the main method.
input – The file input location that is to be processed.
output – The folder output location where all the output will be stored.
While the program is being executed, observe the first PuTTY session. You will see that a JVM has been invoked. This JVM takes on the entire responsibility of executing your MapReduce program. Officially, this is called the CLI MiniCluster as per the Hadoop documentation, in which all the required operations are done in a single JVM.
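Once the job finishes, you can inspect the output folder on the local filesystem (a quick optional check; in MapReduce the reducer output is conventionally written to a file named part-r-00000):
ls /home/hadoop/OutputWC
head /home/hadoop/OutputWC/part-r-00000
You should see each word of README.txt along with its count.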
Now you have learned how to setup a CLI MiniCluster. One of the most
frequently asked questions while performing this setup is, “Why do we need to
learn this? Nobody uses this kind of setup in production anymore.” You might
also question the benefit that you gain out of this setup. The answer to both
questions is that you need to learn this because, to setup a Multinode cluster, you
first need to setup Hadoop in Standalone mode.
Let us now learn how to install Hadoop in SingleNode cluster or in Pseudo-
distributed mode.

Setting up Hadoop-2.8.0 in Pseudo-Distributed Mode


In this section, we will see how to setup Hadoop in a single-node cluster. Let us
first understand the basic difference between CLI MiniCluster and Single Node
Cluster.
Feature | CLI MiniCluster | Single Node Cluster
Storage | Standalone mode uses a local file system for its storage needs. | In a Single Node cluster, HDFS is present and is installed on and limited to a single node.
Hadoop Daemons | In this setup, there are no Hadoop daemons. Instead, storage operations are managed and controlled by the OS, and processing operations are managed by Java using a single JVM. | In this setup, each daemon has its own JVM, which takes on all the responsibilities when it comes to storage, processing and cluster management. The storage part is managed by the HDFS services, i.e. NameNode and DataNode, whereas the processing part and cluster management are handled by the YARN services, i.e. ResourceManager and NodeManager.
Number of Systems (Nodes) | Single Node | Single Node

Let us now set up the single node cluster.

Lab 3 Setting up Apache Hadoop 2.8.0 in Pseudo-Distributed Mode (Single Node Cluster)

Step 1: Setup your node in Standalone (CLI MiniCluster) mode.

Step 2: Setup password-less SSH. It is required for the Hadoop daemons since they get initialized using SSH. This can be done using the following steps:

a. Generate SSH keys.
ssh-keygen (Press Enter till the command gets completed. No need to fill anything, including the password, since we need to setup a 'password-less' key.)
This command will generate the public key (id_rsa.pub) and the private key (id_rsa) in the home folder under the .ssh directory.
b. Register the public key in the node.
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@hadoopvm
Where:
hadoop is the username
hadoopvm is the Hostname of the machine
After pressing the Enter key, you may get a prompt to register the system to known_hosts. Type 'yes' and press Enter. Once done, you will see a prompt for the password. Enter the user's password (in our case it will be 123456) and press Enter. You will see a feedback line that says the key has been added successfully.
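To verify the password-less setup, you can try logging into the machine over SSH; it should drop you into a shell without asking for a password (type exit to return):
ssh hadoopvm
exit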
Step 3: Setup core-site.xml. This configuration file is responsible for informing
the framework where NameNode daemon will run. It also maintains all Hadoop
core configurations. You can get all configurations with their default values and
descriptions in the link given below:
https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-
common/core-default.xml
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoopvm:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdfsdrive</value>
  </property>
</configuration>
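Note that in Hadoop 2.x the property fs.default.name is deprecated in favour of fs.defaultFS; the old name still works but logs a deprecation warning. If you prefer the newer name, the equivalent entry (same host and port) would be:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoopvm:8020</value>
</property>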
Step 4: Setup mapred-site.xml. This configuration is responsible for maintaining
MapReduce job configurations. Right now we will configure MR jobs to run on
top of YARN. This file is not available by default. However a template file is
provided. Let us configure mapred-site.xml.
cp /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml.template /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
vi /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Step 5: Setup hdfs-site.xml. This configuration file is responsible for configuring
your HDFS.
vi /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Step 6: Setup yarn-site.xml. This configuration file is responsible for configuring
the YARN related parameters.
vi /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Step 7: Create a folder named hdfsdrive inside the home folder. This is done as per the value set in hadoop.tmp.dir in core-site.xml.
mkdir -p /home/hadoop/hdfsdrive
Step 8: Format the NameNode. This step ensures that Hadoop will create a filesystem in the location defined in hadoop.tmp.dir of core-site.xml. In our case, it is /home/hadoop/hdfsdrive. As a best practice, check the Hadoop version prior to formatting. This keeps you aware of the Hadoop version that the system has picked up, which matters when you are connected through PuTTY using SSH, since the environment may get updated while the session is not refreshed. In case the session is not refreshed, it is recommended that you re-login to the SSH shell.
hadoop version
hadoop namenode -format
If you see the success message as shown below in the format output, you can say that your HDFS has been created successfully.
17/06/10 12:47:45 INFO common.Storage: Storage directory
/home/hadoop/hdfsdrive/dfs/name has been successfully
formatted.
17/06/10 12:47:45 INFO namenode.FSImageFormatProtobuf:
Saving image file
/home/hadoop/hdfsdrive/dfs/name/current/fsimage.ckpt_00000000000000000
using no compression 17/06/10 12:47:45 INFO
namenode.FSImageFormatProtobuf: Image file
/home/hadoop/hdfsdrive/dfs/name/current/fsimage.ckpt_00000000000000000
of size 353 bytes saved in 0 seconds.
17/06/10 12:47:45 INFO namenode.NNStorageRetentionManager:
Going to retain 1 images with txid >= 0
17/06/10 12:47:45 INFO util.ExitUtil: Exiting with status 0
17/06/10 12:47:45 INFO namenode.NameNode:
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at
hadoopvm/192.168.149.152
************************************************************/
Step 9: Let us now start the Hadoop daemons. For now, we will learn how to start the daemons manually.
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
mr-jobhistory-daemon.sh start historyserver
Once you are done, type 'jps' to ensure that the services are live and active. You may get an output similar to the one below:
42153 NameNode
42356 ResourceManager
42244 DataNode
42603 NodeManager
42776 Jps
42742 JobHistoryServer
As you can see, each service has its own independent JVM. You now know how to setup Hadoop in Pseudo-distributed mode.
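As an optional smoke test of the single node cluster (assuming the daemons above are running), you can create a home directory in HDFS, upload a file and re-run the WordCount example, this time against HDFS rather than the local filesystem:
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /home/hadoop/hadoop2/README.txt /user/hadoop/
hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /user/hadoop/README.txt /user/hadoop/OutputWC
hdfs dfs -cat /user/hadoop/OutputWC/part-r-00000
You can also browse the NameNode web UI at http://hadoopvm:50070 and the ResourceManager web UI at http://hadoopvm:8088 to confirm the cluster state.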

Setting up Hadoop in Distributed Mode (Multinode Cluster)


In this section, we will see how to setup Hadoop in Distributed mode. The
following diagram depicts the setup we are going to do.
Lab 4 Setting up Apache Hadoop 2.8.0 in Distributed Mode (Multinode
Cluster)

Problem Statement:
We have 4 nodes and we need to setup a Multinode Cluster as shown in the
above network diagram.
Solution
Understand your setup and fill in the table given below. Ensure that in Step 1
(under hosts file configuration) you add the IP address of your machine.
Node Name | Desired Hostname | IP Address for Example Purpose | Your Machine/VM's IP Address
Node1 | node1.mylabs.com | 192.168.1.1 |
Node2 | node2.mylabs.com | 192.168.1.2 |
Node3 | node3.mylabs.com | 192.168.1.3 |
Node4 | node4.mylabs.com | 192.168.1.4 |

Step 1: Setup network Hostname and host file configuration on all 4 machines.
In a real production setup, we use a dedicated DNS server for Hostnames.
However, for our lab setup, we will setup local host files as resolvers. Hadoop
ideally recommends working with hostnames rather than IP addresses for host resolution.
Perform the following steps in each machine.
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname as mentioned in the previous step.
#For example, for node1 it will be:
node1.mylabs.com

sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the following lines in all 4 machines.
#This file holds the information of all machines which need to be resolved.
#Each machine will have entries of all 4 machines that are participating in the cluster installation.
192.168.1.1 node1.mylabs.com
192.168.1.2 node2.mylabs.com
192.168.1.3 node3.mylabs.com
192.168.1.4 node4.mylabs.com

Once the configuration is done, you will need to restart the machines for the hostname changes to take effect. To restart your machine, type the following command:
sudo init 6
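After the reboot, it is worth confirming on each node that the hostname and host resolution are in place (a quick optional check, not part of the original steps):
hostname
ping -c 2 node1.mylabs.com
ping -c 2 node2.mylabs.com
The hostname command should print the fully qualified name you set, and the pings should resolve to the IP addresses listed in /etc/hosts.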
Step 2: Setup password-less SSH between your NameNode system and the other systems. In our example setup, we need a password-less SSH configuration between node1.mylabs.com and the other nodes participating in the cluster. This is done so that node1 can contact the other nodes to invoke the services. The reason for this kind of configuration is that the NameNode system is the single point of contact for the users and administrators.
Perform the following commands on node1:

a. Generate SSH keys.
ssh-keygen (Press Enter till the command gets completed. You do not need to fill anything, including the password, since we need to setup a 'password-less' key.)
This command will generate the public key (id_rsa.pub) and the private key (id_rsa) in the home folder under the .ssh directory.
b. Register the public key on node1, node2, node3, and node4.
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node1.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node2.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node3.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node4.mylabs.com
After pressing Enter, you may get a prompt to register the system to known_hosts. Type 'yes' and press Enter. Once that's done, you may see a prompt for the password. Enter the user's password (in our case it will be 123456) and press Enter. You will see a feedback line stating that the key has been added successfully.
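To verify the password-less setup from node1, you can loop over all four hosts; each should print its hostname without prompting for a password (an optional check, assuming the hadoop user exists on every node):
for host in node1.mylabs.com node2.mylabs.com node3.mylabs.com node4.mylabs.com; do ssh hadoop@$host hostname; done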
Step 3: Install Hadoop 2.8.0 in Standalone mode in all 4 nodes. Refer Setting up
Hadoop-2.8.0 in Standalone Mode (CLI MiniCluster) in case you require
additional help.
Step 4: Setup core-site.xml on Node1 and copy the same to the other nodes that are participating in the cluster. Add the following configuration in core-site.xml:
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1.mylabs.com:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdfsdrive</value>
  </property>
</configuration>
In the above configuration, we tell the cluster where the Namenode resides (node1.mylabs.com) and on which port it is listening (9000). Once done, save the file and copy it to the other nodes. This step is mandatory since this file is responsible for informing the cluster of the network locality of the Namenode system. To copy it to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 5: Setup hdfs-site.xml on Node1 to configure the distributed storage (HDFS) configuration, and copy the same to the other nodes that are participating in the cluster.
vi /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoopvm:50090</value>
  </property>
</configuration>
In the above configuration, we have informed the cluster of how many replicas of each block must be maintained. We will learn more about HDFS and replication in the coming chapters. We are also informing the cluster about the network locality of the SecondaryNamenode system. Now, save the file and copy it to the other nodes. To copy the file to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Step 6: Setup mapred-site.xml on Node1 and copy the same to all other nodes that are participating in the cluster.
cp /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml.template /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
vi /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
In the above configuration, we are informing the cluster that all MapReduce jobs will run on top of YARN. Let us now copy this file to the remote systems participating in the cluster.
scp -r /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
Step 7: Setup yarn-site.xml. This configuration file is responsible for configuring your YARN related parameters.
vi /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>node1.mylabs.com:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>node1.mylabs.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>node1.mylabs.com:8040</value>
  </property>
</configuration>
In the above configuration, we are informing the NodeManager to use the mapreduce_shuffle auxiliary service during any MR-related application executions. Since we are creating a Multinode cluster, it is mandatory to specify the resource-tracker RPC address, which will be used by the NodeManager to communicate with the ResourceManager, using the yarn.resourcemanager.resource-tracker.address property.
We also need to specify the RPC address and port of the ResourceManager's Scheduler, which is used by the ApplicationMaster for any communication, using yarn.resourcemanager.scheduler.address.
Finally, we specify the ResourceManager RPC address and port, which can be used by external client applications to contact the ResourceManager. Let us now copy this file to the remote systems participating in the cluster.
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
Step 8: Setup the slaves configuration in the cluster. You need to perform this step only on Node1. This configuration informs the cluster about which nodes will host the slave services (DataNode and NodeManager). Usually, this configuration is placed only on the NameNode machine.
vi /home/hadoop/hadoop2/etc/hadoop/slaves
#Delete localhost and replace the same with node3 and node4 as shown:
node3.mylabs.com
node4.mylabs.com
Step 9: Format the NameNode. You need to perform this step only on Node1 (whichever node is going to act as the primary NameNode). This step ensures that Hadoop will create a filesystem in the location defined in hadoop.tmp.dir of core-site.xml. In our case, it is /home/hadoop/hdfsdrive.
As a best practice, check the Hadoop version prior to formatting. This keeps you aware of the Hadoop version that the system has picked up, which matters when you are connected through PuTTY using SSH, since sometimes the environment may get updated while the session is not refreshed. In case the session does not get refreshed, it is recommended that you re-login to the SSH shell.
hadoop version
hadoop namenode -format
If you see the success message as shown below in the format output, you can say that your HDFS has been created successfully.
17/06/10 12:47:45 INFO common.Storage: Storage directory
/home/hadoop/hdfsdrive/dfs/name has been successfully
formatted.
17/06/10 12:47:45 INFO namenode.FSImageFormatProtobuf:
Saving image file
/home/hadoop/hdfsdrive/dfs/name/current/fsimage.ckpt_00000000000000000
using no compression 17/06/10 12:47:45 INFO
namenode.FSImageFormatProtobuf: Image file
/home/hadoop/hdfsdrive/dfs/name/current/fsimage.ckpt_00000000000000000
of size 353 bytes saved in 0 seconds.
17/06/10 12:47:45 INFO namenode.NNStorageRetentionManager:
Going to retain 1 images with txid >= 0
17/06/10 12:47:45 INFO util.ExitUtil: Exiting with status 0
17/06/10 12:47:45 INFO namenode.NameNode:
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at
hadoopvm/192.168.149.152
************************************************************/
Step 10: Now let us start the Hadoop daemons. Type the following commands on Node1:
start-dfs.sh (starts the HDFS services, i.e. NameNode, DataNode and SecondaryNamenode)
start-yarn.sh (starts the YARN services, i.e. ResourceManager and NodeManagers)
mr-jobhistory-daemon.sh start historyserver (starts the YARN MR HistoryServer)
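Once the daemons are up, you can verify the cluster from node1 (an optional check, not part of the original steps). jps on node1 should show the master daemons (NameNode, ResourceManager, JobHistoryServer), while node3 and node4 should show DataNode and NodeManager; hdfs dfsadmin -report confirms that both DataNodes have registered with the NameNode. If jps is not found over a non-interactive SSH session, run it directly on each node instead.
jps
ssh hadoop@node3.mylabs.com jps
ssh hadoop@node4.mylabs.com jps
hdfs dfsadmin -report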

Summary

It is usually recommended that you build Apache Hadoop from scratch to install Hadoop on a 64-bit machine, since the default binaries offered by the ASF are 32-bit binaries.
In Standalone mode, only one JVM is responsible for the entire execution operation.
In Standalone mode, there is no HDFS filesystem available. Instead, the local filesystem is used for storage needs.
Pseudo-distributed mode setup is generally used by developers to test and debug their applications before deploying them in test-bed and production setups.
Chapter 3
Demystifying HDFS

In this chapter we will learn:

What is HDFS?
HDFS Architecture.
HDFS Write Operation.
HDFS Read Operation.
Common FileSystem Shell Commands.
Replication in HDFS.
Setting up Block Size in HDFS.
Adding Nodes in clusters.
Decommissioning Nodes in clusters.
Whitelisting DataNodes.
Safemode in Hadoop.
Implementing NFS Gateway in HDFS.
Configuring DataNode heartbeat interval.
Setting up Name Quotas and Space Quota in HDFS.
Enabling Recycle Bin for Hadoop HDFS (Trash).

What is HDFS?
HDFS stands for Hadoop Distributed File System. It is the storage part of
Hadoop that allows Bigdata files to be stored across the cluster in a distributed
and reliable manner. HDFS allows the distributed application to access the data
for distributed processing and analysis quickly and reliably. HDFS is a non-
POSIX compliant FileSystem since it is an append-only FileSystem and does not
honor POSIX durability semantics.
Let us try and understand what HDFS is all about. What does it do? HDFS has
been made to handle large files. HDFS can handle small files but ideally HDFS
is designed to handle large files in the most optimal manner such that the file can
be distributed, broken down, and stored in an entire cluster. Bigdata can be data
from sensors, satellite data, user activity tracking, webserver logs and many
more. HDFS stores the data by breaking it into smaller blocks. These blocks are
the logical units of data that are stored in HDFS. In this case, the default block
size is 64MB in Generation1 Hadoop and 128MB in Generation2 Hadoop. The benefit of splitting large files into blocks is that multiple systems in the cluster maintain the blocks, so multiple systems can participate during processing, thereby introducing parallel processing in the cluster. This makes your operation faster compared to an operation running on a single machine.
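As a quick worked example (using the Generation2 default of 128MB), a 500MB file would be stored as four blocks: three full 128MB blocks plus one 116MB block holding the remainder. Each of those blocks can then live on a different DataNode and be processed in parallel.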
HDFS internally stores the data in the form of blocks that are stored across
several nodes. It can store multiple backup copies of individual blocks to
maintain high availability of data stored in HDFS and can also tolerate failures
(when one of the computers in the cluster goes down/stops functioning). The key
factor of HDFS is that you can achieve the same results using commodity
hardware. There is no need to work on specialized hardware.
Now that we know HDFS at a high level, let us explore the corresponding architecture.

HDFS Architecture
The following architecture depicts a typical HDFS setup in a 5 node cluster:

As shown above, there are 3 daemons responsible for keeping HDFS up. These are:

a. NameNode
b. DataNode
c. SecondaryNameNode

The cluster shown above has a total HDFS space of 1500GB. This is because only the systems that host the DataNode service are considered by HDFS for its storage operations. Let us now learn the role of each daemon in detail.
The NameNode daemon is the master daemon for HDFS storage. The NameNode is the single point of contact for distributed applications and clients. It is the NameNode that maintains the index information of the files that are stored in HDFS. When we say index information, we refer to namespace information and block information. The NameNode also maintains the mapping between files and blocks. A file stored in HDFS is a logical entity made up of a set of blocks (where blocks are the actual physical entities maintained by the DataNodes). The NameNode maintains all the index information in two critical files called fsimage and edits. The edits file is responsible for maintaining the index information of the on-going session in main memory, which is snapshotted to persistent storage after the NameNode daemon starts, whereas fsimage maintains the information of the entire blockpool in persistent storage when the NameNode daemon starts.
DataNode daemon is the slave daemon for HDFS storage. The DataNode
daemon is responsible for performing read/write operations on data that is stored
in the HDFS. It is the DataNode daemon that maintains the blocks. It also
performs the replication asynchronously whenever applicable. The default
replication factor of HDFS is 3, which means that 3 copies of each block will be stored in the HDFS layer, if applicable. It is the DataNode that interacts with the application or client when it comes to PULL or PUSH operations on blocks.
The SecondaryNameNode daemon is also called the Checkpoint Node in the Hadoop community. As the name suggests, this daemon is used to maintain a checkpoint of the HDFS metadata. Let us try to understand the need for the SecondaryNameNode before getting into further detail. As discussed above, the NameNode maintains the HDFS metadata in two files, namely fsimage and edits, wherein edits maintains the changes that happen in the FileSystem during an on-going session. Now, the edits logs that are generated will be applied to fsimage only when we restart the NameNode service or manually flush the same. But in reality, restarting the NameNode in a production cluster is very rare. This results in a high volume of data in the edits logs, which becomes challenging to manage. If there is a crash, the edits data can be lost, resulting in data loss and orphaned blocks. To overcome this issue, the SecondaryNameNode comes in handy. It takes over the responsibility of merging edits with fsimage. The whole purpose of the SecondaryNameNode is to perform a checkpoint with the NameNode and to ensure that the entire snapshot of the HDFS is safely maintained. However, this is not a replacement or failover solution. In Generation1, we rely on the SecondaryNameNode daemon for recovery of HDFS using manual techniques; this is, however, not considered a feasible solution for NameNode failures. Thus, in Generation2, Hadoop introduced the concept of a Standby NameNode, which we will see later.
Since we have now understood the daemons and their roles, let us understand how HDFS performs a write operation.

HDFS Write Operation


Let us take a scenario to understand HDFS write operation as shown in the

diagram:
As shown above, we have a client machine that wants to upload the file named file1.txt from this machine to HDFS. The size of the file is 120MB. Let us take the scenario shown below:
Total HDFS space: 1500GB
Available HDFS space: 1500GB
Block Size set for HDFS cluster: 64MB
Replication Factor: 2
Let us understand this step-by-step process:
Step 1:
Client will initiate the process of uploading data using the HDFS Client. The
HDFS client will logically figure out how many blocks should be created for the
input file. In our case, the HDFS client will break the file into two blocks (based
on the block size of 64MB). For each block, the HDFS client will contact
NameNode for the metadata and DataNode information where the data is to be
stored. The NameNode will receive the request to create a file and will initially check for the availability of space to store the file considering the replication factor. It also checks whether the file already exists and whether any quotas apply. In this case, the file size is 120MB and the replication factor is 2. Thus, the required space to perform this operation is 240MB. If this space is available, only then will the NameNode initiate the second step, else it will throw an IOException.
Step 2:

NameNode, based on the information received from the client, will start generating the metadata and block information that includes:

1. the name of each block, which is a big integer value,
2. the DataNode hostnames/IPs that will own that specific block, and
3. the DataNodes that will hold its replica blocks.

In our example, there will exist two blocks, namely B1 and B2, where B1 will be stored on DN1 and DN2 and B2 will be stored on DN2 and DN3, considering the replication factor of the cluster as 2. Once the metadata is generated, the same shall be given to the client as a response.
Step 3:
The client, based on the metadata received, will initiate the block transfer to the destined DataNode. The client will also ensure that the DataNode initiates the replication process once the block is copied onto it. In our case, the client machine will initiate the transfer of block B1 to DataNode DN1 and will also inform DN1 to initiate the replication process to DN2, as shown below:
This process is called the DataNode pipeline, which takes the accountability of replication. Once DN2 receives the data, a FINISH flag will be sent to DN1. From DN1, the FINISH flag will be sent to both the NameNode and the client. The same process happens for block B2 also.
Once the FINISH for each block is received by the client, the client will then
send the FINISH flag to the NameNode. Also at the same time, DataNodes will
send the block report to NameNode confirming that all blocks are placed as per
the specification. Let us now understand read operation in HDFS.

HDFS Read Operation


Let us take a scenario to understand how HDFS read operation works. Assume
that the client wants to read/download file1.txt that resides on HDFS. Let us
understand the process step-by-step.

Step 1: Client will initiate a request to NameNode to download file1.txt.


NameNode will then check whether this file exists and will also ensure that the
DataNodes associated with the file are live and active.
Step 2: Once NameNode ensures that the DataNodes are alive, it will share the
metadata containing the block information onto the client machine.
Step 3: Client will read the metadata and based on it, the client will connect to
the respective DataNodes that hold the blocks associated with the file.
Step 4: Once the blocks are downloaded, they are then merged by the client.
The merged final file will be stored in the intended destination location.
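To relate this read/write flow to a real file, you can ask the NameNode for the block layout of any file using fsck. A quick check, assuming file1.txt was uploaded to /data (the path is only an example):
hdfs fsck /data/file1.txt -files -blocks -locations
The output shows each block, its size, its replication and the DataNodes holding the replicas.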
Let us go ahead and learn some of the common HDFS related operations.
Lab 5 Working with HDFS Filesystem Shell Commands

Common FileSystem Shell Commands:

-mkdir: This command is used to create a new directory in HDFS.
e.g. hdfs dfs -mkdir /data

-ls: This command is used to list the directory specified.
e.g. hdfs dfs -ls /
In case you want to see the entire structure up to the last level, use -R:
e.g. hdfs dfs -ls -R /

-copyFromLocal: This command is used to upload data from the local FileSystem to HDFS.
e.g. hdfs dfs -copyFromLocal /home/hadoop/sample /data/sample
Please note that the /data directory in HDFS must be created before uploading the data. copyFromLocal does not support directory creation.

-put: This command is used to upload data from the local FileSystem to HDFS. Ideally, there exists no difference between -copyFromLocal and -put.
e.g. hdfs dfs -put /home/hadoop/sample /data/sample

-copyToLocal: This command is used to download data from HDFS back to the local FileSystem.
e.g. hdfs dfs -copyToLocal /data/sample /home/hadoop/sampleDownload

-get: This command is similar to -copyToLocal.
e.g. hdfs dfs -get /data/sample /home/hadoop/sampleDownload

-cat: This command is used to view the contents of a file residing in HDFS.
e.g. hdfs dfs -cat /data/sample

-checksum: This command returns the checksum information of a file residing in HDFS.
e.g. hdfs dfs -checksum /data/sample

-chmod: This command changes the file permissions. You can use -R to recursively apply to all folders.
e.g. hdfs dfs -chmod 777 /data/emp

-chown: This command changes the owner and group of a file.
e.g. hdfs dfs -chown hdfs:hadoop /data/sample
Before initiating this command, ensure a user named 'hdfs' belonging to the group 'hadoop' exists in Linux.

-cp: This command is used to copy files in HDFS from one location to another.
e.g. hdfs dfs -mkdir /data1
hdfs dfs -cp /data/sample /data1/sample

-rm: This command is used to delete a file or folder from HDFS.
e.g. hdfs dfs -rm /data/sample
hdfs dfs -rm -r /data1

Working with HDFS Replication


Let us suppose that you want to increase the replication factor of your existing 5
node cluster (that you have created earlier) from 1 to 2. We can do the same by
modifying hdfs-site.xml without even introducing downtime in the cluster. Let
us try to achieve the same.
Lab 6 Setting up Replication Factor of an Existing Cluster

To continue with this demo, please create a 5 node cluster as explained in the
previous chapter.
Step 1: Open hdfs-site.xml in node1.mylabs.com and add the following property
within the configuration tag.
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Step 2: Copy the modified hdfs-site.xml file in the remaining nodes.
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Step 3: Try uploading a file
hdfs dfs -mkdir /testdata
hdfs dfs -copyFromLocal /home/hadoop/sample /testdata/sample
hdfs dfs -stat %r /testdata/sample
You will see that the file sample is replicated in 2 DataNodes. You can also
check the same in WebUI.
Please note that while using this method, we have set the replication factor for all future uploads. All existing files will still hold the replication factor that was set at upload time. Let's take an example to change the replication factor of an existing file.
hdfs dfs -mkdir /data1
hdfs dfs -put /home/hadoop/sample /data1/sample
hdfs dfs -stat %r /data1/sample
The above command will upload the file with RF=2. Let us assume you want to change the RF to 1. You can do it using the -setrep command as shown below:
hdfs dfs -setrep 1 /data1/sample
hdfs dfs -stat %r /data1/sample
Lab 7 Dynamically Setting up Replication Factor during Specific File
Upload

You can also set the replication factor during the upload. This can be done using the following command:
hdfs dfs -D dfs.replication=1 -copyFromLocal /home/hadoop/sample /testdata/sample2

Setting up Block Size in Existing HDFS Cluster


In this section, we will learn how to setup block size in real time.
Lab 8 Setting up Block Size in Existing Hadoop Cluster

In this scenario, we will setup the block size as 64MB. We need to setup hdfs-site.xml with the following property:
<property>
<name>dfs.blocksize</name>
<value>64m</value>
</property>
After setting it up, try uploading the file. You will observe that the block size
set for this file will be 64MB. You can represent the value suffix as k (kilo), m
(mega), g (giga), t (tera), p (peta), e (exa) to specify the sizes. If you don’t
specify any suffix, the same will be considered in bytes.
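You can also override the block size for a single upload and then confirm it, along the same lines as the replication example earlier. A small sketch (the paths are just examples):
hdfs dfs -D dfs.blocksize=134217728 -put /home/hadoop/sample /testdata/sample128m
hdfs dfs -stat %o /testdata/sample128m
Here 134217728 bytes is 128MB, and the %o format of -stat prints the block size of the file.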

Adding Nodes in the Existing Cluster in Realtime


Hadoop ideally supports Horizontal Scalability. Here, we will see how to add a
node in the existing cluster. Ideally, the reason why we would do this is:

a. To increase storage space


b. To increase processing power

In this section, we will see how to add the 5th node in the existing 4 node
Hadoop cluster created as per the previous chapter. This cluster is running live
and active. We will add a new node named node5.mylabs.com.
Lab 9 Adding Nodes in an Existing Hadoop Cluster without Cluster
Downtime

Perform the following steps to achieve the same.


Step 1: Setup network hostname and host file configuration on the 5th machine.
Perform the following steps on node5.mylabs.com:
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname. For node5 it will be
node5.mylabs.com
sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the entries of all 5 machines
#participating in the cluster. This file holds the information of all
#machines which need to be resolved from node5.

Once the configuration is done, you will need to restart the machine for
hostname changes to take effect. To restart your machine, you can type the
following command: sudo init 6
You need to also make an entry of the new system in the host file of
node1.mylabs.com, node2.mylabs.com, node3.mylabs.com, and
node4.mylabs.com. Please note that you do not need to restart, since we are only
modifying the host file and not touching the hostname file. The intent of this step
is to resolve node5 by all machines participating in the cluster.
Step 2: Setup SSH password-less setup between your NameNode system and the
node5.mylabs.com node. In our example setup, we need to setup password-less
configuration between node1.mylabs.com and node5.mylabs.com. This is done
so that the node1 can contact the node5 to invoke hadoop services. The reason
for this kind of configuration is that the NameNode system is the single point of
contact for all the users and administrators.
Perform the following commands on node1:
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node5.mylabs.com
Step 3: Install Hadoop 2.8.0 in Standalone mode in node5.mylabs.com. Refer to Setting up Hadoop-2.8.0 in Standalone Mode (CLI MiniCluster) in the previous chapter in case you require additional help.
Step 4: Copy core-site.xml, mapred-site.xml, hdfs-site.xml, and yarn-site.xml
from node1.mylabs.com to node5.mylabs.com.
On node1.mylabs.com perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
Step 5: Setup the Slave configuration in the cluster. We need to perform this step only in Node1. This configuration helps in informing the cluster which nodes will have slave services (DataNode and NodeManager). Usually, this configuration is placed only in the NameNode machine. You need to just append node5.mylabs.com on the last line of the file.
vi /home/hadoop/hadoop2/etc/hadoop/slaves
#Append node5.mylabs.com at the end of the file so that it reads as shown
node3.mylabs.com
node4.mylabs.com
node5.mylabs.com
Step 6: Start DataNode and NodeManager service in node5.mylabs.com
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager
Verify whether the services are running or not using the 'jps' command. Also, get the block report to ensure node5 is now successfully a part of the cluster, using the following command:
hdfs dfsadmin -report
Step 7: Since we have already scaled the cluster, we will need to balance the load of the cluster. This can be achieved using the following command on node1.mylabs.com:
start-balancer.sh
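By default, the balancer considers a DataNode balanced when its utilization is within 10 percent of the cluster average; you can tighten this with the -threshold option if needed, for example:
start-balancer.sh -threshold 5
This rebalances until every DataNode is within 5 percent of the cluster average utilization.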
Decommissioning the Datanodes in Existing Cluster
Decommissioning is a process by which we gracefully remove the DataNode
from the running cluster without affecting any storage and processing activities
triggered by Hadoop or similar applications. The responsibility of that specific
DataNode which is planned to be decommissioned will be assigned to other
DataNodes and the NameNode will keep a track of it and update the same in the
metadata.
Lab 10 Decommissioning Datanode in Existing Hadoop Cluster Without Data Loss and Downtime in the Cluster

In this section, we will see how to decommission the 4th node in the existing 5 node Hadoop cluster.

Step 1: Create a file in the NameNode system i.e. node1.mylabs.com that holds
the hostname of the node which is to be decommissioned. You can set any name
to this file. In this case, let us create a file named ‘remove.txt’.
vi /home/hadoop/remove.txt
node4.mylabs.com
Step 2: Open hdfs-site.xml and add the following property within the
configuration tag.
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/remove.txt</value>
</property>
Step 3: Refresh the cluster to apply the changes in real time.
hdfs dfsadmin -refreshNodes
You will observe that the changes will be applied and node4.mylabs.com will
change the state from NORMAL to DECOMMISSIONING IN PROGRESS.
Once decommissioning is complete, the state will change to
DECOMMISSIONED. You can track the status either using WebUI of
the NameNode or by using the following command:
hdfs dfsadmin -report
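If you prefer the shell over the WebUI, one simple sketch using grep is to watch only node4's section of the report:
hdfs dfsadmin -report | grep -A 3 "node4.mylabs.com"
Adjust the -A count if the Decommission Status line of that node appears further down in the report output.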

Whitelisting DataNodes in the Cluster


In this section, we will learn how to whitelist hosts to become DataNodes in the cluster. The intention behind learning this technique is to prevent any random DataNode from joining the cluster. Let us see how to achieve the same.
Lab 11 Whitelisting Datanodes in a Hadoop Cluster

Step 1: Create a file in the NameNode system i.e. node1.mylabs.com that holds
the set of Hostnames of the nodes that are to be whitelisted. You can set any
name to this file. In this case, let us create a file named ‘allow.txt’.
vi /home/hadoop/allow.txt
node3.mylabs.com
node4.mylabs.com
node5.mylabs.com
Step 2: Open hdfs-site.xml and add the following property:
<property>
<name>dfs.hosts</name>
<value>/home/hadoop/allow.txt</value>
</property>
Step 3: Refresh the cluster to apply the changes in real time.
hdfs dfsadmin -refreshNodes
Understanding Safemode in Hadoop
Safemode is a representation that Hadoop is in read-only mode. It is ideally
meant for transitioning the cluster from production to maintenance mode. In this
mode, HDFS doesn’t allow any write-related and processing operations.
However, read is possible. When we start Hadoop services, HDFS service goes
in safemode to perform the following activities:

a. Flush edits to fsimage and load fsimage in-memory.


b. Get the block report from each DataNode and check the compliance of
metadata.
c. Report and update replication parameters (under-replicated and over-
replicated blocks).

However, a user can bring the Hadoop cluster in safemode for the following
reasons:
a. Check-pointing metadata.
b. Removing orphaned blocks.
c. Removing corrupted blocks and its metadata.

Lab 12 Working with Safemode (Maintenance Mode) in Hadoop

Let us see how to control safemode activity.


To check whether the cluster is in safemode or not:
hdfs dfsadmin -safemode get
To switch the cluster from production to maintenance mode:
hdfs dfsadmin -safemode enter
To switch the cluster from maintenance to production mode:
hdfs dfsadmin -safemode leave
To check whether the HDFS is out of safe mode or not:
hdfs dfsadmin -safemode wait
How to do checkpointing manually?
Lab 13 Checkpointing Metadata Manually

Let's learn how to perform checkpointing:


Step 1: Ensure the Hadoop cluster is in safemode.
hdfs dfsadmin -safemode enter
Step 2: Commit the edit logs to fsimage.
hdfs dfsadmin -saveNamespace
Step 3: Bring the cluster back to production mode.
hdfs dfsadmin -safemode leave
Enabling NFS Gateway for HDFS Operations
When it comes to HDFS I/O operations, we either use the HDFS API in a Java program or hdfs dfs shell commands. If you want to interact with the HDFS FileSystem in the same way you interact with local FileSystems, using the same Linux commands for file operations, you can use the NFS Gateway, available from Hadoop-2.4.1 onwards. Using the NFS gateway, we can mount the HDFS drive in the local FileSystem and interact with it in the same manner as the local FileSystem.
Following are the operations that you can perform when you use the NFS
gateway:

Browse HDFS FileSystem like local FileSystem.


Copy and paste files from the local FileSystem to HDFS and vice versa.
Stream data directly to HDFS.

Lab 14 Setting up NFS Gateway to Access HDFS

When it comes to the NFS gateway in our Hadoop cluster, we can make our edge
server, NameNode server or any DataNode, an NFS gateway. However, in
production, we prefer to make our edge server the NFS gateway. Let us see how
to configure it.
Step 1: Stop all Hadoop services.
stop-all.sh
Step 2: Update/Add core-site.xml with the following parameters. If it already
exists, don’t make any changes.
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>hadoopvm</value>
</property>
Please note that in the above property names, hadoop is the username that has access to HDFS operations, as highlighted below:
hadoop.proxyuser.hadoop.groups
hadoop.proxyuser.hadoop.hosts
Step 3:
Update/Add hdfs-site.xml with the following parameters. If it already exists,
don’t make any changes.
<property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
<property>
<name>dfs.nfs.rtmax</name>
<value>1048576</value>
</property>
<property>
<name>dfs.nfs.wtmax</name>
<value>65536</value>
</property>
<property>
<name>dfs.nfs.exports.allowed.hosts</name>
<value>* rw</value>
</property>
Step 4: Start Portmap as a root user/mode and verify the same.
sudo -u root hadoop2/sbin/hadoop-daemon.sh start portmap
sudo jps
Step 5: Start NFS3 Gateway protocol as a normal user.
hadoop-daemon.sh start nfs3
Step 6: Create a mount directory and mount HDFS on that directory.
sudo mkdir /mnt/hdfs-cluster
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync
hadoopvm:/ /mnt/hdfs-cluster/
You have completed the process. Now let us test the setup.
Step 1: Traverse the mounted directory and observe the contents.
ls /mnt/hdfs-cluster
You will see that all directory structures of your current HDFS will be
reflected here. This shows that your setup is successful.
Step 2: Let's create a folder named prashantnfs in the /(root) location of HDFS using NFS.
cd /mnt/hdfs-cluster
mkdir prashantnfs
Observe HDFS WebUI. You will see that the folder will be reflected in HDFS.
Step 3: Now let us copy one file and delete the same.
cp /home/hadoop/sample ./prashantnfs/
hadoop fs -ls /prashantnfs (You will see that the file is copied in HDFS)
rm -rf ./prashantnfs/sample
hadoop fs -ls /prashantnfs (You will see that the file is deleted in HDFS)
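When you are done with the gateway, you can unmount and stop the services in the reverse order of the setup (a minimal sketch, mirroring the start commands used above):
sudo umount /mnt/hdfs-cluster
hadoop-daemon.sh stop nfs3
sudo -u root hadoop2/sbin/hadoop-daemon.sh stop portmap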
Configuring DataNode Heartbeat Interval
Let us first understand the idea behind the heartbeat interval. DataNode notifies
the NameNode about its presence and its state through a heartbeat. The default
heartbeat interval is 3 seconds. However, the same is configurable. In case the
DataNode fails to send the heartbeat within 3 seconds frame, it does not mean
that the NameNode will declare the DataNode DEAD. In fact, the NameNode
takes a decision based on heartbeat re-check interval (default value: 10 minutes).
Thus, if any DataNode does not send any heartbeat within 10 minutes and 3
seconds, then the NameNode will mark that specific DataNode as DEAD. In real
production, sometimes we may need to change these configurations as per the
request of our Architect. The reasons can be many. Some of which are
application timeouts, limited network bandwidth and so on.
Lab 15 Setting up Datanode Heartbeat Interval

Let us learn how to set the heartbeat interval of 5 seconds and a heartbeat-
recheck interval of 15 minutes.
Step 1: Append the following properties in hdfs-site.xml of the NameNode system i.e. node1.mylabs.com
<property>
<name>dfs.heartbeat.interval</name>
<value>5</value>
</property>
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<value>900000</value>
</property>
In the above configuration, dfs.heartbeat.interval accepts values in seconds, whereas dfs.namenode.heartbeat.recheck-interval accepts values in milliseconds.
Step 2: Copy hdfs-site.xml in all the nodes participating in the cluster (Assuming
you are working on 5 node cluster).
On node1.mylabs.com perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
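These heartbeat settings are read when the HDFS daemons start, so after copying the file you will also need to restart the HDFS services for the new values to take effect (assuming a short maintenance window is acceptable, using the same scripts as in the earlier chapters):
stop-dfs.sh
start-dfs.sh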
Setting up Quotas
In Hadoop, we can set two kinds of quotas:

a. File Quota
b. Space Quota

Let us see how to work with it.

a) Setting File Count Quota


In HDFS, this quota is officially called the Name Quota. The Name Quota limits
the count of files and directories. When it exceeds the count, the creation of files
and directories fails. Let us see an example of the same.
Lab 16 Setting up File Quota in HDFS

Step 1: Create a sample directory


hdfs dfs -mkdir /prashant
Step 2: Let us set the name quota as 3. Please note that the directory itself is also counted. So if the requirement from the user/developer is 3 items, then you must set the quota as 4, considering that the 4th one is the directory on which you are applying the quota.
hdfs dfsadmin -setQuota 3 /prashant
To check whether quota is applied or not type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:

Let us put the above data in a table so that it is easier to understand each
parameter.

By seeing the above table we can derive the formula,


File Quota = Remaining Quota + Directory Count + File Count
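To see the quota in action, you can try creating items under /prashant. Since the directory itself consumes one unit of the quota of 3, the third creation attempt below should fail with an error similar to NSQuotaExceededException (a small sketch):
hdfs dfs -mkdir /prashant/d1
hdfs dfs -mkdir /prashant/d2
hdfs dfs -mkdir /prashant/d3
The last command is expected to fail, because the name quota of 3 is already consumed by /prashant, d1 and d2.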
b) Clearing File Count Quota

Lab 17 Removing File Quota in HDFS

In case you want to remove the file quota, you can use the following command:
hadoop dfsadmin -clrQuota /prashant
To check whether the quota is removed or not, type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:
c) Setting Space Quota
Lab 18 Setting up Space Quota in HDFS

In order to limit the user with respect to the space consumption, you can use
Space Quota. Let us see an example to perform the Space Quota in the same
directory.
hadoop dfsadmin -setSpaceQuota 20M /prashant
The above command will set the quota of 20MB. Once it exceeds, it will give
the user an exception.
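Note that the space quota is checked against replicated bytes, so with a replication factor of 2 a 20MB quota effectively allows only about 10MB of user data; HDFS also reserves a full block (times the replication factor) when a write starts, so writes can fail even earlier if the quota is smaller than that reservation. A quick way to trigger the exception (the file path is just an example):
hdfs dfs -put /home/hadoop/sample /prashant/sample
This upload is expected to fail with a DSQuotaExceededException once the replicated or reserved size exceeds the 20MB quota.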
To check whether the quota is applied or not, type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:

Let us put the above data in a table so that it is easier to understand each
parameter.

d) Clearing Space Quota


Lab 19 Removing Space Quota in HDFS

To clear/reset the space limitation you can perform the following command:
hadoop dfsadmin -clrSpaceQuota /prashant
To check whether the quota is removed or not, type the following command:
hdfs dfs -count -q /prashant
The following output will be displayed:
none inf none inf 1 2 36 /prashant
Enabling Recycle Bin in Hadoop HDFS (Trash Configuration)
As the name suggests, in this section, we will learn how to setup trash
configuration. This is beneficial when it comes to recovering the data that has
been deleted/archived accidentally.
Lab 20 Configuring Trash Interval in HDFS and Recovering Data from
Trash

Let us see how to set this up. The assumption here is that you are performing this
demo on a MultiNode cluster. However, you can do the same on any kind of
cluster configuration.
Step 1: Configure core-site.xml of the NameNode machine. For me, it is node1.mylabs.com. Append the following property in the configuration file:
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
Here, the value set is in minutes. So in this configuration we set the trash
interval as 7 days.
Step 2: Once the above configuration is done, you need to copy the file in other
systems participating in the cluster.
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node5.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 3: Restart the NameNode service in node1.mylabs.com
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode
Step 4: Now let us test whether the trash is working or not. To do so, let us
upload one temporary file and delete the same.
hdfs dfs -mkdir /deletetest
hdfs dfs -put /home/hadoop/sample /deletetest/sample
hdfs dfs -ls -R /deletetest
hdfs dfs -rm -r /deletetest/sample
INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 10080 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoopvm:8020/deletetest/sample' to trash at: hdfs://hadoopvm:8020/user/hadoop/.Trash/Current
In case you want to recover the deleted file, you can do the same using the copy command:
hdfs dfs -cp /user/hadoop/.Trash/Current/deletetest/sample /deletetest/sample
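Two related commands are worth knowing once trash is enabled (a short sketch): -skipTrash deletes a file immediately without moving it to the trash, and -expunge removes trash checkpoints older than the configured interval.
hdfs dfs -rm -r -skipTrash /deletetest/sample
hdfs dfs -expunge
Use -skipTrash carefully, since data removed this way is not recoverable from the trash.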
Summary

HDFS is an append-only FileSystem.


HDFS stands for Hadoop Distributed File System. It is the storage part of
Hadoop that allows Bigdata files to be stored across the cluster in a
distributed and reliable manner.
It supports an NFS Gateway to mount HDFS using the NFSv3 protocol and use it like a normal local FileSystem.
You can maintain two types of Quotas in HDFS. They are Name Quota
and Space Quota.
The default block size of the HDFS in Generation2 is 128MB and the
default replication factor is 3.
Chapter 4
Understanding YARN and Schedulers

In this chapter we will learn,

What is YARN
Need of YARN
Understanding YARN operation
Calculating and Configuring YARN parameters
Schedulers in YARN
Configuring Capacity Scheduler
Configuring Fair Scheduler

What is YARN?
YARN (Yet Another Resource Negotiator) is one of the most important components in terms of processing scalability. Ideally, in Generation1 Hadoop, the only processing that was supported was MR processing, which is batch-oriented. However, when YARN was introduced, other processing techniques became possible. YARN is a general purpose, distributed, application management framework. YARN is a cluster management framework which supports multiple Java-based processing frameworks like Apache Spark, Apache Tez, Apache Flink, etc. Today, YARN is considered to be a pre-requisite for an Enterprise Hadoop Setup.
YARN enhances the capability of your existing Hadoop cluster by achieving:

Multi-tenancy - by allowing multiple Java-based frameworks to use HDFS as a storage layer for batch processing, interactive SQL and even real-time analysis.
Resource Management - by managing resources efficiently by
allocating cluster resources which improves the cluster utilization.
Compatibility - with legacy systems like MRv1 compatibility.

Need of YARN
The first two questions that generally arise once people know about Hadoop Generation1 are:
Why do we need YARN?
Why do we need a new resource management technique in a distributed computing system?

Precisely, as the name YARN suggests, it is just another resource management framework. In Hadoop Gen1, there exist HDFS and MR, as shown in the figure below:
HDFS, as discussed in previous chapter, refers to the storage. MapReduce is
the processing part of Hadoop. The services associated with MapReduce are
JobTracker and TaskTracker, where JobTracker is the master service and
TaskTracker is the slave service. Following are the operations done by
JobTracker:

JobTracker is the one who takes the processing request from the client.
JobTracker talks to the Namenode to get the metadata of the data which
is to be processed.
JobTracker figures out the best TaskTracker in the cluster for execution
based on data locality and available slots to execute a task of a given
operation.
JobTracker monitors the TaskTracker’s progress and reports back to the
client application.

Hence, one thing is clear - JobTracker handles Cluster Resource Management, Monitoring Jobs, Managing Jobs, Job Allocation, Job Cleanup, Job Tracking, Aggregating Outputs and more, which was a hectic task for a single service. If the load was high, there were high chances of JVM heap overflow issues resulting in a cluster crash. Thus, to tackle such situations, YARN came into the picture. In YARN, the idea is to have a global ResourceManager and a per-application ApplicationMaster.

Understanding YARN Operation


YARN comprises two services, namely ResourceManager and NodeManager, where the ResourceManager is the master service used to manage the use of resources across the Hadoop cluster and NodeManager is the slave service used to create, manage and monitor containers. Let's understand how YARN works. The client submits the job to the ResourceManager. The ResourceManager will perform the following checks:

a. Understand the type of Job (MAPREDUCE, HIVE, SPARK, etc.)


b. Check the feasibility of execution depending on type of job. Ensures all
necessary auxiliary jars are present for executing that specific job.
c. Check for the availability of resources (CPU, RAM, DISK).

If the above three checks are positive, then the ResourceManager will create an ApplicationMaster specific to that application. It will also instruct the NodeManagers to create containers which will perform the execution, supervised by that ApplicationMaster. Once the job execution is completed, the ResourceManager will ensure that resource de-allocation and garbage collection are performed.

Configuring YARN Resources


This is one of the most crucial jobs of an administrator. Luckily, Hortonworks
Documentation has simplified this process by providing some formulae for the
same. Let’s go step-by-step to determine YARN and MR Memory
configurations.
Step 1: When determining the appropriate YARN and MapReduce memory
configurations for a cluster node, start with the available hardware resources.
Make a note in the following table,

RAM (Amount of memory)


CORES (Number of CPU Cores)
DISKS (Number of Disks)

Step 2: Calculate the Reserved Memory. Ideally reserved memory is the RAM
needed by the system processes and other Hadoop processes.
RESERVED MEMORY = Reserved for System memory + Reserved for
HBASE Memory (If HBase is on same node).
The values for system memory and HBase memory can be derived from the
table below:
Total Memory per Node    Recommended Reserved System Memory    Recommended Reserved HBase Memory
4GB 1GB 1GB
8GB 2GB 1GB
16GB 2GB 2GB
24GB 4GB 4GB
48GB 6GB 8GB
64GB 8GB 8GB
72GB 8GB 8GB
96GB 12GB 16GB
128GB 24GB 24GB
256GB 32GB 32GB
512GB 64GB 64GB

Once you determine the RESERVED MEMORY, find the TOTAL


AVAILABLE MEMORY using the following formula: TOTAL AVAILABLE
MEMORY = RAM PER NODE – RESERVED MEMORY
Step 3: Determine the maximum number of containers allowed per node
Number of Containers = MIN ( 2 * CORES , 1.8 * DISKS, [Total Available
RAM] / MIN_CONTAINER_SIZE) where,
MIN_CONTAINER_SIZE can be derived from the following table:
Total RAM per Node Recommended Minimum Container Size
Less than 4GB 256MB
Between 4GB and 8GB 512MB
Between 8GB and 24GB 1024MB
Above 24GB 2048MB
Step 4: Calculate RAM per container
RAM-per-container = MAX (MIN_CONTAINER_SIZE , [Total Available
RAM] / No of Containers) Step 5: Based on all values calculated in the above
four steps, calculate the values for each property listed below based on the
following formulae: For yarn-site.xml
Property Name Property Value (Formula)
yarn.nodemanager.resource.memory-mb no of containers * RAM per container
yarn.scheduler.minimum-allocation-mb RAM per container
yarn.scheduler.maximum-allocation-mb no of containers*RAM per container

For mapred-site.xml
Property Name Property Value (Formula)
mapreduce.map.memory.mb RAM per container
mapreduce.reduce.memory.mb 2 * RAM per container
mapreduce.map.java.opts 0.8 * RAM per container
mapreduce.reduce.java.opts 0.8 * 2 * RAM per container
yarn.app.mapreduce.am.resource.mb 2 * RAM per container
yarn.app.mapreduce.am.command-opts 0.8 * 2 * RAM per container
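If you want to avoid doing this arithmetic by hand, the formulae above can be wrapped in a small shell helper. This is only a sketch; the hardware values below are assumptions that you should replace with your own figures (in GB), with MIN_CONTAINER taken from the minimum container size table above:
RAM=64; CORES=12; DISKS=10; RESERVED=16; MIN_CONTAINER=2   #example figures, replace with your own
AVAILABLE=$((RAM - RESERVED))
CONTAINERS=$(awk -v c="$CORES" -v d="$DISKS" -v a="$AVAILABLE" -v m="$MIN_CONTAINER" \
  'BEGIN { x = 2*c; y = 1.8*d; z = a/m; min = x; if (y < min) min = y; if (z < min) min = z; printf "%d", min }')
RAM_PER_CONTAINER=$(awk -v a="$AVAILABLE" -v n="$CONTAINERS" -v m="$MIN_CONTAINER" \
  'BEGIN { v = a/n; if (v < m) v = m; printf "%.2f", v }')
echo "Available RAM: ${AVAILABLE}GB, Containers: $CONTAINERS, RAM per container: ${RAM_PER_CONTAINER}GB"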

Let’s solve a problem statement to understand the implementation of the


formulae given above.
Consider a cluster having all nodes with the following specifications:
CPU Cores = 12
RAM = 64GB
No of disks = 10
The cluster will have HBase installed.
Solution
Step 1: Start with the available hardware resources.

RAM (Amount of memory) = 64GB


CORES (Number of CPU Cores) = 12
DISKS (Number of Disks) = 10

Step 2: Calculate the Reserved Memory.


RESERVED MEMORY = Reserved for System memory + Reserved for
HBASE Memory (If HBase is on same node) Referring to the table, substituting
the value
RESERVED MEMORY = 8 + 8 = 16GB
Once you determine the RESERVED MEMORY, find the TOTAL
AVAILABLE MEMORY using the following formula, TOTAL AVAILABLE
MEMORY = RAM PER NODE – RESERVED MEMORY
= 64 - 16
= 48GB
TOTAL AVAILABLE MEMORY (RAM) = 48GB
Step 3: Determine the maximum number of containers allowed per node.
Number of Containers = MIN ( 2 * CORES , 1.8 * DISKS, [Total Available RAM] / MIN_CONTAINER_SIZE)
= MIN (2 * 12 , 1.8 * 10, [48/2]) ---- Note: Here MIN_CONTAINER_SIZE is represented in GB
= MIN ( 24, 18, 24 )
= 18
Number of containers = 18
Step 4: Calculate RAM per container.
RAM-per-container = MAX (MIN_CONTAINER_SIZE , [Total Available RAM] / No of Containers)
= MAX (2, (48/18))
= MAX (2 , 2.66)
= 2.66GB, which we round down to 2GB here to keep the container sizes and the derived values below simple
Step 5: Based on all values calculated in the above four steps, calculate the
values for each property listed below based on the formula, For yarn-site.xml
Property Name Property Value (Formula)
yarn.nodemanager.resource.memory-mb no of containers * RAM per container
= 18 * 2
= 36GB
= 36 * 1024
= 36864MB
yarn.scheduler.minimum-allocation-mb RAM per container
= 2GB
= 2048MB
yarn.scheduler.maximum-allocation-mb no of containers*RAM per container
= 18 * 2
= 36GB
= 36 * 1024
= 36864MB

For mapred-site.xml
Property Name Property Value (Formula)
mapreduce.map.memory.mb RAM per container
= 2GB
= 2048MB
mapreduce.reduce.memory.mb 2 * RAM per container
= 2 * 2
= 4GB
= 4096MB
mapreduce.map.java.opts 0.8 * RAM per container
= 0.8 * 2
= 1.6GB
= 1638MB
mapreduce.reduce.java.opts 0.8 * 2 * RAM per container
= 3.2GB
= 3276MB
yarn.app.mapreduce.am.resource.mb 2 * RAM per container
= 2 * 2
= 4GB
= 4096MB
yarn.app.mapreduce.am.command-opts 0.8 * 2 * RAM per container
= 3.2GB
= 3276MB

So, following will be appended in yarn-site.xml:


vi /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>36864</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>36864</value>
</property>
vi /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx3276m</value>
</property>
Note that mapreduce.map.java.opts, mapreduce.reduce.java.opts and yarn.app.mapreduce.am.command-opts take JVM options, so the memory figures calculated above are expressed as -Xmx settings rather than plain numbers.
Schedulers in YARN
The role of schedulers is to allocate the resources to the applications based on
the policy defined by the administrator. There exist three schedulers, viz.:

a. FIFO
b. Capacity Scheduler
c. Fair Scheduler

The default scheduler in Hadoop Gen1 is FIFO, and in Hadoop Gen2 it is the Capacity Scheduler. The FIFO scheduler does not require any specific configuration. Its role is to maintain a queue of all job submissions and to run the jobs one by one, on a first come, first served basis. The FIFO scheduler is not designed for multi-tenant Hadoop environments. For multi-tenant environments, there exist the CapacityScheduler and the Fair Scheduler. In this book, we will explore the Capacity and Fair Scheduler configurations.
Before implementing schedulers, let’s first create multiple users and groups in
the system.

Creating Users and Groups in System


Lab 21 Creating Multiple Users and Groups in Ubuntu System

Problem Statement: Create users and groups as per the following table:

Groups Users
Default user1
Sales user2
Analytics user3

Solution
Step 1: Create Groups
sudo groupadd default
sudo groupadd sales
sudo groupadd analytics
Step 2: Create users and link the same with their groups
sudo useradd user1 -G default
sudo useradd user2 -G sales
sudo useradd user3 -G analytics

Understanding and Implementing Capacity Scheduler
The Capacity Scheduler allows multiple users to securely share a large cluster in terms of resource allocation based on the constraints set by the administrator. The core abstraction entity of the capacity scheduler is queues. It is the admin's responsibility to create the queues as per the business requirement to use the cluster resources.
Lab 22 Setting up Capacity Scheduler in YARN

Problem Statement: Configure your cluster to work on CapacityScheduler.


Create three queues, namely default, analytics and sales, where the total capacity used by each queue will be:
default – 40%
analytics – 40%
sales – 20%
Ensure that the user1 is mapped with the default queue, user2 is mapped with
the sales queue and user3 is mapped with the analytics queue.
Solution
When it comes to the CapacityScheduler queues, the top-level queue node is
called root and the sub-queue node comes under the root node as shown in the

diagram below:
Step 1: Ensure that the Hadoop services are live and active.
Step 2: Configure the CapacityScheduler in yarn-site.xml
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
Step 3: Configure the scheduler queues in capacity-scheduler.xml as shown below:
vi /home/hadoop/hadoop2/etc/hadoop/capacity-scheduler.xml
<configuration>
<!-- Initial Configuration -->
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>
<!-- Creating 3 queues i.e. default, sales and analytics -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,sales,analytics</value>
</property>
<!-- Setting default queue parameters -->
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
<value>*</value>
</property>
<!-- End for default queue -->
<!-- Setting sales queue parameters -->
<property>
<name>yarn.scheduler.capacity.root.sales.capacity</name>
<value>20</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.acl_administer_queue</name>
<value>*</value>
</property>
<!-- End for sales queue -->
<!-- Setting analytics queue parameters -->
<property>
<name>yarn.scheduler.capacity.root.analytics.capacity</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.acl_administer_queue</name>
<value>*</value>
</property>
<!-- End for analytics queue -->
<!-- User/Group to Queue Mapping -->
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value>g:sales:sales,g:default:default,g:analytics:analytics</value>
</property>
<!-- End for user to queue mapping -->
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>
</configuration>
Step 4: Refresh the Queue settings in real-time. Perform the following command
in ResourceManager system.
yarn rmadmin -refreshQueues
Step 5: Check the list of queues configured in the system.
mapred queue -list
17/07/04 00:42:18 INFO client.RMProxy: Connecting to
ResourceManager at /0.0.0.0:8032
======================
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0,
CurrentCapacity: 0.0
======================
Queue Name : sales
Queue State : running
Scheduling Info : Capacity: 20.0, MaximumCapacity: 100.0,
CurrentCapacity: 0.0
======================
Queue Name : analytics
Queue State : running
Scheduling Info : Capacity: 40.0, MaximumCapacity: 100.0,
CurrentCapacity: 0.0
Step 6: Execute an MR application to verify whether the request is managed by the specific queue or not.
sudo -u user1 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data1/emp /TestOP/WC1
sudo -u user2 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data1/emp /TestOP/WC2
sudo -u user3 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data1/emp /TestOP/WC3
Note that each run uses a different output directory, since MapReduce fails if the output path already exists.
Based on the configuration done, the request generated by user1 must be
managed by default, that by user2 must be managed by sales and the same
generated by user3 must be managed by analytics queue.
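If you ever need to bypass the group mapping and target a queue explicitly, MapReduce jobs accept the mapreduce.job.queuename property. A hedged example (the output path /TestOP/WCQueue is arbitrary):
sudo -u user2 hadoop2/bin/hadoop jar hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount -D mapreduce.job.queuename=analytics /data1/emp /TestOP/WCQueue
Since the ACLs above are set to *, this submission to the analytics queue should be accepted even though user2 belongs to the sales group.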

Understanding and Implementing Fair Scheduler

The Fair Scheduler enables the cluster to allocate resources fairly based on the queues and the weights assigned. In this section, we will learn how to configure the Fair Scheduler.
Lab 23 Setting up Fair Scheduler in YARN

Step 1: Ensure that the Hadoop services are live and active.
Step 2: Configure the FairScheduler in yarn-site.xml
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
<name>yarn.scheduler.fair.allocation.file</name>
<value>/home/hadoop/hadoop2/etc/hadoop/fair-scheduler.xml</value>
</property>
Step 3: Create fair-scheduler.xml and configure it for the users with queue and
weight.
vi /home/hadoop/hadoop2/etc/hadoop/fair-scheduler.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<allocations>
<queue name="prod">
<aclAdministerApps>hadoop</aclAdministerApps>
<aclSubmitApps>hadoop</aclSubmitApps>
<minResources>1000 mb,0vcores</minResources>
<maxResources>6000 mb,100vcores</maxResources>
<maxRunningApps>20</maxRunningApps>
<weight>1.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
</queue>
<queue name="dev">
<aclAdministerApps>user1</aclAdministerApps>
<aclSubmitApps>user1</aclSubmitApps>
<minResources>1000 mb,0vcores</minResources>
<maxResources>6000 mb,100vcores</maxResources>
<maxRunningApps>20</maxRunningApps>
<weight>1.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
</queue>
<user name="hadoop">
<maxRunningApps>5</maxRunningApps>
</user>
<user name="user1">
<maxRunningApps>1</maxRunningApps>
</user>
<user name="user2">
<maxRunningApps>1</maxRunningApps>
</user>
<userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
Step 4: Refresh the Queue settings in real-time. Perform the following command
in ResourceManager system.
yarn rmadmin -refreshQueues
In case the queues are not refreshed, you can restart the ResourceManager service, and you will observe in the WebUI under the Scheduler tab that the FairScheduler has been set successfully.
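To confirm the allocation file was picked up, you can list the queues again in the same way as in the Capacity Scheduler lab; the prod and dev queues should now appear in the output:
mapred queue -list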

Summary

YARN is a general purpose, distributed, application management


framework.
YARN comprises two services, namely ResourceManager and NodeManager, where the ResourceManager is the master service used to manage the use of resources across the Hadoop cluster and NodeManager is the slave service used to create, manage and monitor the containers.
The default scheduler in Hadoop Gen1 is FIFO and in Hadoop Gen2, it is the Capacity Scheduler.
Chapter 5
HDFS Federation and Upgrade

In this chapter we will learn,

HDFS Federation and its need.


Implementing Federation in a 4 node cluster.
Implementing viewFS in a federated cluster.
Understanding HDFS upgrade.
Implementing Upgrade from Gen1 to Gen2.
Implementing Upgrade from Hadoop-2.7.3 to Hadoop-2.8.0 in an HA-
enabled cluster.

Introduction
HDFS has two layers: the Namespace layer and the Block Management layer.
The Namespace layer is responsible for maintaining the directory structure and metadata of
the data that is stored in the storage layer. The block management layer is
responsible for maintaining the DataNode membership information, storage
operations like create, delete, modify, and fetching locations of blocks associated
with the file. It also supports replication and replica placement during the HDFS
write operation.

Concerns with Existing HDFS Architecture

a. Tight coupling between Namespace and block management layer.


b. Only one NameNode supported per cluster.
c. No multi-tenancy support, so less ROI.

The biggest problem was that there existed only one namespace and one block
management. Both of these are tightly coupled with each other. If any of the
components were to fail, the entire cluster would go down. The second
predominant concern in this architecture was that only one NameNode system
can exist because of this tight coupling. The third concern was Isolation. Ideally,
any large production cluster (1000+ nodes) is expected to have some multi-
tenant environment so that multiple organizations/groups/teams can share the
cluster resulting in more ROI (Return on Investment). However since there
exists only one NameNode, this is something that is unachievable.

Understanding HDFS Federation


HDFS Federation is a concept introduced in Hadoop Generation 2 with an
intention to solve the above problems. HDFS Federation supports multiple
Namespaces in the same cluster to provide scalability and isolation of the
NameNode machine. Ideally, this concept was released keeping the NameNode
service in mind. HDFS federation allows for NameNode scaling where each
NameNode machine will be isolated from other NameNodes. The DataNodes
will register themselves with all the NameNodes participating in the cluster by
periodically sending the heartbeat to each NameNode. The block diagram below
depicts HDFS Federation.

Let us understand each term associated with Federation.

a. Block Pool – It is a set of blocks that belong to a single NameNode


Namespace.
b. Namespace – It is a term used to represent the directory structure and
metadata of the data that is stored in the storage layer.
c. Namespace Volume – The set of Namespace and block pool of a
NameNode in the cluster is considered as Namespace Volume.
d. ClusterID – It is an identifier used to identify all the nodes in the cluster. While formatting multiple NameNodes, you must ensure that the clusterID you set is the same, else the node will not participate in the cluster.

In the above block diagram, there exist two Namespace volumes. They are
nn1.mylabs.com and nn2.mylabs.com that hold their own independent
namespace and block pool and are managed independent of other block pools
available in the cluster. In this setup, when you delete a Namespace, the
associated block pool will also get deleted and the DataNode will delete the
blocks. However, this will not affect the operations of other Namespace
volumes.
Following is the benefit of this kind of setup:

a. It supports scalability of the NameNode and isolation resulting in multi-


tenancy.
b. It opens a channel towards a generic storage service in each NameNode. For example, the NFS gateway can be applied to NameNode1 independently of NameNode2.

Let us see how we can practically implement Federation.

Setting up HDFS Federation


Lab 24 Setting up HDFS Federation in a 4 Node Cluster

Problem Statement1: Create a 4 node cluster setup from scratch with the
following specification:
Solution
Understand your setup and fill in the table given below. Ensure that in Step 1
(under hosts file configuration) you add the IP address of your machine.
Node Name    Desired Hostname    IP Address (for example purpose)    Your Machine/VM's IP address
Node1    nn1.mylabs.com    192.168.1.1
Node2    nn2.mylabs.com    192.168.1.2
Node3    dn1.mylabs.com    192.168.1.3
Node4    dn2.mylabs.com    192.168.1.4

Step 1: Setup network Hostname and host file configuration on all 4 machines.
In a real production setup, we use a dedicated DNS server for Hostnames.
However, for our lab setup, we will setup local host files as resolvers. Hadoop
ideally recommends working on Hostname rather than IP addresses for host
resolution.
Perform the following steps in each machine.
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname as mentioned
#in the previous step. For example, for Node1 it will be
nn1.mylabs.com
sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the following lines in all 4
#machines. This file holds the information of all machines which
#need to be resolved. Each machine will have entries of all 4
#machines that are participating in the cluster installation.

Once the configuration is done, you will need to restart the machines for
Hostname changes to take effect. To restart your machine, you can type the
following command: sudo init 6
Step 2: Setup SSH password-less setup between your NameNode system and the
DataNode system. In our example setup, we need an SSH setup password-less
configuration between nn1.mylabs.com to the DataNodes participating in the
cluster, and nn2.mylabs.com to the DataNodes participating in the cluster. This is
done so that node1 can contact other nodes to invoke the services. The reason for
this kind of configuration is that the NameNode system is the single point of
contact for all the users and administrators.
Perform the following commands on nn1.mylabs.com and nn2.mylabs.com:

a. Generate SSH keys .


ssh-keygen (Press Enter till the command gets completed. You do not
need to fill anything including the password, since we need to setup a
‘password-less’ key)
This command will generate the public key(id_rsa.pub) and the private
key(id_rsa) in the home folder under .ssh directory.
b. Register the public key on dn1.mylabs.com and dn2.mylabs.com.
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@dn1.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@dn2.mylabs.com
Step 3: Setup Hadoop on all four nodes in Standalone mode.
Step 4: Setup core-site.xml in all four nodes
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdfsdrive</value>
</property>
</configuration>
Step 5: Setup hdfs-site.xml in Node1 to configure distributed storage (HDFS)
configurations and copy the same in other nodes that are participating in the
cluster.
vi /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nn1</name>
<value>nn1.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nn2</name>
<value>nn2.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.nn1</name>
<value>nn1.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.nn2</name>
<value>nn2.mylabs.com:50070</value>
</property>
</configuration>
In the above configuration, we are informing the cluster about the 2
NameNodes - nn1 and nn2 that are mapped with nn1.mylabs.com and
nn2.mylabs.com. We have also setup the ports for http and rpc communication.
Step 6: Setup the Slave configuration in the cluster. You need to perform this step in Node1 and Node2. This configuration informs the cluster which nodes will have the slave services (DataNode and NodeManager). Usually, this configuration is placed only in the NameNode machines.
vi /home/hadoop/hadoop2/etc/hadoop/slaves
#Delete localhost and replace the same with dn1 and dn2 as shown
dn1.mylabs.com
dn2.mylabs.com
Step 7: Format the NameNode. You need to perform this step on both
NameNodes independently since this command will ensure that it creates
multiple Namespace volumes isolated with each other. This setup ensures that
Hadoop will create a FileSystem in the location that is defined in
hadoop.tmp.dir of core-site.xml. In our case, it is /home/hadoop/hdfsdrive.
It is recommended that you check the Hadoop version prior to your formatting
as a Best Practice. This practice will enable you to be conscious with the Hadoop
version that the system has accepted. This is required when you are connected to
PuTTY using SSH, since sometimes the environment may get updated but the
session may not get refreshed. In case the session does not get refreshed, it is
recommended that you re-login the SSH shell.
hadoop version
On nn1.mylabs.com,
hdfs namenode -format -clusterId federationDemo
On nn2.mylabs.com,
hdfs namenode -format -clusterId federationDemo
Step 8: Let us check the setup. Let us create two directories in two different NameNodes:
hdfs dfs -mkdir hdfs://nn1.mylabs.com:9000/sales
hdfs dfs -mkdir hdfs://nn2.mylabs.com:9000/analytics
hdfs dfs -ls hdfs://nn1.mylabs.com:9000/
The result for the above command will have only one directory named sales.
hdfs dfs -ls hdfs://nn2.mylabs.com:9000/
The result for the above command will have only one directory named analytics.
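You can also verify that the two namespaces are isolated by uploading a file through one NameNode and looking for it through the other (a small sketch reusing the sample file from the earlier labs):
hdfs dfs -put /home/hadoop/sample hdfs://nn1.mylabs.com:9000/sales/sample
hdfs dfs -ls hdfs://nn2.mylabs.com:9000/sales
The second command is expected to fail, since /sales exists only in nn1's namespace.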
Congratulations! You have successfully implemented federation in your
cluster. Now if you observe this cluster from the point of view of the admin, you
will see that there exists a single point of failure issue in each NameNode and no
failover mechanism. We can convert this existing cluster into an HA-enabled
cluster that we will learn in the next chapter.
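Before moving on, you can further convince yourself that the two namespaces are isolated. The following is a small sketch: it copies a sample local file (here /etc/hosts, purely as an example) into the sales namespace and shows that the second NameNode knows nothing about it.
hdfs dfs -put /etc/hosts hdfs://nn1.mylabs.com:9000/sales/
hdfs dfs -ls hdfs://nn1.mylabs.com:9000/sales (the file appears here)
hdfs dfs -ls hdfs://nn2.mylabs.com:9000/sales (this should report 'No such file or directory', since /sales exists only in nn1's namespace)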

Understanding and Implementing ViewFS in Existing HDFS Federated Cluster

viewFS is a type of client-side mount table, implemented entirely in memory on the client side, that provides a customized, unified view of the filesystem namespace. viewFS is addressed with the URI scheme "viewfs:///".
Lab 25 Implementing viewFS in Existing 4 Node Federated Cluster

Implement viewFS with the following specification:

viewFS mount point     Actual HDFS mount point
/production            hdfs://nn1.mylabs.com/prod
/testing               hdfs://nn2.mylabs.com/test
We will be using the above created federated cluster for this lab.
Step 1: Create two folders in the root location of HDFS named prod and test on nn1.mylabs.com and nn2.mylabs.com respectively, as shown below:
On nn1.mylabs.com,
hdfs dfs -mkdir hdfs://nn1.mylabs.com:9000/prod
On nn2.mylabs.com,
hdfs dfs -mkdir hdfs://nn2.mylabs.com:9000/test
Step 2: Stop the HDFS service.
stop-dfs.sh
Step 3: Setup core-site.xml which will hold the configurations for viewFS. Once
configured, replicate the same in all other nodes participating in the cluster.
On nn1.mylabs.com,
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<!-- Append the following properties within the configuration tag -->
<property>
<name>fs.default.name</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./production</name>
<value>hdfs://nn1.mylabs.com:9000/prod</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./testing</name>
<value>hdfs://nn2.mylabs.com:9000/test</value>
</property>
Save the file. Now copy this file to the other nodes.
On nn1.mylabs.com,
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@nn2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@dn1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@dn2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 4: Start HDFS services.
start-dfs.sh
Step 5: Check whether viewFS is working or not,
hdfs dfs -ls /production
hdfs dfs -mkdir /production/demo1
hdfs dfs -mkdir /testing/demo2
hdfs dfs -ls /testing
You will observe that, physically, the demo1 folder is placed and maintained by nn1.mylabs.com and demo2 by nn2.mylabs.com. Using viewFS, however, the client sees both under a single, transparent namespace.
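If you want to confirm where the data physically lives, you can list the same folders through the underlying namespace URIs. The commands below are a small verification sketch based on the mount table configured above:
hdfs dfs -ls hdfs://nn1.mylabs.com:9000/prod (should show demo1)
hdfs dfs -ls hdfs://nn2.mylabs.com:9000/test (should show demo2)
hdfs dfs -ls viewfs:///production (the client-side view of the same data)
hdfs dfs -ls viewfs:///testing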

Understanding HDFS upgrade


One of the most common administration tasks is to upgrade the cluster, typically once every four to six months. An HDFS upgrade simply means moving the cluster from one Hadoop version to another. In this section we will learn about two kinds of upgrade:

a. Upgrading from Gen1 to Gen2 (hadoop-1.2.1 to hadoop-2.7.3)


b. Upgrading from hadoop-2.7.3 HA-enabled cluster to hadoop-2.8.0 HA
enabled cluster.

Typically, a major upgrade introduces downtime in your cluster. However, from hadoop-2.4.1 onwards, a rolling upgrade is available that incurs no downtime if your existing cluster is in HA mode. In this chapter, we will limit our discussion to upgrading Gen1 to Gen2. We will cover rolling upgrades in the next chapter, because a rolling upgrade requires converting a normal cluster into an HA-enabled cluster, which we will also discuss there.

Performing HDFS Upgrade from Gen1 to Gen2


Lab 26 Performing Hadoop Upgrade from Gen1 (1.2.1) to Gen2 (2.7.3)
Problem Statement 2: Upgrade your existing Gen1 (1.2.1) cluster into a Gen2 (2.7.3) cluster with the following specification.

Node Hostname        Existing Cluster (hadoop-1.2.1)    After Upgrade (hadoop-2.7.3)
node1.mylabs.com     NameNode, JobTracker               NameNode, ResourceManager
node2.mylabs.com     SecondaryNameNode                  SecondaryNameNode
node3.mylabs.com     DataNode, TaskTracker              DataNode, NodeManager
node4.mylabs.com     DataNode, TaskTracker              DataNode, NodeManager

The diagram below depicts the existing cluster. You can create your 4 node Gen1 cluster by referring to my blog at http://bigdataclassmumbai.com

Step 1: Ensure that there are no pending upgrades active on the system. Please note that your Hadoop HDFS services must be online and running before performing the following commands:
hadoop version (The output should be hadoop-1.2.1)
hadoop dfsadmin -upgradeProgress status
Step 2: Record the report of the current HDFS.
hadoop fsck / -files -blocks -locations > dfs-v-old-fsck-1.log
hadoop dfs -lsr / > dfs-v-old-lsr-1.log
hadoop dfsadmin -report > dfs-v-old-report-1.log
Step 3: Stop Hadoop services.
stop-all.sh
Step 4: Install Hadoop 2.7.3 as per the instructions given in Setting up Hadoop in Distributed Mode (Multinode Cluster) in the Hadoop Installation and Deployment chapter, from Step 3 to Step 8. Ensure you give different port numbers in fs.default.name and the other parameters. The reason we are installing hadoop-2.7.3 is so that we can later perform a rolling upgrade from 2.7.3 to 2.8.0.
Step 5: Start the upgrade process.
hadoop version
Ensure the version shown is 2.7.3. If you are using PuTTY or a similar terminal client, restart the session and retry. Once you have confirmed the Hadoop version, initiate the upgrade using the following command:
hadoop-daemon.sh start namenode -upgrade
This step upgrades only the NameNode metadata (the filesystem index), not the data blocks. Hadoop will start the upgrade and bring up the NameNode service in Safemode.
Step 6: Perform the following command:
hadoop dfs -lsr / > dfs-v-new-lsr-0.log
Compare it with the dfs-v-old-lsr-1.log file to ensure none of the metadata is lost. Once confirmed, you can start the HDFS service so that the data block reports get in sync with the new cluster.
Step 7: Start the HDFS cluster
start-dfs.sh
Note that your NameNode service will already be running. You will observe that the DataNodes start exchanging block information with the NameNode. The NameNode will come out of Safemode within about 30 seconds, subject to network bandwidth. Perform the relevant checks.
hdfs dfsadmin -report > dfs-v-new-report-1.log
hdfs dfs -lsr / > dfs-v-new-lsr-1.log
Compare these with the old files.
Step 8: Once assured that all your files are available and consistent with the previous data structure, finalize the upgrade using:
hdfs dfsadmin -finalizeUpgrade
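Note that finalizing is a one-way step. Until you run finalizeUpgrade, it is possible to abandon the upgrade and return to hadoop-1.2.1 using the rollback option. The outline below is only a hedged sketch and assumes the old hadoop-1.2.1 installation is still present on all nodes:
stop-dfs.sh (run from the hadoop-2.7.3 installation, to stop the upgraded services)
start-dfs.sh -rollback (run from the old hadoop-1.2.1 installation; this restores the pre-upgrade metadata and block layout)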

Summary

HDFS federation allows NameNode scaling, where each NameNode machine is isolated from the other NameNodes.
A major upgrade introduces downtime in your cluster. However, from hadoop-2.4.1 onwards, a rolling upgrade is available that incurs no downtime if your existing cluster is in HA mode.
Chapter 6
Apache Zookeeper Admin Basics

In this chapter, you will learn:

What is Zookeeper?
Zookeeper in terms of Hadoop and its Ecosystem
How a Zookeeper Ensemble initializes and works
Setting up Zookeeper in Standalone mode
Setting up Zookeeper in Leader-Follower mode
Reading and Writing in Zookeeper

What is Apache Zookeeper?


Apache Zookeeper is a coordination service that can be used in distributed computing applications. Zookeeper is designed with the known problems of distributed systems in mind. It can perform tasks like:

a. Configuration Management
b. Naming Service
c. Synchronization
d. Group Services

This chapter is set to explain the basics of Zookeeper and how it works at a
basic level. We will also see the uses of Zookeeper in Hadoop HA setup and
HBase setup.
Let us now start with a basic understanding of Zookeeper design in a nutshell.

Zookeeper Design
The following diagram shows the working of a typical Zookeeper cluster:
This setup contains a 3 node Zookeeper cluster. Zookeeper can be configured
to work in two modes:

a. Standalone mode – In this mode, Zookeeper is installed in a single


machine.
b. Leader-Follower Mode – In this mode, Zookeeper is installed in
multiple machines such that one of the machines will be the leader while
the others will be followers.

The above diagram is of a Zookeeper setup in Leader-Follower mode also


known as a Zookeeper Ensemble. Zookeeper maintains the state information of distributed applications in an in-memory store, along with transaction logs and snapshots on persistent storage. As long as a majority of the servers is available (at least 2 out of 3 in this setup), the Zookeeper service remains available.
The client application can write the state or other related information in
Zookeeper by connecting to any other Zookeeper node. However, the ‘write’
operation is done by the leader and the same is broadcasted to other follower
nodes for HA. In case the client application is connected to a follower, and
performs write operation, the same write operation will be forwarded to the
leader and broadcasted to the followers. Zookeeper maintains the client
application database in ZNodes.
ZNodes form a hierarchical data structure. A ZNode can be of three types, namely Ephemeral ZNodes, Sequential ZNodes, and Persistent ZNodes. When a Zookeeper cluster is implemented for a Hadoop HA requirement, Hadoop uses Zookeeper to perform failure detection and active NameNode election. HBase uses Zookeeper for coordination between the HMaster and the HRegionServers; the HBase services use Zookeeper to share state information. In both cases, i.e. Hadoop and HBase, Ephemeral ZNodes are used for storing and sharing the data.
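As a quick illustration of the three ZNode types, the zkCli.sh commands below (flags as in Zookeeper 3.4.x) create one ZNode of each kind; the names used are just examples for this sketch:
create /demoPersistent "data1" (persistent ZNode: survives client disconnects until explicitly deleted)
create -e /demoEphemeral "data2" (ephemeral ZNode: removed automatically when the creating session closes)
create -s /demoSeq "data3" (sequential ZNode: Zookeeper appends a monotonically increasing counter to the name, e.g. /demoSeq0000000012)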
The biggest question now arises: "How many Zookeeper servers should be used in a Zookeeper Ensemble?" The recommended number of Zookeeper servers in an ensemble is 5. You can, however, plan the size using this formula:
(2*nF)+1 = N
Where:
nF is the number of failures the Ensemble must tolerate.
N is the number of Zookeeper servers in the Ensemble.
So when we say 5, the cluster can tolerate up to 2 failures, with a majority of 3. The majority factor is essential for the Zookeeper election process that decides which Zookeeper server becomes the leader. The reason we always use an odd number of servers is to maintain this majority factor.
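For example, applying the formula to the two most common ensemble sizes:
nF = 1: N = (2*1)+1 = 3 servers (majority of 2, so one server can fail)
nF = 2: N = (2*2)+1 = 5 servers (majority of 3, so two servers can fail)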

How Zookeeper Ensemble Initializes and Works?


Let us understand how the Zookeeper Ensemble initializes. When we setup a
fresh Zookeeper Ensemble, we set some RPC ports in a zoo.cfg configuration
file. Let us understand the communication pattern in Zookeeper based on the diagram shown below:

When we start the services in all Zookeeper servers, the service will initiate an
election with each other to figure out which Zookeeper server in the Ensemble
will become the leader. The leader will perform the write operation and will also
broadcast the write to the other Zookeeper servers. The election is done using port 3888 (by default). This port is configured in the server.x parameter, as defined in Step 3 of Problem Statement 1. An example of the server configuration is as follows:
server.1=node1.mylabs.com:2888:3888
Where:
2888 is meant for internal broadcast communications between the leader
Zookeeper Server to multiple follower Zookeeper Servers.
3888 is the port defined for Leader Election.
When a client connects to Zookeeper, it connects through the clientPort, which is by default 2181. Using this port, the client writes data into the Zookeeper in-memory store, which is saved as snapshots and transaction logs on the persistent storage specified by the dataDir property. If the write is done on the leader, the leader broadcasts it to all followers. If the client connects to a follower and performs a write, the follower forwards it to the leader, and the leader then performs the broadcast.
In case the leader dies due to a planned or unplanned event, the other
Zookeeper servers will initiate the election to choose another leader. Technically,
Zookeeper data is highly available due to broadcasting and is also fault-tolerant.
This is possible since another server will take over the charge if the current
leader dies.
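If you want to quickly see which server is currently the leader without logging in to each machine, you can query Zookeeper's 'stat' four-letter command on the client port. This is a small sketch that assumes netcat (nc) is installed on the machine you run it from and uses the example hostnames of this book's lab setup:
echo stat | nc node1.mylabs.com 2181 | grep Mode
echo stat | nc node2.mylabs.com 2181 | grep Mode
echo stat | nc node3.mylabs.com 2181 | grep Mode
Each command prints a line like "Mode: leader" or "Mode: follower" (or "Mode: standalone" for a single-server setup).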

Implementing Zookeeper in Standalone Mode


Lab 27 Setting up Zookeeper in Standalone Mode

Problem Statement 1 – Setup Zookeeper in Standalone mode We will set up


a Zookeeper standalone cluster. The intention of this problem statement is
to make you comfortable with the configurations and working.
Solution
Step 1: You may use your Pseudo-distributed Hadoop machine to perform this practical. Download Zookeeper using the following command. By default the tar file will be downloaded to the /home/hadoop location.
wget http://www-us.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
Step 2: Extract the tar file in the home folder.
tar -xvzf zookeeper-3.4.6.tar.gz
Step 3: Setup Environment Variable for Zookeeper.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once you save the .bashrc file, reload the environment variables using the following command:
exec bash
Step 4: Configure Zookeeper to work in Standalone mode.
vi /home/hadoop/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
clientPort=2181
initLimit=5
syncLimit=2
dataDir=/home/hadoop/zookeeper-3.4.6/data
dataLogDir=/home/hadoop/zookeeper-3.4.6/logs
server.1=node1.mylabs.com:2888:3888
The above is the minimum recommended configuration required to spin up a
Zookeeper cluster. Let us understand each configuration in further detail.
Parameter Name    Explanation

tickTime    tickTime is the Zookeeper heartbeat time, represented in 'ticks'. Zookeeper uses the 'tick' as the standard time unit for other configurations like initLimit, syncLimit, and so on. We are configuring it as 2000, which means 1 tick = 2000 milliseconds. For the Zookeeper heartbeat mechanism, the minimum session timeout is 2 ticks, i.e. 4000 milliseconds = 4 seconds.

clientPort    This is the RPC port for external communication. It is used by Zookeeper clients to interact with the Zookeeper Ensemble.

initLimit    The amount of time, in ticks, allowed for follower nodes to connect to the leader node and perform the initial synchronization. Ideally, initLimit must be higher if the amount of data managed by Zookeeper is large. This parameter applies to the initial/first-time connection between a follower and the leader. In our configuration, initLimit is 5 ticks, i.e. 5*2000 milliseconds = 10000 milliseconds = 10 seconds.

syncLimit    The amount of time, in ticks, allowed for a follower to synchronize with the leader. This value is used after the follower has initialized; during initialization, initLimit is used instead. In our configuration, we set syncLimit to 2 ticks, i.e. 2*2000 milliseconds = 4000 milliseconds = 4 seconds.

dataDir    The local folder where Zookeeper stores the snapshots of the in-memory store and, by default, the transaction logs of updates done to the store. In production, we prefer a separate location/drive for maintaining snapshots and transaction logs. dataDir ensures that the snapshots of the in-memory store persist. For transaction logs, you can use dataLogDir, as explained in the next parameter.

dataLogDir    A local folder used to maintain the transaction logs of updates done to the in-memory store of the Zookeeper server.

server.x    This parameter is used to set up a Zookeeper Ensemble. As explained before, Ensemble is the term for a Zookeeper cluster. In our current setup, we have configured only one server for standalone mode. You will see this in action in our multinode Zookeeper installation.

Step 5: Create the necessary folders for Zookeeper and assign ownership.


mkdir -p /home/hadoop/zookeeper-3.4.6/data
mkdir -p /home/hadoop/zookeeper-3.4.6/logs
Here, the data folder will be used to store the snapshots of the Zookeeper in-memory store and the Zookeeper data. The logs folder will be used to maintain the transaction logs of the updates done to the in-memory store.
Step 6: Create the myid file in the data folder (the value of myid will be 1 for us).
vi /home/hadoop/zookeeper-3.4.6/data/myid
1
We have configured server.1 in the zoo.cfg file. The myid file holds the unique id of the Zookeeper server and is stored in the data folder of the Zookeeper server.
Step 7: Start the Zookeeper service using the following command:
zkServer.sh start
Once started, check whether the service has started or not. To do so, type:
jps
If you see the QuorumPeerMain service in the list, the Zookeeper service is running. To check the state of Zookeeper (Standalone in our case), perform the following command:
zkServer.sh status
If it shows the following, your configuration is successful.
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: standalone
Implementing Zookeeper in Leader-Follower Mode
Lab 28 Setting up Zookeeper in Leader-Follower mode

Problem Statement 2 – Setting up Zookeeper Ensemble (Leader-Follower


Mode) In this use-case, we will see how to setup a 3 node Zookeeper
Ensemble. We will also use this setup in our HDFS and YARN High
Availability setups.
Solution
Following is the type of setup we are going to perform:

Step 1: You may use your multinode Hadoop cluster to perform this practical. Download Zookeeper on node1.mylabs.com using the following command. By default the tar file will be downloaded to the /home/hadoop location.
wget http://www-us.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
Step 2: Extract the tar file in the home folder in node1.mylabs.com.
tar -xvzf zookeeper-3.4.6.tar.gz
Step 3: Setup Environment Variable for Zookeeper in node1.mylabs.com.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once you save the .bashrc file, reload the environment variables using the following command:
exec bash
Step 4: Configure Zookeeper to work in Ensemble mode on node1.mylabs.com.
vi /home/hadoop/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
clientPort=2181
initLimit=5
syncLimit=2
dataDir=/home/hadoop/zookeeper-3.4.6/data
dataLogDir=/home/hadoop/zookeeper-3.4.6/logs
server.1=node1.mylabs.com:2888:3888
server.2=node2.mylabs.com:2888:3888
server.3=node3.mylabs.com:2888:3888
Step 5: Create necessary folders for Zookeeper and assign ownership.
mkdir -p /home/hadoop/zookeeper-3.4.6/data
mkdir -p /home/hadoop/zookeeper-3.4.6/logs
Here, the data folder will be used to store the snapshots of the Zookeeper in-memory store and the Zookeeper data. The logs folder will be used to maintain the transaction logs of the updates done to the in-memory store.
Step 6: Create the myid file in the data folder (the value of myid will be 1 for us).
vi /home/hadoop/zookeeper-3.4.6/data/myid
1
We have configured server.1 in the zoo.cfg file. The myid file holds the unique id of each Zookeeper server and is stored in the data folder of that server.
Step 7: Once you have completed all steps up to Step 6, copy this entire setup to the remaining two nodes (node2.mylabs.com and node3.mylabs.com). This saves you from redoing all the steps on the other systems and also confirms that SSH communication between the nodes is working well (SSH communication, however, has no relation to Zookeeper itself). Perform the following commands on node1.mylabs.com:
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@node2.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/zookeeper-3.4.6 hadoop@node3.mylabs.com:/home/hadoop/.
Step 8: Let us make the necessary changes/configurations on node2.mylabs.com and node3.mylabs.com.
On node2.mylabs.com:
a. Setup the .bashrc file with the Zookeeper environment variables.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once the changes are saved, reload the environment variables using the following command:
exec bash
b. Change the myid file present in the data folder of Zookeeper.
vi /home/hadoop/zookeeper-3.4.6/data/myid
2

On node3.mylabs.com:
a. Setup the .bashrc file with the Zookeeper environment variables.
vi /home/hadoop/.bashrc
#Add the following lines at the beginning of the file.
export ZOOKEEPER_HOME=/home/hadoop/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Once the changes are saved, reload the environment variables using the following command:
exec bash
b. Change the myid file present in the data folder of Zookeeper.
vi /home/hadoop/zookeeper-3.4.6/data/myid
3
Step 9: Start the Zookeeper service on all 3 nodes manually using the following command:
zkServer.sh start
Once started, check whether the service has started or not. To do so, type:
jps
If you see the QuorumPeerMain service in the list, the Zookeeper service is running. To check the state of each Zookeeper server, perform the following command:
zkServer.sh status
If it shows the following, your configuration is successful. One system will show the mode as leader while the other two systems will show the mode as follower.
JMX enabled by default
Using config: /home/hadoop/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: leader
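Since this cluster already has password-less SSH from node1 to the other nodes, you can check the mode of all three servers in one go. The loop below is only a convenience sketch and assumes the Zookeeper installation path used in this lab:
for host in node1.mylabs.com node2.mylabs.com node3.mylabs.com
do
  echo "=== $host ==="
  ssh hadoop@$host "/home/hadoop/zookeeper-3.4.6/bin/zkServer.sh status"
done
One host should report Mode: leader and the other two Mode: follower.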
Working with Zookeeper CLI
Lab 29 Running basic commands in Zookeeper CLI

Problem Statement 3 – Demo on Zookeeper CLI


Create a ZNode named zkDemo, insert the string "GoodMorning" as its data, and read it back. Once done, change the string from "GoodMorning" to "GoodNight".
Solution
The intention of this demo is to give you an idea of how to view the data that is stored in the Zookeeper store. This is useful while troubleshooting HA-enabled clusters and HBase problems. This practical can be done in either setup (Standalone or Ensemble). In this case, we will perform it on the multinode (Ensemble) setup so that we can also check the High Availability and fault-tolerance features of Zookeeper.
Step 1: Start zkCli.sh command line shell for Zookeeper.
zkCli.sh
Step 2: Let us check whether any ZNodes already exist. Type ls /
[zk: localhost:2181(CONNECTED) 1] ls /
[rmstore, hadoop-ha, yarn-leader-election, zookeeper]
Since we are using an HA-enabled cluster, we get the above output. Now let us create a ZNode named 'zkDemo' at the root of the Zookeeper hierarchy with the string "GoodMorning".
[zk: localhost:2181(CONNECTED) 2] create /zkDemo
“GoodMorning”
Created /zkDemo
To check whether node is created with data:
[zk: localhost:2181(CONNECTED) 3] ls /
[zkDemo, rmstore, yarn-leader-election, hadoop-ha, zookeeper]
[zk: localhost:2181(CONNECTED) 4] get /zkDemo
“GoodMorning”
cZxid = 0x70000000c
ctime = Sun Jun 18 19:02:00 IST 2017
mZxid = 0x70000000c
mtime = Sun Jun 18 19:02:00 IST 2017
pZxid = 0x70000000c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 13
numChildren = 0
Let us edit the data from GoodMorning to GoodNight:
[zk: localhost:2181(CONNECTED) 3] set /zkDemo “GoodNight”
cZxid = 0x70000000c
ctime = Sun Jun 18 19:02:00 IST 2017
mZxid = 0x800000008
mtime = Sun Jun 18 19:06:54 IST 2017
pZxid = 0x70000000c
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
[zk: localhost:2181(CONNECTED) 4] get /zkDemo “GoodNight”
cZxid = 0x70000000c
ctime = Sun Jun 18 19:02:00 IST 2017
mZxid = 0x800000008
mtime = Sun Jun 18 19:06:54 IST 2017
pZxid = 0x70000000c
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
Now let us delete the ZNode:
[zk: localhost:2181(CONNECTED) 5] delete /zkDemo
Summary
Zookeeper is meant to co-ordinate between heterogeneous processes in a
distributed environment.
Zookeeper Ensemble is a group of Zookeeper servers that form a Leader-
Follower Configured cluster.
Zookeeper is ideally used in Hadoop HA to maintain the state information
of the NameNode and ResourceManager service.
Zookeeper is used in HBase to ensure HBase services are synchronized
with each other.
To internally see the ZNodes created, zkCli.sh is a handy tool provided in
Zookeeper.
In production, a minimum of 5 Zookeeper servers is recommended in a Zookeeper Ensemble.
Chapter 7
High Availability in Apache Hadoop

In this chapter, you will learn:

Why we need High Availability.
NameNode HDFS HA Methods.
Fencing in Hadoop.
HDFS HA using QJM.
HDFS HA using Shared Storage.
YARN HA using Zookeeper.
Setting up a Fresh HDFS-HA Enabled Cluster using QJM.
Setting up a Fresh YARN-HA Enabled Cluster using QJM.
Setting up HDFS & YARN HA in an existing non-HA cluster using QJM.
Building a Federated HA-enabled cluster.
Rolling Upgrade in Hadoop.

Introduction
In an Apache Hadoop cluster, the NameNode system is a single point of failure. If the NameNode or ResourceManager service fails, the entire cluster goes down; neither storage nor processing requests will be accepted by the cluster. Thus, to make the cluster highly available, Apache Hadoop introduced NameNode (HDFS) HA and ResourceManager HA, where NameNode/HDFS HA protects the storage services and ResourceManager HA protects the processing and cluster management services from known and unknown disasters.

Need of High Availability


We typically need High Availability to manage the following two scenarios:

1. Planned and scheduled maintenance of the NameNode server. This can be a software or hardware upgrade, patching, or an audit, where the service may need to be stopped or restarted.
2. Unplanned maintenance and disasters, which may result in downtime.

To address the above-mentioned problems, the HA configuration allows the cluster to maintain two NameNodes, namely an Active NameNode and a Standby NameNode. The Active NameNode is responsible for serving all client operations and performing all responsibilities, whereas the Standby NameNode stays in sync by maintaining enough state information to provide a fast failover in case of an unplanned disaster or an administrator-initiated failover. Up to Hadoop 2.x, it was not possible to have more than two NameNodes. However, in Hadoop 3.x, which is in the alpha3 stage at the time of writing this book, more than two NameNodes are supported. It is suggested, however, not to exceed 5 NameNodes, with 3 NameNodes recommended, to avoid communication overheads.

Methods for HA Configuration


In Apache Hadoop, there exist two configurations to achieve High Availability.
They are:

1. Using Quorum Journal Manager (QJM).


2. Using NFS Shared Storage.

In production, the most preferred method is the use of the Quorum Journal
Manager. However, in this book, we will explore both methods.

Cluster HA using Quorum Journal Manager


Given below is an example setup network diagram of the HDFS HA-enabled
cluster using QJM.
In this setup, we have 3 machines (zkjn1, zkjn2 and zkjn3) acting as Journal and Zookeeper nodes.
Journal Nodes are distributed storage systems that store the NameNode metadata edits. The role of a Journal Node is to durably maintain the edits written by the active NameNode during any write operation in HDFS. An edit is considered successfully written only when it is written to a majority of the Journal Nodes. In short, Journal Nodes are used in an HA-enabled cluster to keep the Active and Standby NameNodes synchronized. The Standby NameNode, during its transition from Standby state to Active state, applies all edits from the Journal Nodes before becoming the active NameNode. The Journal Nodes use an epoch number to determine which requests come from the current Active NameNode, and during fencing they will not honor requests from the Standby (old active) NameNode. You can get more detailed information on the working of Journal Nodes at the link given below:
https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf
Zookeeper maintains the NameNode state information written by the ZKFC client service. Technically, it records which system in the cluster is currently the Active NameNode. Zookeeper is used to avoid a split-brain scenario, so that the NameNode state does not become inconsistent during a failover.
The next component/service is ZKFC i.e. ZooKeeper Failover Controller. This
service is responsible for:

1. Monitoring the health and liveliness of the states of the NameNode.


2. Automatic failover when Active NameNode is not available.
3. Monitor and maintain active lock of NameNode state in Zookeeper.

There are two ZKFC processes in the cluster, each residing on individual
NameNodes. ZKFC is a Zookeeper client process that uses Zookeeper to
maintain the session information and state information of the NameNode
(Active). ZKFC also initiates state transitions and fencing when performing
failovers. To learn more about ZKFC, its design and working, refer the below
documentation link:
https://issues.apache.org/jira/secure/attachment/12519914/zkfc-design.pdf

Understanding Fencing in Hadoop HA


Fencing is not a term introduced by Hadoop. It is used in most distributed systems (especially storage systems), usually where there is a need to maintain a finite state of a process. Fencing is a method used to bring an HA cluster to a known finite state. In Hadoop, the role of fencing is to establish certainty about which NameNode is active in the cluster. In short, fencing prevents the previously active NameNode from continuing to write to the JournalNodes; it ensures that only one NameNode is the writer at any time.
Hadoop provides a range of fencing mechanisms like:

Killing the NameNode service.
Revoking access to the shared edit logs in the NFS directory.
Disabling the port using a remote management command.
Forcibly powering the machine down using a specialized power distribution unit (STONITH – Shoot The Other Node In The Head).

Always remember that fencing is invoked whenever the Active NameNode is not reachable over the network (heartbeat). To monitor this and to invoke fencing, a watchdog service, ZKFC, is used. It is the ZKFC that triggers fencing so that no split-brain scenario occurs.
In Hadoop HDFS HA, we can implement two types of Fencing:

a. sshfence – This mechanism uses SSH to connect to the old Active NameNode machine and uses the fuser command to kill the process. This method of fencing works well if the machine is reachable over the network. But what happens if the system is switched off due to a power failure? What if the network interface card gets burned or damaged? Obviously, sshfence cannot kill the process, so no state change and transition occurs. To avoid this situation, we prefer to use shell as a fencing method in production setups. Let us understand shell fencing now.
b. shell – The shell mechanism simply runs the specified script. If the script returns true, the fencing is considered a success and the state transition occurs. This means that whichever NameNode was on standby will become active under any and all conditions. If you don't want to write a script, you can just pass the string 'shell(/bin/true)', which means that when fencing is initiated by ZKFC, the script returns true, resulting in a state transition. (A configuration sketch combining both methods is shown after this list.)
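As an illustration, below is a hedged hdfs-site.xml sketch that configures sshfence with a shell fallback; fencing methods listed in dfs.ha.fencing.methods are tried in order, one per line, and the private key path shown mirrors the lab setup used later in this chapter:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence
shell(/bin/true)</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
With this configuration, ZKFC first tries to kill the old active NameNode over SSH; if that host is unreachable, the shell(/bin/true) fallback still allows the standby to take over.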

To know more about the HA implementation work done by the contributors and their design rationale, refer to the paper given below:
https://issues.apache.org/jira/secure/attachment/12480489/NameNode%20HA_v2_1.pdf

HDFS HA Parameters
Let us understand the key parameters responsible for setting up HDFS HA. The following parameters must always be set in hdfs-site.xml with example property values.
The following parameters must be set in core-site.xml with example property values:
Let us now understand the key parameters responsible for the YARN/ResourceManager HA setup. The following parameters must be set in yarn-site.xml with example property values:

Implementing HDFS HA in Fresh Cluster


Lab 30 Installing and Configuring 4-Node HDFS HA-Enabled Fresh
Cluster

Problem Statement 1 – Create a fresh cluster with HDFS HA enabled.


We will need to setup a 4 node cluster with the following specifications as
given in the diagram:
Node Name    Host Name           Services Running
Node1        node1.mylabs.com    NameNode (active), DFSZKFailoverController, QuorumPeerMain, JournalNode
Node2        node2.mylabs.com    NameNode (standby), DFSZKFailoverController, QuorumPeerMain, JournalNode
Node3        node3.mylabs.com    DataNode, JournalNode, QuorumPeerMain
Node4        node4.mylabs.com    DataNode

Solution
Understand your initial setup and fill in the table given below. Ensure that in Step 1 (under the hosts file configuration) you add the IP address of your respective machine.

Node Name    Desired Hostname      IP Address for Example Purpose    Your Machine/VM's IP Address
Node1        node1.mylabs.com      192.168.1.1
Node2        node2.mylabs.com      192.168.1.2
Node3        node3.mylabs.com      192.168.1.3
Node4        node4.mylabs.com      192.168.1.4

Step 1: Setup the network hostname and hosts file configuration on all 4 machines. In a real production setup, we use a dedicated DNS server for hostnames. However, for our lab setup, we will use local hosts files as resolvers. Hadoop ideally recommends working with hostnames rather than IP addresses for host resolution.
Perform the following steps on each machine:
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname as mentioned in the previous step.
#For example, for node1 it will be node1.mylabs.com
node1.mylabs.com
sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the entries of all 4 machines.
#This file holds the information of all machines which need to be resolved.
#Each machine will have entries for all 4 machines participating in the cluster installation.

Once the configuration is done, you will need to restart the machines for the hostname changes to take effect. To restart your machine, type the following command:
sudo init 6
Step 2: Setup password-less SSH between the NameNode system and the other systems. In our example setup, we need a password-less SSH configuration between node1.mylabs.com and the other nodes participating in this cluster. We do this so that node1 can contact the other nodes to invoke all of the services. The reason for this kind of configuration is that the NameNode system is the single point of contact for all users and administrators.
Perform the following commands on node1:

a. Generate SSH keys.
ssh-keygen
(Press Enter until the command completes. You do not need to fill in anything, including the password, since we need to set up a 'password-less' key.)
This command will generate the public key (id_rsa.pub) and the private key (id_rsa) in the home folder under the .ssh directory.
b. Register the public key on node1, node2, node3, and node4.
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node1.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node2.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node3.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node4.mylabs.com

After pressing Enter, you may get a prompt to register the system in known_hosts. Type 'yes' and press Enter. Once that is done, you will see a prompt for the password. Enter the user password (in our case it will be 123456) and press Enter. You will see a feedback line stating that the key has been added successfully.
Step 3: Install Hadoop 2.8.0 in Standalone mode in all 4 nodes. Refer Setting up
Hadoop-2.8.0 in Standalone Mode (CLI MiniCluster) in case you require
additional help.
Step 4: Setup core-site.xml on Node1 and copy the same to the other nodes participating in the cluster. Add the following configuration in core-site.xml:
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cognitocluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdfsdrive</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/hadoop/journal/node/local/data</value>
</property>
</configuration>
In the above configuration, we define the URI used to access the HDFS storage, which can then be used by the HDFS clients. It also holds the location of the JournalNode local storage. Once done, save the file and copy it to the other nodes. This step is mandatory since this file informs the cluster of the network locality of the NameNode systems. To copy it to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 5: Setup hdfs-site.xml on Node1 to configure the distributed storage (HDFS) configurations and copy the same to the other nodes participating in the cluster.
vi /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>cognitocluster</value>
</property>
<property>
<name>dfs.ha.namenodes.cognitocluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cognitocluster.nn1</name>
<value>node1.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cognitocluster.nn2</name>
<value>node2.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.cognitocluster.nn1</name>
<value>node1.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cognitocluster.nn2</name>
<value>node2.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.mylabs.com:8485;node2.mylabs.com:8485;node3.mylabs.com:8485/cognitocluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>node1.mylabs.com:2181,node2.mylabs.com:2181,node3.mylabs.com:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.cognitocluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
</configuration>
In the above configuration, we have provided all the necessary information required to set up HDFS HA, as discussed in the previous topics. Now, save the file and copy it to the other nodes. To copy the file to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Step 6: Setup the slaves file on node1.mylabs.com and ensure that the DataNode service will start only on node3.mylabs.com and node4.mylabs.com. Once set, copy the same to node2.mylabs.com.
vi /home/hadoop/hadoop2/etc/hadoop/slaves
node3.mylabs.com
node4.mylabs.com
scp -r /home/hadoop/hadoop2/etc/hadoop/slaves
hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/slaves
Step 7: Create the hdfsdrive folder and the JournalNode local folders on all 4 systems (node1.mylabs.com, node2.mylabs.com, node3.mylabs.com, node4.mylabs.com).
mkdir /home/hadoop/hdfsdrive
mkdir -p /home/hadoop/journal/node/local/data
Step 8: Setup the Zookeeper service on node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com and ensure that it follows the leader-follower approach. You can refer to the previous chapter for the Zookeeper Leader-Follower installation steps. Once the installation is done, ensure the Zookeeper services are live and active on all three nodes (node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com).
Step 9: Format the Zookeeper Failover Controller. This creates a ZNode dedicated to storing the information conveyed by the DFSZKFailoverController service. As discussed in the previous section, this is all about maintaining the information of the active NameNode for the current session. You need to perform the following command on node1.mylabs.com:
hdfs zkfc -formatZK
Ensure you get success lines similar to the ones shown below (last 4 lines):
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Session connected.
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/cognitocluster in ZK.
17/06/18 13:40:51 INFO zookeeper.ClientCnxn: EventThread shut down
17/06/18 13:40:51 INFO zookeeper.ZooKeeper: Session: 0x35cba37e2070000 closed
In case you get any errors during this format, check and confirm whether the Zookeeper service is running on all three systems. Also ensure that one of them is the leader and the others are followers.

Step 10: Start the JournalNode service on node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com manually using the following command:
hadoop-daemon.sh start journalnode
Step 11: Format the NameNode on node1.mylabs.com. This will ensure that node1.mylabs.com becomes the active NameNode. Perform the following command on node1.mylabs.com:
hdfs namenode -format
Ensure you get the success line. In case you get any error during this step, ensure that the JournalNode service is live and active on all 3 intended nodes and that the hdfsdrive folder has the necessary write permissions for the current user.
Step 12: Start the NameNode service on node1.mylabs.com. This will initialize the creation of the filesystem metadata. You can start it using the following command:
hadoop-daemon.sh start namenode
Step 13: Now bootstrap the metadata to node2.mylabs.com. This is done so that the metadata is shared from node1.mylabs.com to node2.mylabs.com, and it ensures that node2.mylabs.com becomes the standby NameNode. This can be done by running the following command on node2.mylabs.com:
hdfs namenode -bootstrapStandby
You will get something similar to the lines shown below, which denote successful bootstrapping:
17/06/18 13:46:06 INFO common.Storage: Storage directory /home/hadoop/hdfsdrive/dfs/name has been successfully formatted.
17/06/18 13:46:06 INFO namenode.TransferFsImage: Opening connection to
http://node1.mylabs.com:50070/imagetransfer?
getimage=1&txid=1&storageInfo=-63:874026095:0:CID-b9a78825-a81b-4a23-931b-da9e2b7208e9
17/06/18 13:46:06 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000
milliseconds 17/06/18 13:46:07 INFO namenode.TransferFsImage: Transfer took 0.00s at 0.00 KB/s
17/06/18 13:46:07 INFO namenode.TransferFsImage: Downloaded file
fsimage.ckpt_0000000000000000001 size 353 bytes.
17/06/18 13:46:07 INFO util.ExitUtil: Exiting with status 0

Step 14: Stop all services from node1.mylabs.com. Don't worry if you see warning or error messages; this simply reports which services were running and needed a stop command.
stop-all.sh
Step 15: Start the HDFS service for the cluster from node1.mylabs.com. Before starting, ensure node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com have the Zookeeper service running in Leader-Follower configuration. Start HDFS by performing the following command:
start-dfs.sh
The following is the expected service layout for each node:

To check the service state of a NameNode (active/standby), type the following commands:
hdfs haadmin -getServiceState nn1 (nn1 is the logical name of node1.mylabs.com as configured in the hdfs-site.xml file)
hdfs haadmin -getServiceState nn2 (nn2 is the logical name of node2.mylabs.com as configured in the hdfs-site.xml file)
Testing Failover
We can check whether NameNode automatic failover is working or not by killing the NameNode process on node1.mylabs.com, assuming node1.mylabs.com holds the active NameNode. You can get the process id using the jps command. To kill the process, type:
kill -9 <process_id>
e.g. kill -9 5878
After killing the process, type the following command to check the state. If the state has transitioned from standby to active, your failover mechanism is working and the configuration is successful.
hdfs haadmin -getServiceState nn2
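While testing, it can be handy to watch both NameNodes continuously so you can see the transition happen. The loop below is just a monitoring sketch built from the same haadmin commands; press Ctrl+C to stop it:
while true
do
  echo -n "nn1: "; hdfs haadmin -getServiceState nn1 2>/dev/null || echo "not reachable"
  echo -n "nn2: "; hdfs haadmin -getServiceState nn2 2>/dev/null || echo "not reachable"
  echo "----"
  sleep 5
done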
Implementing YARN HA in Fresh Cluster
Lab 31 Configuring YARN ResourceManager HA in the 4 Node HDFS HA-
Enabled Cluster

Problem Statement 2 – Extend the above setup to enable YARN ResourceManager HA.
We need to enable High Availability of the ResourceManager service in the existing 4 node cluster created above, as represented in the diagram below:

Solution
Ensure that you have set up the HDFS HA cluster as per the solution to Problem Statement 1 and that all services are live and active. Now perform the following steps:
Step 1: On node1.mylabs.com, configure yarn-site.xml with the following parameters:
vi /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cognitocluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>node1.mylabs.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node2.mylabs.com</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>node1.mylabs.com:2181,node2.mylabs.com:2181,node3.mylabs.com:2181</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>node1.mylabs.com:9026</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node2.mylabs.com:9026</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
</configuration>
Now copy the same file to node2.mylabs.com, node3.mylabs.com, and node4.mylabs.com:
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
Step 2: Setup mapred-site.xml. This configuration is responsible for maintaining the MapReduce job configurations. We will configure MR jobs to run on top of YARN. This file is not available by default; however, a template file is provided. Let us now configure mapred-site.xml:
cp /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml.template /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
vi /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 3: Start the ResourceManager manually on node1.mylabs.com and node2.mylabs.com.
yarn-daemon.sh start resourcemanager
Step 4: To check the service state of the ResourceManagers (active/standby), type the following commands:
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
Testing ResourceManager Failover
We can check whether ResourceManager automatic failover is functional or not by killing the ResourceManager process on node1.mylabs.com, assuming that node1.mylabs.com holds the active ResourceManager. You can get the process id using the jps command. To kill the process, type:
kill -9 <process_id>
e.g. kill -9 5878
After killing the process, type the following command to check the state. If the state has transitioned from standby to active, your failover mechanism is working and the configuration is successful.
yarn rmadmin -getServiceState rm2
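To confirm that the HA-enabled YARN setup can actually schedule work, you can submit the example MapReduce job bundled with Hadoop. This is only a sketch; the exact jar file name depends on the Hadoop version installed under /home/hadoop/hadoop2:
yarn jar /home/hadoop/hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10
The job is accepted by whichever ResourceManager is currently active. Since yarn.resourcemanager.recovery.enabled is set to true with the ZKRMStateStore, a running application should be re-registered with the new active ResourceManager if you kill the current one mid-run.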
Implementing HDFS and YARN HA in an Existing non-HA Cluster
Lab 32 Configuring HDFS and YARN ResourceManager HA in an Existing Non-HA Enabled Cluster without Any Data Loss
Problem Statement 3 – Enable HDFS and YARN HA on an existing 4 node non-HA cluster.
We need to enable High Availability of the NameNode and ResourceManager services in our existing 4 node cluster, as represented in the figure below. We need to achieve the configuration below while ensuring that the existing data in HDFS remains unaffected.
Solution
Step 1: Stop all Hadoop services if running.
Step 2: Setup core-site.xml on Node1 and copy the same to the other nodes participating in the cluster. Add the following configuration in core-site.xml:
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cognitocluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdfsdrive</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/hadoop/journal/node/local/data</value>
</property>
</configuration>
In the above configuration, we define the URI used to access the HDFS storage, which can then be used by the HDFS clients. It also holds the location of the JournalNode local storage. Once done, save the file and copy it to the other nodes. This step is mandatory since this file informs the cluster of the network locality of the NameNode systems. To copy it to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 3: Setup hdfs-site.xml on Node1 to configure the distributed storage (HDFS) configurations and copy the same to all other nodes participating in the cluster.
vi /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>cognitocluster</value>
</property>
<property>
<name>dfs.ha.namenodes.cognitocluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-
address.cognitocluster.nn1</name>
<value>node1.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-
address.cognitocluster.nn2</name>
<value>node2.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.http-
address.cognitocluster.nn1</name>
<value>node1.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-
address.cognitocluster.nn2</name>
<value>node2.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.mylabs.com:8485;node2.mylabs.com:8485;node3.mylabs.com:8485/cognitocluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>node1.mylabs.com:2181,node2.mylabs.com:2181,node3.mylabs.com:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.cognitocluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
</configuration>
In the above configuration, we have provided all the necessary information required to set up HDFS HA, as discussed in the previous topics. Now, save the file and copy it to all other nodes. To copy the file to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Step 4: On node1.mylabs.com, configure yarn-site.xml with the following parameters:
vi /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cognitocluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>node1.mylabs.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node2.mylabs.com</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>node1.mylabs.com:2181,node2.mylabs.com:2181,node3.mylabs.com:2181</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>node1.mylabs.com:9026</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node2.mylabs.com:9026</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
</configuration>
Now copy the same file to node2.mylabs.com, node3.mylabs.com, and node4.mylabs.com:
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node3.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/yarn-site.xml hadoop@node4.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/yarn-site.xml
Step 5: Setup the Zookeeper service on node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com and ensure that it follows the leader-follower approach. You can refer to the previous chapter for the Zookeeper Leader-Follower installation steps. Once the installation is done, ensure the Zookeeper services are live and active on all three nodes (node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com).
Step 6: Format the Zookeeper Failover Controller. This creates a ZNode dedicated to storing the information conveyed by the DFSZKFailoverController service. As discussed in the previous section, this is all about maintaining the information of the active NameNode for the current session. You need to perform the following command on node1.mylabs.com:
hdfs zkfc -formatZK
Ensure you get success lines similar to the ones shown below (last 4 lines):
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Session connected.
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/cognitocluster in ZK.
17/06/18 13:40:51 INFO zookeeper.ClientCnxn: EventThread shut down
17/06/18 13:40:51 INFO zookeeper.ZooKeeper: Session: 0x35cba37e2070000 closed
In case you get any errors during this format, check and confirm whether the Zookeeper service is running on all three systems. Also ensure that one of them is the leader and the others are followers.
Step 7: Create JournalNode folder and start JournalNode service in
node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com manually using
the following command: hadoop-daemon.sh start journalnode
Step 8: Now we need to initialize the shared edits so that a copy of the existing edits is placed in the local directory of each JournalNode. This can be achieved by performing the following command in node1.mylabs.com:
hdfs namenode -initializeSharedEdits
Step 9: Now bootstrap the metadata to node2.mylabs.com. This is done so that
the metadata will be shared from node1.mylabs.com to node2.mylabs.com. Now,
node2.mylabs.com will become the standby NameNode. This can be done using
the following command on node2.mylabs.com:
hdfs namenode -bootstrapStandby
You will get something similar to the lines as shown below that denote
successful bootstrapping: 17/06/18 13:46:06 INFO common.Storage: Storage
directory /home/hadoop/hdfsdrive/dfs/name has been successfully
formatted.
17/06/18 13:46:06 INFO namenode.TransferFsImage: Opening connection to
http://node1.mylabs.com:50070/imagetransfer?
getimage=1&txid=1&storageInfo=-63:874026095:0:CID-b9a78825-a81b-4a23-931b-da9e2b7208e9
17/06/18 13:46:06 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000
milliseconds 17/06/18 13:46:07 INFO namenode.TransferFsImage: Transfer took 0.00s at 0.00 KB/s
17/06/18 13:46:07 INFO namenode.TransferFsImage: Downloaded file
fsimage.ckpt_0000000000000000001 size 353 bytes.
17/06/18 13:46:07 INFO util.ExitUtil: Exiting with status 0

Step 10: Stop all services from node1.mylabs.com. Don't worry if you see any warning or error messages; the script simply attempts to stop every service, whether or not it is currently running.
stop-all.sh
Step 11: Start the HDFS services for the cluster from node1.mylabs.com. Before starting, ensure that node1.mylabs.com, node2.mylabs.com, and node3.mylabs.com have the Zookeeper service running in Leader-Follower configuration. This can be achieved by performing the following command:
start-dfs.sh
Step 12: Start ResourceManager manually in node1.mylabs.com and
node2.mylabs.com.
yarn-daemon.sh start resourcemanager
Step 13: Start the NodeManager services from node1.mylabs.com using the following command (yarn-daemons.sh starts a NodeManager on every node listed in the slaves file):
yarn-daemons.sh start nodemanager
Following is the expected service layout for each node.

Building a Federated HA-Enabled Cluster


Lab 33 Building a Federated HA-Enabled Cluster

In this section, we will configure a federated HA-enabled cluster. For our hands-
on exercise we will create a 5 node cluster with the following specification.
Team          Node Name    HostName                 Services Running
DBA           Node1        dba1.mylabs.com          NameNode(active), DFSZKFailoverController, QuorumPeerMain, JournalNode
DBA           Node2        dba2.mylabs.com          NameNode(standby), DFSZKFailoverController, QuorumPeerMain, JournalNode
Analytics     Node3        analytics1.mylabs.com    NameNode(active), DFSZKFailoverController, QuorumPeerMain, JournalNode
Analytics     Node4        analytics2.mylabs.com    NameNode(standby), DFSZKFailoverController
Shared HDFS   Node5        dn1.mylabs.com           DataNode, JournalNode

Scenario: We need to create a federated HA-enabled cluster such that the same
cluster can be shared to two project teams viz.

a. DBA Team
b. Analytics Team

The intention of this cluster is to re-use the same storage for multiple teams while keeping their metadata separate. We also need to remove the single point of failure by introducing a standby NameNode for each namespace, resulting in an HA-enabled environment.
Solution
A typical Federated HA enabled cluster looks something like this in a
production setup.

In the figure above, we have two sets of NameNodes, one for the DBA team and one for the Analytics team. Each team has one active and one standby NameNode, which removes the single point of failure. However, for our setup we will create something similar to the one shown below.
Let's start. Understand your setup and fill in the table given below. Ensure that in Step 1 (under hosts file configuration) you use the IP addresses of your machines.
Node Name    Desired Hostname (Example)    IP Address for Your Machine/VMs
Node1        dba1.mylabs.com               192.168.1.1
Node2        dba2.mylabs.com               192.168.1.2
Node3        analytics1.mylabs.com         192.168.1.3
Node4        analytics2.mylabs.com         192.168.1.4
Node5        dn1.mylabs.com                192.168.1.5

Step 1: Setup network Hostname and host file configuration on all 5 machines.
In a real production setup, we use a dedicated DNS server for Hostnames.
However, for our lab setup, we will setup local host files as resolvers. Hadoop
ideally recommends working on Hostname rather than IP addresses for host
resolution.
Perform the following steps in each machine.
sudo vi /etc/hostname
#Replace the existing hostname with the desired hostname as
#mentioned in the previous step. For example, for Node1 it will be
#dba1.mylabs.com
dba1.mylabs.com
sudo vi /etc/hosts
#Comment the 127.0.1.1 line and add the following lines in all 5
#machines. This file holds the information of all machines which
#need to be resolved. Each machine will have entries of all 5
#machines that are participating in the cluster installation.
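For reference, with the example addresses from the table above, the /etc/hosts entries would look like the following (substitute the actual IP addresses of your machines):
192.168.1.1 dba1.mylabs.com
192.168.1.2 dba2.mylabs.com
192.168.1.3 analytics1.mylabs.com
192.168.1.4 analytics2.mylabs.com
192.168.1.5 dn1.mylabs.com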

Once the configuration is done, you will need to restart the machines for
Hostname changes to take effect. To restart your machine, you can type the
following command: sudo init 6
Step 2: Set up password-less SSH between the primary NameNode system and the other systems. In our example setup, we need password-less SSH from node1 (dba1.mylabs.com) to the other nodes participating in the cluster. This is done so that node1 can contact the other nodes to invoke the services; the NameNode system is the single point of contact for users and administrators.
Perform the following commands on node1:

a. Generate SSH keys.


ssh-keygen (Press Enter till the command gets completed. You do not
need to fill anything including the password, since we need to setup a
‘password-less’ key)
This command will generate the public key(id_rsa.pub) and the private
key(id_rsa) in the home folder under .ssh directory.
b. Register the public key on node1, node2, node3, node4, and node5.
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub
hadoop@dba1.mylabs.com ssh-copy-id -i
/home/hadoop/.ssh/id_rsa.pub hadoop@dba2.mylabs.com
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub
hadoop@analytics1.mylabs.com ssh-copy-id -i
/home/hadoop/.ssh/id_rsa.pub
hadoop@analytics2.mylabs.com ssh-copy-id -i
/home/hadoop/.ssh/id_rsa.pub hadoop@dn1.mylabs.com
Where:
hadoop is the username
After pressing Enter, you may get a prompt to register the system to
known_hosts. Type ‘yes’ and press Enter. Once that’s done, you may see a
prompt for the password. Enter the user’s password (in our case it will be
123456) and press Enter. You will see a feedback line stating that the key has
been added successfully.
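To confirm that the password-less setup works, try logging in to one of the registered nodes from node1; it should not ask for a password (dn1.mylabs.com is used here only as an example):
ssh hadoop@dn1.mylabs.com
exit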
Step 3: Install Hadoop 2.8.0 in Standalone mode in all 5 nodes. Refer Setting up
Hadoop-2.8.0 in Standalone Mode (CLI MiniCluster) in case you require
additional help.
Step 4: Setup core-site.xml in Node1 and copy the same in other nodes that are
participating in the cluster. Add the following configuration in core-site.xml:
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./dba</name>
<value>hdfs://dba</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./analytics</name>
<value>hdfs://analytics</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdfsdrive</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/hadoop/journal/node/local/data</value>
</property>
</configuration>
In the above configuration, we define the client-side view of the federated storage: fs.defaultFS points to a viewfs mount table whose /dba and /analytics links resolve to the dba and analytics nameservices. The file also records the JournalNode local storage location. Once done, save the file and copy it to the other nodes; this step is mandatory because every node needs this file to resolve the nameservices in the cluster. To copy it to the remote systems, perform the following commands:
scp -r
/home/hadoop/hadoop2/etc/hadoop/core-site.xml
hadoop@dba2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-
site.xml scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml
hadoop@analytics1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-
site.xml scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml
hadoop@analytics2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-
site.xml scp -r /home/hadoop/hadoop2/etc/hadoop/core-site.xml
hadoop@dn1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/core-site.xml
Step 5: Setup hdfs-site.xml in Node1 to configure distributed storage (HDFS)
configurations and copy the same in other nodes that are participating in the
cluster.
vi /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>dba,analytics</value>
</property>
<property>
<name>dfs.ha.namenodes.dba</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.ha.namenodes.analytics</name>
<value>nn3,nn4</value>
</property>
<property>
<name>dfs.namenode.rpc-address.dba.nn1</name>
<value>dba1.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.dba.nn2</name>
<value>dba2.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.analytics.nn3</name>
<value>analytics1.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.analytics.nn4</name>
<value>analytics2.mylabs.com:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.dba.nn1</name>
<value>dba1.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.dba.nn2</name>
<value>dba2.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.analytics.nn3</name>
<value>analytics1.mylabs.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.analytics.nn4</name>
<value>analytics2.mylabs.com:50070</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/hadoop/journal/node/local/data</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir.dba</name>
<value>qjournal://dba1.mylabs.com:8485;dba2.mylabs.com:8485;analytics1.mylabs.com:8485/dba</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir.analytics</name>
<value>qjournal://dba1.mylabs.com:8485;dba2.mylabs.com:8485;analytics1.mylabs.com:8485/analytics</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>dba1.mylabs.com:2181,dba2.mylabs.com:2181,analytics1.mylabs.com:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.dba</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.analytics</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
</configuration>
In the above configuration, we have provided all the information required to set up HDFS HA for both nameservices, as discussed in the previous topics. Now save the file and copy it to the other nodes. To copy the file to the remote systems, perform the following commands:
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@dba2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@analytics1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@analytics2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
scp -r /home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml hadoop@dn1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/hdfs-site.xml
Step 6: Setup the slaves file in dba1.mylabs.com to ensure the DataNode service will start only on dn1.mylabs.com. Once set, copy the same file to dba2.mylabs.com, analytics1.mylabs.com and analytics2.mylabs.com.
vi /home/hadoop/hadoop2/etc/hadoop/slaves
dn1.mylabs.com
scp -r /home/hadoop/hadoop2/etc/hadoop/slaves
hadoop@dba2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/slaves
scp -r /home/hadoop/hadoop2/etc/hadoop/slaves
hadoop@analytics1.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/slaves
scp -r /home/hadoop/hadoop2/etc/hadoop/slaves
hadoop@analytics2.mylabs.com:/home/hadoop/hadoop2/etc/hadoop/slaves
Step 7: Create hdfsdrive folder and Journal Node local folders in all the 5
systems (dba1.mylabs.com, dba2.mylabs.com, analytics1.mylabs.com,
analytics2.mylabs.com and dn1.mylabs.com) .
mkdir /home/hadoop/hdfsdrive
mkdir -p /home/hadoop/journal/node/local/data
Step 8: Setup the Zookeeper service in dba1.mylabs.com, dba2.mylabs.com, and analytics1.mylabs.com and ensure it follows the leader-follower approach. You can refer to the previous chapter for the Zookeeper Leader-Follower installation steps. Once installation is done, ensure the Zookeeper services are live and active on all three nodes (dba1.mylabs.com, dba2.mylabs.com and analytics1.mylabs.com).
Step 9: Format the Zookeeper Failover Controller. This is done to create a ZNode dedicated to storing information conveyed by the DFSZKFailoverController service. As discussed in the previous section, it's all about maintaining the identity of the active NameNode for the current session. Because each nameservice has its own failover controller, you need to perform the following command on dba1.mylabs.com for the DBA NameNode and on analytics1.mylabs.com for the Analytics NameNode.
On dba1.mylabs.com
hdfs zkfc -formatZK
Ensure you get the success line similar to the one shown below (last 4 lines).
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Session connected.
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/dba in ZK.
17/06/18 13:40:51 INFO zookeeper.ClientCnxn: EventThread shut down
17/06/18 13:40:51 INFO zookeeper.ZooKeeper: Session: 0x35cba37e2070000 closed
On analytics1.mylabs.com

hdfs zkfc -formatZK


Ensure you get the success line similar to the one shown below (last 4 lines).
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Session connected.
17/06/18 13:40:51 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/analytics in ZK.
17/06/18 13:40:51 INFO zookeeper.ClientCnxn: EventThread shut down
17/06/18 13:40:51 INFO zookeeper.ZooKeeper: Session: 0x35cba37e2070000 closed
In case you get any errors during this format, check and confirm whether the Zookeeper service is running on all three systems, and also ensure that one of them is a leader and the others are followers.
Step 10: Start JournalNode service in dba1.mylabs.com, dba2.mylabs.com and
analytics1.mylabs.com manually using the following command, hadoop-
daemon.sh start journalnode
Step 11: Format the NameNode in dba1.mylabs.com and analytics1.mylabs.com so that each has its own independent metadata, with the shared edits stored on the common JournalNodes. This will also make dba1.mylabs.com and analytics1.mylabs.com the active NameNodes. To make this a federated cluster, both namespaces must share a common clusterID, which can be set during the formatting phase. You need to perform the following commands:
On dba1.mylabs.com
hdfs namenode -format -clusterID CognitoITCluster
On analytics1.mylabs.com
hdfs namenode -format -clusterID CognitoITCluster
Ensure you get the success line. In case you get any error during this step, ensure the JournalNode service is live and active on all three intended nodes and that the hdfsdrive folder has the necessary write permissions for the current user.
Step 12: Start the namenode service in dba1.mylabs.com and
analytics1.mylabs.com. This will initialize the creation of filesystem metadata on
both independent NameNodes. You can start them using the following commands:
On dba1.mylabs.com
hadoop-daemon.sh start namenode
On analytics1.mylabs.com
hadoop-daemon.sh start namenode
Step 13: Now bootstrap the metadata to dba2.mylabs.com and analytics2.mylabs.com from dba1.mylabs.com and analytics1.mylabs.com respectively. The metadata is copied from dba1.mylabs.com to dba2.mylabs.com, which then becomes the standby NameNode for the DBA namespace, and from analytics1.mylabs.com to analytics2.mylabs.com, which then becomes the standby NameNode for the Analytics namespace. This can be done using the following commands:
On dba2.mylabs.com
hdfs namenode -bootstrapStandby
On analytics2.mylabs.com
hdfs namenode -bootstrapStandby
You will get lines similar to the ones shown below, which denote that bootstrapping is successful.
17/06/18 13:46:06 INFO common.Storage: Storage directory /home/hadoop/hdfsdrive/dfs/name has
been successfully formatted.
17/06/18 13:46:06 INFO namenode.TransferFsImage: Opening connection to
http://node1.mylabs.com:50070/imagetransfer?
getimage=1&txid=1&storageInfo=-63:874026095:0:CID-b9a78825-a81b-4a23-931b-da9e2b7208e9
17/06/18 13:46:06 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000
milliseconds 17/06/18 13:46:07 INFO namenode.TransferFsImage: Transfer took 0.00s at 0.00 KB/s
17/06/18 13:46:07 INFO namenode.TransferFsImage: Downloaded file
fsimage.ckpt_0000000000000000001 size 353 bytes.
17/06/18 13:46:07 INFO util.ExitUtil: Exiting with status 0

Step 14: Now start the NameNode service in dba2.mylabs.com and analytics2.mylabs.com. Also start the zkfc service in dba1.mylabs.com, dba2.mylabs.com, analytics1.mylabs.com and analytics2.mylabs.com manually using the following commands:
On dba1.mylabs.com
hadoop-daemon.sh start zkfc
On dba2.mylabs.com
hadoop-daemon.sh start namenode
hadoop-daemon.sh start zkfc
On analytics1.mylabs.com
hadoop-daemon.sh start zkfc
On analytics2.mylabs.com
hadoop-daemon.sh start namenode
hadoop-daemon.sh start zkfc

Step 15: You can check the WebUI of each namenode to check the active and
standby namenode of each department.
http://dba1.mylabs.com:50070
http://dba2.mylabs.com:50070
http://analytics1.mylabs.com:50070
http://analytics2.mylabs.com:50070
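You can also verify the federated view from the command line. With the viewfs mount table defined in core-site.xml, paths under /dba and /analytics resolve to their respective nameservices (the directory names below are only examples):
hdfs dfs -ls /
hdfs dfs -mkdir /dba/dbadata
hdfs dfs -mkdir /analytics/reports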
Performing HDFS Rolling Upgrade from Hadoop-2.7.3 to
Hadoop-2.8.1
Lab 34 Performing Rolling Upgrade from Hadoop-2.7.3 to Hadoop-2.8.1 in an Existing 4-node HDFS and YARN RM HA-Enabled Cluster
Problem Statement: Upgrade your existing hadoop-2.7.3 cluster into
hadoop-2.8.1 cluster with the following specification.

Node Hostname        Existing Cluster (hadoop-2.7.3)    After Upgrade (hadoop-2.8.1)
node1.mylabs.com     NameNode(active),                  NameNode(active),
                     DFSZKFailoverController,           DFSZKFailoverController,
                     QuorumPeerMain, JournalNode        QuorumPeerMain, JournalNode
node2.mylabs.com     NameNode(standby),                 NameNode(standby),
                     DFSZKFailoverController,           DFSZKFailoverController,
                     QuorumPeerMain, JournalNode        QuorumPeerMain, JournalNode
node3.mylabs.com     DataNode, QuorumPeerMain,          DataNode, QuorumPeerMain,
                     JournalNode                        JournalNode
node4.mylabs.com     DataNode                           DataNode
Step 1: Perform the following command in node1.mylabs.com and node2.mylabs.com (the systems holding the active and standby NameNode):
hdfs dfsadmin -rollingUpgrade prepare
This command creates an fsimage for rollback and prepares the NameNode systems for the rolling upgrade. You can verify the status of the rolling upgrade either in the WebUI of the NameNode or using the following command:
hdfs dfsadmin -rollingUpgrade query
Step 2: Once the NameNode systems are enabled for rolling upgrade, stop all HDFS-associated services on the standby NameNode and perform the Hadoop upgrade there. In this example, I am assuming node2.mylabs.com holds the standby NameNode service, so the following commands will be performed on node2.mylabs.com:
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop zkfc
hadoop-daemon.sh stop journalnode
Now install hadoop-2.8.1 in node2.mylabs.com as explained below:

a. Download and extract the hadoop-2.8.1 tar file


b. Copy the following configuration files from hadoop2 folder to hadoop-
2.8.1 folder
a. hadoop-env.sh
b. yarn-env.sh
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
f. yarn-site.xml
c. Rename the hadoop2 folder as hadoop2_2.7.3
d. Rename the hadoop-2.8.1 folder as hadoop2. This will ensure the
environment variables for HADOOP_HOME remains unaffected.

Step 3: Start the NameNode service with the rolling upgrade option in node2.mylabs.com. Ideally, the NameNode will start in standby mode.
hadoop version (Ensure it's 2.8.1)
hadoop-daemon.sh start namenode -rollingUpgrade started
hadoop-daemon.sh start zkfc
hadoop-daemon.sh start journalnode
Here you have successfully upgraded hadoop in node2.mylabs.com. You can
verify the same by checking the WebUI for version.
Step 4: Now transition the state of Namenode from standby to active such that
node2.mylabs.com will hold active namenode service and node1.mylabs.com
will hold the standby NameNode service. Perform the following command:
hdfs haadmin -failover nn1 nn2
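You can confirm that the failover took effect by querying the state of both NameNodes; nn1 should now report standby and nn2 active:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2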
Step 5: Stop Namenode and other HDFS-related services in node1.mylabs.com
and perform installation of hadoop-2.8.1.
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop zkfc
hadoop-daemon.sh stop journalnode
Now install hadoop-2.8.1 in node1.mylabs.com as explained below:

a. Download and extract the hadoop-2.8.1 tar file


b. Copy the following configuration files from hadoop2 folder to hadoop-
2.8.1 folder
a. hadoop-env.sh
b. yarn-env.sh
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
f. yarn-site.xml
c. Rename the hadoop2 folder as hadoop2_2.7.3
d. Rename the hadoop-2.8.1 folder as hadoop2. This will ensure the
environment variables for HADOOP_HOME remains unaffected.

Step 6: Start the NameNode service with the rolling upgrade option in node1.mylabs.com. Ideally, the NameNode will start in standby mode.
hadoop version (Ensure it's 2.8.1)
hadoop-daemon.sh start namenode -rollingUpgrade started
hadoop-daemon.sh start zkfc
hadoop-daemon.sh start journalnode
Here you have successfully upgraded hadoop in node1.mylabs.com. You can
verify the same by checking the WebUI for version.
Step 7: Now let's upgrade the DataNodes. You need to upgrade the DataNodes one by one. Perform the following commands in node3.mylabs.com:
hdfs dfsadmin -shutdownDatanode node3.mylabs.com:50020 upgrade
hadoop-daemon.sh stop journalnode
Now install hadoop-2.8.1 in node3.mylabs.com as explained below:

a. Download and extract the hadoop-2.8.1 tar file


b. Copy the following configuration files from hadoop2 folder to hadoop-
2.8.1 folder
a. hadoop-env.sh
b. yarn-env.sh
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
f. yarn-site.xml
c. Rename the hadoop2 folder as hadoop2_2.7.3
d. Rename the hadoop-2.8.1 folder as hadoop2. This will ensure the
environment variables for HADOOP_HOME remains unaffected.

Once you complete the above steps, start the DataNode and JournalNode services in node3.mylabs.com:
hadoop version (Ensure it's 2.8.1)
hadoop-daemon.sh start journalnode
hadoop-daemon.sh start datanode
Step 8: Now upgrade the second DataNode. Perform the following command in node4.mylabs.com:
hdfs dfsadmin -shutdownDatanode node4.mylabs.com:50020 upgrade
Now install hadoop-2.8.1 in node4.mylabs.com as explained below:

a. Download and extract the hadoop-2.8.1 tar file


b. Copy the following configuration files from hadoop2 folder to hadoop-
2.8.1 folder
a. hadoop-env.sh
b. yarn-env.sh
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
f. yarn-site.xml
c. Rename the hadoop2 folder as hadoop2_2.7.3
d. Rename the hadoop-2.8.1 folder as hadoop2. This will ensure the
environment variables for HADOOP_HOME remains unaffected.

Once you complete the above steps, start the DataNode service in node4.mylabs.com:
hadoop version (Ensure it's 2.8.1)
hadoop-daemon.sh start datanode
Step 9: Now that we have upgraded each machine successfully and all services and HDFS operations are live and active, you can finalize the upgrade. To do so, perform the following command in node1.mylabs.com:
hdfs dfsadmin -rollingUpgrade finalize
Thus you have now successfully performed a rolling upgrade!
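As a final sanity check, you can confirm the new version on each node and verify that HDFS reports all DataNodes as live:
hadoop version
hdfs dfsadmin -report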
Summary

The adoption of Apache Hadoop increased when the HA feature was introduced, since checkpointing was not considered to be an optimal solution.
The most adopted method to implement HA is by using
QuorumJournalManager.
It is essential to manually start the ResourceManager on both nodes.
Fencing is all about establishing the certainty of NameNode service in the
cluster.
There exist two fencing techniques viz. SSHFENCE and SHELL fence.
When it comes to production, the most preferred fencing technique used
is the SHELL fence technique.
Chapter 8
Apache Hive Admin Basics

In this chapter, we will learn:

What is Apache Hive


Understanding Hive Components
Installing and Configuring Hive with MySQL as Metastore
HiveServer2 and beeline
Authentication for beeline clients using PAM
Creating Hive Credential Store

What is Apache Hive


Apache Hive is a data warehousing solution used on top of the Hadoop cluster
that uses HQL as a language to perform queries and operations on the data stored
in HDFS. HQL i.e. Hive Query Language is a query language that is derived
from SQL. However, not all SQL queries are supported in Hive. Hive was
created by Facebook with the intent of working on Hadoop using something like
a traditional data warehousing solution using SQL.

Understanding Hive Components


When it comes to Apache Hive, it typically comprises three components,
namely:

a. Metastore Server
b. Storage Layer
c. Processing Engine

Metastore Server in Apache Hive


A Metastore server can be any JDBC-compliant database server. The role of the Metastore server is to maintain the metadata of all Hive objects created by the system or the user. It is the Metastore that helps Apache Hive integrate with multiple processing engines like Spark and Tez: the Metastore tells each processing engine where the data resides, and it acts as the data dictionary for Hive objects.
Storage Part in Hive
The storage part of Hive is HDFS. Hive internally uses HDFS for all its storage-related operations. The default warehouse location of Hive in HDFS is the /user/hive/warehouse directory. This default location is configurable.
Processing Part in Hive
The processing part of Apache Hive is responsible for performing all processing-related operations. Unlike Impala or HAWQ, Apache Hive does not bundle any in-built processing engine of its own; by default it uses MapReduce on top of YARN (in the case of Generation-2 Hadoop). However, we can additionally configure other processing engines like Apache Tez and Apache Spark.
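For example, assuming Tez has been installed and configured on the cluster, switching the execution engine for a session is a single property change from the Hive shell (MapReduce remains the default in the setup used in this book):
set hive.execution.engine=tez;
set hive.execution.engine=mr;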

Installing Apache Hive with MySQL Metastore Database

We now have an idea about the components of Hive. Let us perform the installation and configuration of Apache Hive in a single node cluster.
Lab 35 Setting up Apache Hive with MySQL Database as a Metastore Server

Problem Statement1: Apache Hive Installation. In this example, we will use MySQL as the Metastore server.

Solution: Even though we are writing steps for a single node cluster, we can
perform the same even in a Multinode cluster by installing Hive either in
NameNode system or the Edge server of your cluster.

Installation Steps
Step 1: Ensure Hadoop is installed and configured.
Step 2: Install MySQL in Ubuntu.
sudo apt-get install mysql-server
During the installation process, you may be asked to set the password for the root user. For usability, let us assume that the password set is '123456' since we will be using it in hive-site.xml later.
Step 3: Download and extract the Apache Hive binaries
wget http://www-eu.apache.org/dist/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
tar -xvzf apache-hive-1.2.2-bin.tar.gz
Step 4: Rename the extracted folder (for convenience purposes only).
mv apache-hive-1.2.2-bin hive
Step 5: Inform the system about Hive by setting all the necessary environment variables.
vi /home/hadoop/.bashrc
export HIVE_PREFIX=/home/hadoop/hive
export PATH=$PATH:$HIVE_PREFIX/bin
export PATH=$PATH:$HIVE_PREFIX/conf
Now Save it!!!
Step 6: Update the bash and, if required, change the ownership of the Hive folder.
exec bash
Step 7: Configure hive-env.sh.
vi /home/hadoop/hive/conf/hive-env.sh
export HADOOP_HOME=/home/hadoop/hadoop2
export HIVE_CONF_DIR=/home/hadoop/hive/conf
Step 8: Configure hive-site.xml.
vi /home/hadoop/hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoopvm:3306/hivemetadb?createDatabaseIfNotExist=true</value>
<!-- Here hadoopvm is the hostname of the system where MySQL is installed -->
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value> </property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<!-- Here root is the username of MySQL Server --> </property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<!-- Here 123456 is the password of MySQL Server -->
</property>
</configuration>
Step 9: Configure the Metastore server to accept remote connections from any
host. This step is one of the most important steps, especially when configuring
Hive in a MultiNode cluster or in an Edge server.

a. Stop MySQL services


sudo service mysql stop
b. Edit the configuration file of MySQL
sudo vi /etc/mysql/my.cnf
Look for bind-address. The default configuration is bind-address = 127.0.0.1. Change it to the hostname of the machine where the MySQL server is installed (hadoopvm in this case). When doing this by yourself, please put the correct hostname and save the file.
c. Start MySQL service
sudo service mysql start
d. Start the MySQL client using the following command
mysql -u root -p123456

e. Grant privileges to the root user

mysql> GRANT ALL PRIVILEGES ON *.* TO root@'hadoopvm' IDENTIFIED BY '123456';
Query OK, 0 rows affected (0.00 sec)
mysql> exit
Bye
Step 10: Download and Install MySQL Java Driver in Apache Hive.
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.42.tar.gz
tar -xvzf mysql-connector-java-5.1.42.tar.gz
cp mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar /home/hadoop/hive/lib/.
Step 11: Create a tmp directory in HDFS to maintain the staging data of Apache
Hive.
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod 777 /tmp/hive
Step 12: Start the Hadoop Service now if you have not started it earlier.
Step 13: Start Hive Shell
hive
If you get the hive prompt, type the following commands to test the Hive installation:
show databases;
This must show a default database named 'default'.
create database bigdataclassmumbai;
hive> show databases;
OK
default
Time taken: 0.028 seconds, Fetched: 1 row(s)
hive> create database bigdataclassmumbai;
OK
Time taken: 0.096 seconds
hive> show databases;
OK
bigdataclassmumbai
default
Time taken: 0.014 seconds, Fetched: 2 row(s)
This will now create a new database in Hive. You can check whether it is successfully created or not using the 'show databases' command. If everything is fully functional, we can say that Hive has been installed successfully.
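Optionally, you can confirm that the Metastore schema was created in MySQL. Assuming the database name hivemetadb from hive-site.xml, the following should list Metastore tables such as DBS and TBLS:
mysql -u root -p123456 -e "use hivemetadb; show tables;"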
Understanding HiveServer2 Service
In the above steps, we used a Hive shell i.e. Hive CLI. It is a command-line
interface that is used to perform queries about the data present in HDFS.
However, one of biggest drawbacks of this tool is that there exists no
concurrency support. Multiple simultaneous user connections are not possible in
Hive CLI. Ideally, Hive CLI interacts with HiveServer1 where HiveCLI is
responsible for accepting HQL queries from user and HiveServer1 is responsible
for monitoring and compiling MR jobs that are triggered by Hive.
With Hive 0.11, the Hive developers introduced HiveServer2, which supports multi-client concurrency and authentication. HiveServer2 uses Beeline as a client tool.
Beeline interacts with HiveServer2 to perform all processing, concurrency,
authentication, and authorization related requests. The block diagram below
gives a brief glance of a HiveServer2 operation.

Hive Server2 internally creates different Drivers for each session that then
converses with the Metastore to fetch the metadata. HiveServer2 allows you to
interact with Hive using JDBC, ODBC, Beeline, and so on. Let us now use
HiveServer2 in our setup.

Connecting Beeline to HiveServer2


Lab 36 Connecting Beeline Client to HiveServer2

Problem Statement2: Using HiveServer2 and Beeline to interact with Hive.


Solution
Step 1: Ensure Hive is installed and Hadoop services are running.
Step 2: To Start HiveServer2 type the following command,
hiveserver2 &
Please make a special note of the process id for the same. You will need the
process id to kill/terminate the process. You will specifically need this id for our
next problem statement.
Step 3: Now ensure hiveserver2 is listening on port 10000 (default port)

Step 4: Let us now start the Beeline client tool.


beeline
Beeline version 1.6.3 by Apache Hive beeline>
Step 5: Now let us connect Beeline to HiveServer2. It will prompt for username
and password. At this juncture, do not type anything. Just press enter till you get
a prompt that is connected to 10000 as shown below: !connect
jdbc:hive2://hadoopvm:10000
Connecting to jdbc:hive2://hadoopvm:10000
Enter username for jdbc:hive2://hadoopvm:10000: Enter
password for jdbc:hive2://hadoopvm:10000: 17/06/25 12:35:38
INFO jdbc.Utils: Supplied authorities: hadoopvm:10000
17/06/25 12:35:38 INFO jdbc.Utils: Resolved authority:
hadoopvm:10000
17/06/25 12:35:38 INFO jdbc.HiveConnection: Will try to open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
Connected to: Apache Hive (version 2.1.1) Driver: Spark Project
Core (version 1.6.3) Transaction isolation:
TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoopvm:10000> Step 6: Let us check whether the
database that we created after installation of Apache Hive still exists and if
it is accessible or not.
show databases;
OK

2 rows selected (1.448 seconds)
So, you now know how to interact with Hive using HiveServer2 and Beeline. If you wish to exit Beeline, the command to be typed is '!q'. For background, refer to JIRA https://issues.apache.org/jira/browse/HIVE-4557.
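As an alternative to the interactive !connect step, the same connection can be opened directly from the command line using Beeline's -u option:
beeline -u jdbc:hive2://hadoopvm:10000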

Securing Beeline Access to HiveServer2


One of the advancements of HiveServer2 is support for multiple authentication mechanisms. HiveServer2 currently supports PAM, Kerberos, LDAP and Custom authentication mechanisms. By default, authentication is disabled (NONE). The following are the parameters responsible for enabling each authentication mechanism.
PAM (Pluggable Authentication Module)
Property Name                              Property Value   Explanation
hive.server2.authentication                PAM              This parameter is used to set PAM as your authentication mechanism.
hive.server2.authentication.pam.services   login,sshd       This sets the list of PAM services that can be used. Ensure the corresponding entries exist under /etc/pam.d.

Kerberos
Property Name                                    Property Value
hive.server2.authentication                      KERBEROS
hive.server2.authentication.kerberos.principal   hive/_HOST@YOUR_REALM.COM
hive.server2.authentication.kerberos.keytab      /home/hadoop/hive.keytab

LDAP
Property Name                              Property Value                    Explanation
hive.server2.authentication                LDAP                              This parameter is used to set LDAP as your authentication mechanism.
hive.server2.authentication.ldap.url       LDAP_URL                          This sets the LDAP server URL.
hive.server2.authentication.ldap.Domain    ActiveDirectory_Domain_address    This is used if you use Active Directory as your LDAP authentication mechanism.
hive.server2.authentication.ldap.baseDN    OpenLDAP_baseDN                   This is used when you implement LDAP using OpenLDAP.

In this book we will limit our implementation to PAM.


Lab 37 Configuring HiveServer2 to Secure Beeline Client Access

Problem Statement: In the above example, we have seen that we can access Hive without any credentials. This is not a recommended setup for production. Let us set up HiveServer2 authentication and enable the same.

Solution
Step 1: Stop HiveServer2. To do so, kill the process
kill -9 <processed_hiveserver2>
e.g. kill -9 3659
Step 2: Setup hive-site.xml to enable PAM. Append the following properties to the existing hive-site.xml inside the <configuration> tag.
vi /home/hadoop/hive/conf/hive-site.xml
<property>
<name>hive.server2.authentication</name>
<value>PAM</value>
</property>
<property>
<name>hive.server2.authentication.pam.services</name>
<value>login,sshd</value>
</property>
Step 3: Configure JPAM. This is one of the most important steps. Failure to do
this will result in errors during the authentication. Download JPAM 64bit. If
your machine is 32bit then use 32bit JPAM package.
Step 4: Extract the package JPam-Linux_amd64-1.1.tgz
tar -xvf JPam-Linux_amd64-1.1.tgz
Step 5: Copy libjpam.so into the lib folder of Hadoop. Technically, we must copy it to whatever java.library.path points to. You can figure that out with the following command:
ps -ef | grep 10096
hadoop 10096 4466 13 21:42 pts/2 00:00:07 /usr/lib/jvm/java-7-
oracle//bin/java -Xmx256m -
Djava.library.path=/home/hadoop/hadoop2/lib -
Djava.net.preferIPv4Stack=true - …
where 10096 is the process id of HiveServer2
In the above output we figured out the location of java.library.path is
/home/hadoop/hadoop2/lib. Copy libjpam.so in that location.
Step 6: Start HiveServer2 service
hiveserver2 &
Step 7: Open Beeline and try connecting it without entering credentials.
!connect jdbc:hive2://hadoopvm:10000
Connecting to jdbc:hive2://hadoopvm:10000
Enter username for jdbc:hive2://hadoopvm:10000: Enter
password for jdbc:hive2://hadoopvm:10000: 17/06/29 21:48:02
INFO jdbc.Utils: Supplied authorities: hadoopvm:10000
17/06/29 21:48:02 INFO jdbc.Utils: Resolved authority:
hadoopvm:10000
17/06/29 21:48:02 INFO jdbc.HiveConnection: Will try to open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
17/06/29 21:48:06 INFO jdbc.HiveConnection: Could not open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
17/06/29 21:48:06 INFO jdbc.HiveConnection: Transport Used for
JDBC connection: null Error: Could not open client transport
with JDBC Uri: jdbc:hive2://hadoopvm:10000: Peer indicated
failure: Error validating the login (state=08S01,code=0) If you can
view the above error, “Error validating the login”, it means that your setup
is successful and the connection is now secure. Let us now try to enter the
OS username and its password which in our case is ‘hadoop’ and ‘123456’.
!connect jdbc:hive2://hadoopvm:10000
Connecting to jdbc:hive2://hadoopvm:10000
Enter username for jdbc:hive2://hadoopvm:10000: hadoop Enter
password for jdbc:hive2://hadoopvm:10000: ******
17/06/29 21:48:15 INFO jdbc.Utils: Supplied authorities:
hadoopvm:10000
17/06/29 21:48:15 INFO jdbc.Utils: Resolved authority:
hadoopvm:10000
17/06/29 21:48:15 INFO jdbc.HiveConnection: Will try to open
client transport with JDBC Uri: jdbc:hive2://hadoopvm:10000
Connected to: Apache Hive (version 1.2.2) Driver: Spark Project
Core (version 1.6.3) Transaction isolation:
TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://hadoopvm:10000>
You will see that this time, the authentication is successful and you can now
access Beeline to perform HQL operations.
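The same authenticated connection can also be opened non-interactively by supplying the credentials on the command line (the username and password here are the OS credentials used above):
beeline -u jdbc:hive2://hadoopvm:10000 -n hadoop -p 123456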

Implementing Access Control Using the Hive Credential Store

In this section, we will see how to configure the Hive credential store and achieve access control.
Lab 38 Configuring Hive Credential Store

Setup the following parameters in hive-site.xml.


<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.users.in.admin.role</name>
<value>root</value>
</property>
<property>
<name>hive.security.metastore.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlst
</property>
Create another configuration file named hiveserver2-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
<name>hive.security.authenticator.manager</name>
<value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>
<property>
<name>hive.conf.restricted.list</name>
<value>hive.security.authorization.enabled,hive.security.authorizatio
</property>
</configuration>
Now restart Hive and perform the following:
0: jdbc:hive2://hadoopvm:10000> set role ADMIN;
OK
No rows affected (0.302 seconds)
0: jdbc:hive2://hadoopvm:10000> show roles;
OK
0: jdbc:hive2://hadoopvm:10000> create role my_admin;
OK
No rows affected (0.073 seconds)
0: jdbc:hive2://hadoopvm:10000> show roles;
OK
0: jdbc:hive2://hadoopvm:10000> grant my_admin to user hadoop with admin option;
OK
No rows affected (0.047 seconds)
0: jdbc:hive2://hadoopvm:10000> show role grant user hadoop;
OK
0: jdbc:hive2://hadoopvm:10000> show principals my_admin;
OK
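Once the role exists, object privileges can be granted to it and they apply to every user who holds the role. A minimal illustration, assuming a table named employees already exists in the current database:
0: jdbc:hive2://hadoopvm:10000> grant select on table employees to role my_admin;
0: jdbc:hive2://hadoopvm:10000> show grant role my_admin on table employees;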


Summary

Apache Hive is a data warehousing solution used on top of Hadoop


cluster that uses HQL to perform queries and operations over the existing
data.
The role of the Metastore server is to maintain all the metadata
information of all the Hive objects created by the system or the user.
HiveServer2 supports multi-client concurrency and authentication.
HiveServer2 currently supports PAM, Kerberos, LDAP and Custom
authentication mechanisms. By default, the authentication is disabled
(NONE).
Chapter 9
Apache HBase Admin Basics

In this chapter, we will learn:

What is HBase.
HBase v/s HDFS.
HBase v/s RDBMS.
HBase Architecture, Daemons and Components.
Installing SingleNode HBase cluster.
Installing Single-Master MultiNode HBase cluster.
Installing Multiple-Master MultiNode HBase cluster.
Introducing HBase Shell.
Common HBase Admin Commands.
Hive-HBase Integration.
Bulk Loading data in HBase using Hive.

Introducing Apache HBase


Apache HBase is one of the most crucial eco-system components due to its great
features. Apache HBase is a NoSQL compliant database that follows the CP
model of the CAP Theorem. CP represents the fact that this tool guarantees
Consistency and Partition-tolerance over Availability.
Apache HBase is an open-source, distributed, versioned, non-relational
database that offers random, real time, read/write access to data stored in HDFS
and can be used to host very large tables in commodity hardware.
Some of the features of HBase include:

Scalable Architectural Components.


Supports fully consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with
Apache HBase tables.
Block cache and Bloom Filters for real-time queries.
Comparing HBase with HDFS
HDFS (Hadoop Distributed File System)                         HBase
It follows the write-once, read-many concept.                 It follows a random read/write approach with versioning of each write.
HDFS depends on local filesystems like ext4 and other         HBase relies on HDFS for its storage.
Hadoop-compatible filesystems for its storage.
HDFS does not support updates.                                HBase adds transactional capability to HDFS that allows updates based on versioning.
HDFS is a filesystem.                                         HBase is a NoSQL database.
HDFS stores the data in the form of blocks.                   HBase stores data in table, row, column form (a column-oriented distributed data store).

Comparing HBase with RDBMS


HBase                                                         RDBMS
HBase follows a flexible data model; columns are created      RDBMS requires the data model to be designed and
dynamically whenever the data is inserted.                    implemented upfront, before inserting the data.
HBase is not ACID compliant. Refer to                         RDBMS is ACID compliant.
https://hbase.apache.org/acid-semantics.html for details.
HBase is designed for web scale.                              RDBMS is not designed for web scale.
HBase, by default, takes care of sharding.                    In RDBMS, we are forced to do sharding when we scale our DB.

HBase Architecture, Daemons and Components


Let us first understand the daemons that are associated with HBase. There exist 3
daemons. They are:

HBase Master – It is also called the HMaster. It is responsible for


Region Assignment, Region Re-assignment, DDL operations like
create, delete, and so on.
HBase RegionServer – It is also called the HRegionServer. It is
responsible for serving data reads and writes.
Zookeeper – It is also called the QuorumPeerMain. It is responsible for
maintaining the live cluster state.

The diagram below depicts a typical HA-enabled HBase cluster:

HBase is dependent on HDFS services. NameNode service is responsible for


maintaining metadata of the data that is stored in the HDFS. HBase uses HDFS
to find and create blocks wherever applicable based on the operations. DataNode
service is responsible for maintaining the actual physical blocks that are
managed by the RegionServer.
The HBase RegionServer maintains a set of Regions. Ideally, HBase tables are
divided horizontally by row keys called ‘Regions’. A region holds all rows of the
given table between the start and end keys of the region. Whenever a table is
created, a region is created in one DataNode. As the data increases, it gets
sharded automatically resulting in maintenance of multiple regions in the cluster
for a single table. Ideally, a RegionServer can serve up to 1000 regions.
HBase HMaster is responsible for coordinating with the RegionServers in
terms of assigning regions on startup, re-assigning regions for recovery, and load
balancing as well as monitoring RegionServers.
Zookeeper, on the other hand, is responsible for maintaining the state of the
HBase cluster. When it comes to HBase client applications, Zookeeper ensemble
is the point of contact for applications. Zookeeper maintains a record of which
nodes in the HBase cluster are alive. Zookeeper also maintains the location of
the META table. The HBase META table is a table that maintains a list of all
regions in the system. It is structured like a B-Tree that maintains key-value pair information, where the key is the region start key plus region id, and the value is the RegionServer name/IP.
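If you are curious, the META table can be inspected from the HBase shell once one of the clusters from the labs below is running (HBase 1.x exposes it as hbase:meta):
scan 'hbase:meta'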

Installing and Configuring Apache HBase in Single Node Cluster


In this section, we will see how to install Apache HBase in a single
node cluster using Embedded Zookeeper. HBase stands for
Hadoop DataBase. It provides a realtime read/write layer on top of HDFS: we can perform realtime inserts, deletes, and updates in HBase, and expose the data through a REST API. Let us see how to install it.
Lab 39 Installing and Configuring Apache HBase in Single Node Cluster
Step 1: Ensure Hadoop is installed in a single node cluster. Stop
Hadoop services if running. You can refer to Chapter2 of this book to
learn how to install Apache Hadoop in Single Node cluster.

Step 2: Download and extract the Apache HBase binaries.


wget http://www-us.apache.org/dist/hbase/1.3.1/hbase-1.3.1-bin.tar.gz
tar -xvzf hbase-1.3.1-bin.tar.gz
Step 3: Rename the extracted folder (for convenience purposes)
mv hbase-1.3.1 hbase
Step 4: Inform the system about HBase. This can be done by setting up environment variables in the bashrc file. Append the following lines:
vi /home/hadoop/.bashrc
export HBASE_PREFIX=/home/hadoop/hbase
export PATH=$PATH:$HBASE_PREFIX/bin
export HBASE_HOME=$HBASE_PREFIX
export HBASE_LOGS=$HBASE_HOME/logs
Save it !
Step 5: Update the bash
exec bash
Step 6: Configure hbase-env.sh
vi /home/hadoop/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HBASE_PID_DIR=/home/hadoop/hbase/pids
export HBASE_MANAGES_ZK=true
Step 7: Once set, create the pid directory required as per the above configuration.
mkdir -p /home/hadoop/hbase/pids
Step 8: Configure hbase-site.xml
vi /home/hadoop/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoopvm:8020/hbase</value> </property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://hadoopvm:8020/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoopvm</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
Step 9: Configure RegionServers
vi /home/hadoop/hbase/conf/regionservers
hadoopvm
Step 10: Start Hadoop services
Step 11: Start HBase services
start-hbase.sh
hadoopvm: starting zookeeper, logging to /home/hadoop/hbase/logs/hbase-hadoop-zookeeper-hadoopvm.out
starting master, logging to /home/hadoop/hbase/logs/hbase-hadoop-master-hadoopvm.out
hadoopvm: starting regionserver, logging to /home/hadoop/hbase/logs/hbase-hadoop-regionserver-hadoopvm.out
The above command must start Zookeeper, HMaster, and HRegionServer.
Step 12: Ensure services are live and active.
jps
1407 NameNode
1578 DataNode
25798 HRegionServer
1681 ResourceManager
25543 HQuorumPeer
26167 Jps
25635 HMaster
2013 NodeManager
Step 13: If the services are alive, it means the installation is successful. Check
the WebUI http://192.168.149.137:16010
where,
192.168.149.137 is the IP address of the machine in which
HMaster service is running.
16010 is the default WebUI port number.
If you get the webpage, it means all services are working normally. Now you know how to configure HBase with embedded Zookeeper.
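A quick way to confirm the same from the command line is the status command inside the HBase shell, which summarizes the live RegionServers in the cluster:
hbase shell
status 'simple'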

Installing and Configuring Apache HBase Single Master in Multinode Cluster

In this section, we will see how to install Apache HBase in a MultiNode cluster using a Zookeeper Ensemble.
Lab 40 Installing and Configuring Apache HBase Single HMaster Multinode Cluster

Following is the kind of setup I am going to implement.
Even though we are doing this setup in a HA-enabled cluster, you can perform
the steps in a non-HA enabled cluster also.
Solution
Step 1: Stop Hadoop services if running.
stop-all.sh
Step 2: Download and extract the Apache HBase binaries in node1.mylabs.com.
wget http://www-us.apache.org/dist/hbase/1.3.1/hbase-1.3.1-
bin.tar.gz
tar -xvzf hbase-1.3.1-bin.tar.gz Step 3: Rename the extracted folder (for
convenience purposes) and copy this folder in all 4 systems.
mv hbase-1.3.1 hbase
scp -r /home/hadoop/hbase
hadoop@node2.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/hbase
hadoop@node3.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/hbase
hadoop@node4.mylabs.com:/home/hadoop/.
Step 4: Inform the system about HBase in all 4 systems. This can be done by
setting up environment variables in bashrc file. Append the following lines: vi
/home/hadoop/.bashrc
export HBASE_PREFIX=/home/hadoop/hbase export
PATH=$PATH:$HBASE_PREFIX/bin
export HBASE_HOME=$HBASE_PREFIX
export HBASE_LOGS=$HBASE_HOME/logs Now Save it !!!
scp -r /home/hadoop/.bashrc
hadoop@node2.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/.bashrc
hadoop@node3.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/.bashrc
hadoop@node4.mylabs.com:/home/hadoop/.
Step 5: Update the bash in all 4 systems.
exec bash
Step 6: Configure hbase-env.sh in all 4 systems
vi /home/hadoop/hbase/conf/hbase-env.sh export
JAVA_HOME=/usr/lib/jvm/java-7-oracle export
HBASE_PID_DIR=/home/hadoop/hbase/pids export
HBASE_MANAGES_ZK=false
Now Save it !!!
scp -r /home/hadoop/hbase/conf/hbase-env.sh hadoop@node2.mylabs.com:/home/hadoop/hbase/conf/hbase-env.sh
scp -r /home/hadoop/hbase/conf/hbase-env.sh hadoop@node3.mylabs.com:/home/hadoop/hbase/conf/hbase-env.sh
scp -r /home/hadoop/hbase/conf/hbase-env.sh hadoop@node4.mylabs.com:/home/hadoop/hbase/conf/hbase-env.sh
Step 7: Once set, create the pid directory required as per the above configuration in all 4 systems.
mkdir -p /home/hadoop/hbase/pids
Step 8: Configure hbase-site.xml with the following properties in the HMaster system. In my case, it is node1.mylabs.com.
vi /home/hadoop/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://cognitocluster/hbase</value> </property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://hadoopvm:8020/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node1.mylabs.com,node2.mylabs.com,node3.mylabs.com</va
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
Step 9: Configure hbase-site.xml with the following properties in
HRegionServer systems. In my case, it is node3.mylabs.com &
node4.mylabs.com.
vi /home/hadoop/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name> <value>
hdfs://cognitocluster/hbase</value> </property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
Step 10: Configure RegionServers
vi /home/hadoop/hbase/conf/regionservers
node3.mylabs.com
node4.mylabs.com
Step 11: Start Zookeeper Service in node1.mylabs.com, node2.mylabs.com, and
node3.mylabs.com. We will assume that you are also working on a HA enabled
cluster and Zookeeper is installed. In case you are working on a simple
MultiNode cluster, please install Zookeeper on node1.mylabs.com,
node2.mylabs.com, and node3.mylabs.com as per the instruction given in the
Zookeeper chapter.
cd zookeeper-3.4.6
bin/zkServer.sh start
bin/zkServer.sh status
Step 12: Start Hadoop services
start-all.sh
Step 13: Start HBase services
start-hbase.sh
The above command must start the HMaster and a set of HRegionServer services in node3 and node4.
Step 14: If the services are alive, it means the installation is successful. Check
the WebUI http://node1.mylabs.com:16010
where,
16010 is the default WebUI port number.
If you get the webpage and see all RegionServers joined, it means all services are working normally.

Installing and Configuring Apache HBase Multi Master in MultiNode Cluster

In this section, we will see how to install Apache HBase multi-master in a MultiNode cluster using a Zookeeper Ensemble.
Lab 41 Installing and Configuring Apache HBase Multiple HMaster for HA in a Multinode Cluster

Following is the kind of setup we are going to implement.
When it comes to HBase Multi-master configuration, it is recommended to
maintain such configuration in a HA enabled cluster.
Solution
Step 1: Stop Hadoop services if running.
stop-all.sh
Step 2: Download and extract the Apache HBase binaries in
node1.mylabs.com.
wget http://www-us.apache.org/dist/hbase/1.3.1/hbase-1.3.1-
bin.tar.gz
tar -xvzf hbase-1.3.1-bin.tar.gz Step 3: Rename the extracted folder (for
convenience purposes) and copy this folder in all 4 systems.
mv hbase-1.3.1 hbase
scp -r /home/hadoop/hbase
hadoop@node2.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/hbase
hadoop@node3.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/hbase
hadoop@node4.mylabs.com:/home/hadoop/.
Step 4: Inform the system about HBase in all 4 systems. This can be done by
setting up environment variables in bashrc file. Append the following lines: vi
/home/hadoop/.bashrc
export HBASE_PREFIX=/home/hadoop/hbase export
PATH=$PATH:$HBASE_PREFIX/bin
export HBASE_HOME=$HBASE_PREFIX
export HBASE_LOGS=$HBASE_HOME/logs Now Save it !!!
scp -r /home/hadoop/.bashrc
hadoop@node2.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/.bashrc
hadoop@node3.mylabs.com:/home/hadoop/.
scp -r /home/hadoop/.bashrc
hadoop@node4.mylabs.com:/home/hadoop/.
Step 5: Update the bash in all 4 systems.
exec bash
Step 6: Configure hbase-env.sh in all 4 systems
vi /home/hadoop/hbase/conf/hbase-env.sh export
JAVA_HOME=/usr/lib/jvm/java-7-oracle export
HBASE_PID_DIR=/home/hadoop/hbase/pids export
HBASE_MANAGES_ZK=false
Now Save it !!!
scp -r /home/hadoop/hbase/conf/hbase-env.sh hadoop@node2.mylabs.com:/home/hadoop/hbase/conf/hbase-env.sh
scp -r /home/hadoop/hbase/conf/hbase-env.sh hadoop@node3.mylabs.com:/home/hadoop/hbase/conf/hbase-env.sh
scp -r /home/hadoop/hbase/conf/hbase-env.sh hadoop@node4.mylabs.com:/home/hadoop/hbase/conf/hbase-env.sh
Step 7: Once set, create the pid directory required as per the above configuration in all 4 systems.
mkdir -p /home/hadoop/hbase/pids
Step 8: Configure hbase-site.xml with the following properties in the HMaster systems. In my case, these are node1.mylabs.com and node2.mylabs.com.
vi /home/hadoop/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://cognitocluster/hbase</value> </property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://hadoopvm:8020/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node1.mylabs.com,node2.mylabs.com,node3.mylabs.com</va
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>180000</value>
</property>
</configuration>
Step 9: Configure hbase-site.xml with the following properties in
HRegionServer systems. In my case, it is node3.mylabs.com &
node4.mylabs.com.
vi /home/hadoop/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://cognitocluster/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
Step 10: Configure the regionservers file in node1.mylabs.com and node2.mylabs.com.
vi /home/hadoop/hbase/conf/regionservers
node3.mylabs.com
node4.mylabs.com
Step 11: Configure Backup HMaster in node1.mylabs.com and
node2.mylabs.com.
vi /home/hadoop/hbase/conf/backup-masters
node2.mylabs.com
Step 12: Start Zookeeper Service in node1.mylabs.com, node2.mylabs.com, and
node3.mylabs.com. We will assume that you are also working on a HA enabled
cluster and Zookeeper is installed. In case you are working on a simple
MultiNode cluster, please install Zookeeper on node1.mylabs.com,
node2.mylabs.com, and node3.mylabs.com as per the instruction given in the
Zookeeper chapter.
cd zookeeper-3.4.6
bin/zkServer.sh start
bin/zkServer.sh status
Step 13: Start Hadoop services
start-all.sh
Step 14: Start the HBase Main Master service in node1.mylabs.com.
hbase-daemon.sh start master
Step 15: Start the HBase Backup Master service in node2.mylabs.com.
hbase-daemon.sh start master --backup
Step 16: Start the RegionServers from node1.mylabs.com.
hbase-daemons.sh start regionserver
The above command will start the HRegionServer services on node3 and node4.
Step 17: If the services are alive, it means the installation is successful. Check
the WebUI http://node1.mylabs.com:16010
where,
16010 is the default WebUI port number.
If you get the webpage and see all RegionServers joined and backup master
entry, it means all services are working normal.

Introducing HBase Shell
HBase Shell is a client application that can be used to interact with HBase. In
this section we will introduce some of the commands that an admin must know.
Lab 42 Working with HBase Shell

Let us start working with HBase using the HBase shell. To start the HBase shell, type the following command:
hbase shell
Type "exit<RETURN>" to leave the HBase Shell
Version 1.3.1, rUnknown, Mon Aug 4 23:58:06 PDT 2014
hbase(main):001:0>
Let us create a table named 'bigdataclassmumbai' with 'employee' column-family.
hbase(main):001:0> create 'bigdataclassmumbai','employee'
0 row(s) in 2.1980 seconds
=> Hbase::Table - bigdataclassmumbai
Let us now insert a record as per the illustration below:

empid  fname     lname  age  role
1      Prashant  Nair   30   Engineer
In the HBase data model, all data is stored in columns, which is why HBase is known as a columnar database. In the previous command, we created a table and a column-family; the column-family holds all the dynamic columns. In HBase, the row key is an essential entity: any field that contains unique values can act as the row key, much like a Primary Key field in an RDBMS. Let us see how to insert this record in HBase.
hbase(main):002:0> put 'bigdataclassmumbai',1,'employee:fname','Prashant'
0 row(s) in 0.1410 seconds
hbase(main):003:0> put 'bigdataclassmumbai',1,'employee:lname','Nair'
0 row(s) in 0.0050 seconds
hbase(main):004:0> put 'bigdataclassmumbai',1,'employee:age','30'
0 row(s) in 0.0040 seconds
hbase(main):005:0> put 'bigdataclassmumbai',1,'employee:role','Engineer'
0 row(s) in 0.0090 seconds
To list the contents of the table:
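A minimal example, assuming the 'bigdataclassmumbai' table created above (row output omitted), uses the scan command:
hbase(main):006:0> scan 'bigdataclassmumbai'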
To check the status of the cluster:
hbase(main):008:0> status
2 servers, 0 dead, 1.5000 average load

Common HBase Admin Commands
Lab 43 Performing Hive-HBase Integration for Data Interaction

HBase ideally provides CLI-based commands to perform the administration of HBase. Let us explore some of the commands:
1. Performing a file system check for HBase storage.
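In HBase 1.x this check is done with the hbck utility, run from the operating system shell rather than the HBase shell:
hbase hbck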
2. Dropping a column-family
hbase(main):001:0> create 'example1','cf1','cf2'
hbase(main):002:0> alter 'example1','delete' => 'cf2'
hbase(main):004:0> describe 'example1'
3. Disabling and Enabling a table
hbase(main):001:0> disable 'bigdataclassmumbai'
hbase(main):002:0> enable 'bigdataclassmumbai'
4. Truncating a table. In HBase, truncate drops and recreates the table.
hbase(main):004:0> truncate 'bigdataclassmumbai'
Truncating 'bigdataclassmumbai' table (it may take a while):
- Disabling table...
- Dropping table...
- Creating table...
0 row(s) in 1.6660 seconds
5. Counting number of rows in the table
hbase(main):005:0> count 'bigdataclassmumbai'
6. Performing Bulk Import in HBase. First, create the target table and column-family; the actual bulk load is covered later in this chapter using the importTsv tool.
hbase(main):006:0> create 'hr','emp'
0 row(s) in 0.1750 seconds
=> Hbase::Table - hr
Performing Hive-HBase Integration
Most people get extremely frustrated when it comes to working with the
traditional native HBase commands for data interaction with HBase. Don’t
worry – don’t stress! In this section, we will see how to perform Hive-HBase
integration.
Lab 44 Bulk Loading the Data in HBase Using Apache Hive

Step 1: Ensure Hive and HBase are installed. For this lab, we will describe the steps keeping in mind a Single Node cluster; however, the steps are the same in a MultiNode cluster.
Step 2: Ensure the Hadoop and HBase services are live and active.
Step 3: Start the HBase shell and create a table in HBase which we will integrate with Hive.
hbase shell
Version 1.3.1, r930b9a55528fe45d8edce7af42fef2d35e77677a, Thu Apr 6 19:36:54 PDT 2017
hbase(main):001:0> create 'hr_hbase','employee'
0 row(s) in 1.4930 seconds
=> Hbase::Table - hr_hbase
Step 4: Let us insert some data in the employee column family. For this scenario,
we will assume empid will be the row key.
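A minimal sketch of such inserts, assuming two sample employees with empid values 1 and 2 (the names are illustrative only), followed by a scan to verify:
hbase(main):002:0> put 'hr_hbase','1','employee:fname','Prashant'
hbase(main):003:0> put 'hr_hbase','1','employee:lname','Nair'
hbase(main):004:0> put 'hr_hbase','2','employee:fname','Utkarsha'
hbase(main):005:0> put 'hr_hbase','2','employee:lname','M'
hbase(main):012:0> scan 'hr_hbase'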
2 row(s) in 0.0360 seconds
hbase(main):013:0> exit
Step 5: Start Hive. Create an external table and link it with the HBase table (a sample DDL is shown after the database listing below).
hive
hive> show databases;
OK
bigdataclassmumbai
default
Time taken: 0.02 seconds, Fetched: 2 row(s)
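A minimal sketch of the external table DDL, assuming the six columns used in the emp sample file later in this chapter (empid, fname, lname, city, state, country) and the HBaseStorageHandler that ships with Hive:
hive> use bigdataclassmumbai;
hive> CREATE EXTERNAL TABLE hr_hive(empid int, fname string, lname string, city string, state string, country string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,employee:fname,employee:lname,employee:city,employee:state,employee:country")
TBLPROPERTIES ("hbase.table.name" = "hr_hbase");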
OK
Time taken: 0.759 seconds
Step 6: Since we have already entered data in HBase, let us check to confirm
whether the same data is reflected in Hive.
hive> show tables;
OK
hr_hive
Time taken: 0.037 seconds, Fetched: 1 row(s)
hive> select * from hr_hive;
OK
Time taken: 0.781 seconds, Fetched: 2 row(s)
That is how you complete a successful integration.
Bulk Loading Data in an HBase Table Using Hive

Let us understand the current challenge. Let us try doing a bulk upload in Hive and check the data in HBase.
Lab 45 Bulk Loading Delimited File Directly in HBase

Step 1: Create a file named emp and add some relevant contents.
vi /home/hadoop/emp
3,Mithoon,C,Chennai,TamilNadu,India
4,Alex,Smith,Noida,Haryana,India
Now Save it !!!
Step 2: Upload the same in Hive.
LOAD DATA LOCAL INPATH '/home/hadoop/emp' INTO TABLE hr_hive;
You will get this kind of error:
hive> LOAD DATA LOCAL INPATH '/home/hadoop/emp' INTO TABLE hr_hive;
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Just because you get the above error does not mean that your command is wrong. The command is correct but is applicable only for native Hive tables. It is not meant for a table with non-native storage.
Let us now learn how to do a bulk upload from Hive in an HBase table.
Step 1: Assuming your emp file is available in /home/hadoop and you are in the Hive shell, create a staging table and load the data in the staging table (a sample DDL is shown below).
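A minimal sketch of the staging table, assuming the same six columns and a comma-delimited layout matching the emp file:
hive> CREATE TABLE hr_hive_staging(empid int, fname string, lname string, city string, state string, country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;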
OK
Time taken: 0.341 seconds
hive> LOAD DATA LOCAL INPATH '/home/hadoop/emp' INTO TABLE hr_hive_staging;
Loading data to table bigdataclassmumbai.hr_hive_staging
Table bigdataclassmumbai.hr_hive_staging stats: [numFiles=1, totalSize=69]
OK
Step 2: Now insert the data from the staging table into the actual table that is linked with HBase (hr_hive), as sketched below.
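A minimal sketch of the insert, assuming the staging and HBase-backed tables created above:
hive> INSERT INTO TABLE hr_hive SELECT * FROM hr_hive_staging;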

To check insertion, you can do select * from hr_hive. Also you can check the
same in HBase shell.

Bulk Loading Delimited File in HBase

In this section we will see how to upload a delimited file into HBase using the importTsv command.
Step 1: Create a sample file as shown below:
vi /home/hadoop/emp
1,Prashant,N,Mumbai,Maharashtra,India
2,Utkarsha,M,Aurangabad,Maharashtra,India
Step 2: Start HBase and create the table and column family.
hbase shell
hbase(main):002:0> create 'hr','emp'
0 row(s) in 1.3230 seconds
Step 3: Create a directory in HDFS and upload the file in the same directory.
hdfs dfs -mkdir /hbaseupload
hdfs dfs -put emp /hbaseupload
hdfs dfs -cat /hbaseupload/emp
Step 4: Upload the file using the following command:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,emp:fname,emp:lname,emp:city,emp:state,emp:country" hr /hbaseupload/emp
Step 5: Verify whether the data is uploaded successfully or not.
hbase shell
scan 'hr'
Note: It is important to ensure that the Hadoop version and HBase version are compatible with each other before performing the command.

Summary

Apache HBase is a NoSQL-compliant database that follows the CP model of the CAP Theorem.
HBase uses HDFS for its storage.
HMaster is responsible for coordinating with the RegionServers in terms
of assigning regions on startup, re-assigning regions for recovery, and
load balancing as well as monitoring RegionServers.
HRegionServers are responsible for serving data reads and writes.
Zookeeper is responsible for maintaining the live state of the cluster.
You can setup HBase in Single Node and MultiNode (Single Master &
MultiMaster). In production, we prefer to go for Multi Master Setup on
top of HA enabled Hadoop cluster.
Chapter 10
Data Acquisition using Apache Sqoop

This chapter will help you understand and implement in detail:

Apache Sqoop
Sqoop Operations
List Databases
List Tables
Codegen
Eval
Sqoop Import from DB to HDFS
Sqoop Import from DB to Hive
Sqoop Import from DB to HBase
Sqoop Export from HDFS to DB
Sqoop Jobs

Introducing Apache Sqoop
Apache Sqoop is a batch read-write tool designed to transfer data between Hadoop and RDBMS. In short, while working with Apache Sqoop, either the source or the destination must be a database. Sqoop performs two operations: Import and Export. The Import operation performs data transfer from

RDBMS to HDFS
RDBMS to Hive, and
RDBMS to HBase.

The Export operation performs data transfer from HDFS to RDBMS.
Apache Sqoop supports all Java-compliant Databases. Java-compliant
Databases refer to those Databases whose providers provide pure Java Driver.
Some of the examples of Databases supported by Sqoop are MySQL, MSSQL,
PostgreSQL, and so on.
For data transferring purposes, Apache Sqoop uses MapReduce paradigm for
its execution operation. Apache Sqoop internally creates a MapReduce job
which is then submitted to YARN; the output of which will be stored in the
destination location.
Let us see how to install Apache Sqoop. We will be using Apache Sqoop 1.4.6
for all examples.

Requirements for Sqoop Lab Exercises

1. Ensure Hadoop is up and running.
2. Ensure MySQL is Installed and Configured. I will assume a username to
access MySQL database as ‘root’ and password as ‘123456’.
3. Ensure you have created a sample Database and table. Conversely, you
can use the Database dump available in the support files downloaded
earlier.
4. Ensure you have downloaded the MySQL Java Connector.
5. Ensure Apache Hive is installed. This is required for Sqoop Import to
Hive demo.
6. Ensure Apache HBase is installed. This is required for Sqoop Import to
HBase demo.

Installing and Configuring Apache Sqoop

Lab 46 Installing and Configuring Apache Sqoop 1.4.6
Step 1: Download the Apache Sqoop tar file and place the same in the home folder of the hadoop user (/home/hadoop).
wget http://redrockdigimark.com/apachemirror/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Step 2: Extract the tar file and rename the extracted folder as sqoop.
tar -xvzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop
Step 3: Add the below entries in the
.bashrc file
vi /home/hadoop/.bashrc
#Sqoop Configuration
export SQOOP_PREFIX=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_PREFIX/bin
export HADOOP_HOME=/home/hadoop/hadoop2
export HADOOP_COMMON_HOME=/home/hadoop/hadoop2
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop2
export HIVE_HOME=/home/hadoop/hive
export HBASE_HOME=/home/hadoop/hbase
Step 4: Copy the MySQL jar file into the lib folder of Sqoop.
cp /home/hadoop/mysql-connector-java-5.1.42-bin.jar
/home/hadoop/sqoop/lib/.
Sqoop Installation is done successfully.
Note: Here I am assuming you have installed MySQL in your system. In case you haven't installed it, run the following command to install MySQL:
sudo apt-get install mysql-server
To start MySQL,
mysql -u root -p123456
Refer to https://bigdataclassmumbai.com/hadoop-book for more details on setting up MySQL and loading example databases for Sqoop practice.

Sqoop List Databases Command

This command can be used to test Apache Sqoop connectivity with the database, since this command does not spin up any MapReduce job.
Lab 47 Listing the Databases in MySQL Using Apache Sqoop

Try the following:


sqoop list-databases --connect jdbc:mysql://hadoopvm:3306 --
username root --password 123456
If the above command can list all the databases available in the DB, then your
Sqoop is working properly. Please note that you may get some warnings, for
instance, HCAT_HOME not present or ACCUMULO_HOME not present. You
can ignore these since we are not using the same. If you want to use them, then
please ensure the above mentioned variables are set in .bashrc file.
Sqoop Command Parameters
To get the list of parameters for Sqoop support, type:
sqoop help
The output lists all the available Sqoop tools along with a short description of each. All the commands are self-explanatory. We will explore some of the important and most commonly used parameters individually, since covering everything here is out of the scope of this book.

Sqoop List Tables

This command is used to list all the tables available in your database.
Lab 48 Listing the Tables in a Database in MySQL Using Apache Sqoop
Following is an example usage of the list command:
sqoop list-tables --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456
Sqoop Codegen
Sqoop’s codegen parameter is beneficial for Java programmers. Every Database
table has one DAO class which contains a ‘getter’ and ‘setter’ class. Codegen
generates the DAO class automatically.
Lab 49 Generating DAO for a Table Using Apache Sqoop

Let's see how to generate one:
Step 1: Ensure Hadoop services are live and active.
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
jps
Step 2: Ensure MySQL has relevant data
Database: hadoopbook
Table: emp
Step 3: Type the following command
sqoop codegen --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp
The above command will generate 3 objects:
emp.class
emp.jar
emp.java
Sqoop Eval Command
Sqoop’s Eval command parameter allows a user to perform DDL and DML
queries against the DB and preview the results in the console. We will see two
evaluations:

1. Select Query Eval


2. Insert Query Eval

Lab 50 Perform Sqoop Eval for Select Query

a. Select Query Eval: Let's evaluate the emp table with a simple select query.
sqoop eval --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --query "select * from emp"
17/06/10 01:29:17 INFO sqoop.Sqoop: Running Sqoop version:
1.4.6
17/06/10 01:29:17 WARN tool.BaseSqoopTool: Setting your
password on the command-line is insecure. Consider using -P
instead.
17/06/10 01:29:18 INFO manager.MySQLManager: Preparing to
use a MySQL streaming resultset.
Lab 51 Perform Sqoop Eval for Insert Query

a. Insert Query Eval: Let's try inserting a record in the table.
sqoop eval --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --query "insert into emp values (4,'Premanand','Nair',400000)"
17/06/10 01:31:17 INFO sqoop.Sqoop: Running Sqoop version:
1.4.6
17/06/10 01:31:17 WARN tool.BaseSqoopTool: Setting your
password on the command-line is insecure. Consider using -P
instead.
17/06/10 01:31:17 INFO manager.MySQLManager: Preparing to
use a MySQL streaming resultset.
17/06/10 01:31:18 INFO tool.EvalSqlTool: 1 row(s) updated.
Sqoop Import Operation
Sqoop Import operation refers to importing data from the Database engine to
HDFS or Hive or HBase as shown in the figure below:

Fig 1 – Apache Sqoop Import Operation Flow
As shown in the above diagram, when we initiate the Sqoop Import operation,
the following operations take place at the backend:

1. Sqoop parses the query provided to ensure that there exist no syntax
errors and all required parameters are satisfied. If the required
parameters are not passed, it may either fail or take the default values.
2. After parsing, Sqoop performs code generation. Code generation is
about generating an MR code DAO which can later be submitted to
YARN to initialize the processing. Sqoop internally uses MR for its
operation.
3. Once the code is generated (technically the JAR file), the JAR is then
submitted to YARN to perform processing. Once processing is done,
the mapper output will then be stored in the HDFS output location
defined in the Sqoop command.
4. If you are performing Hive or HBase Transfer, the above step will be
considered a staging output which will then be transferred to Hive or
HBase depending on the type of Import.

In this section we will see how to perform import operation.

Importing a Table from Database Having Primary Key Column

Lab 52 Importing a Table from Database Having Primary Key Column Using Sqoop

In this example we will see how to perform the Import operation from Database to HDFS. We will use MySQL as the database. We have already created a database named 'hadoopbook' which contains one table named 'emp'.
Let us see how to import the contents of the entire table in HDFS location.
Source: MySQL Database (DBName: hadoopbook; Table Name: emp)
DBUsername: root
DBPassword: 123456
ConnectionString: jdbc:mysql://hadoopvm:3306

sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp
In the above command, we didn't specify the HDFS destination location. By default, Sqoop will store the output data under /user/hadoop (the HDFS home folder of the hadoop user), in a subdirectory named after the table, here /user/hadoop/emp.
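A quick way to inspect the result, assuming the default location and the emp table used above (the part file names are generated by the mapper tasks):
hdfs dfs -ls /user/hadoop/emp
hdfs dfs -cat /user/hadoop/emp/part-m-*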

Importing a Table without a Primary Key and Specifying the Destination

Lab 53 Importing a Table from Database without a Primary Key Column and Specifying a Destination Location in HDFS Using Sqoop

If your table doesn't have a primary key, you can perform the Import operation by adding the -m parameter as shown below:
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp --target-dir /sqoop/empdata -m 1
In this example, your output will be stored in the /sqoop/empdata location.
The -m parameter is used to set the number of mappers. For ease of understanding, I set only 1. The default number of mappers in a Sqoop job is 4. Remember that the number of mappers equals the number of output files, since Sqoop mostly performs mapper-only jobs.

a. Importing a subset of data from a table

If you want to import data from the database based on some conditions, you can use the --where parameter to achieve the same. Let us see how to do it.

Importing a subset of data from a table in DB

Lab 54 Importing a Subset of Data from a Table in Apache Sqoop

Problem statement: Import all records from emp table whose salary is greater
than 200000
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp --target-dir /sqoop/emp_subset_2L --where "esal > 200000"
Import Using SQL Query
Lab 55 Importing a Table Using SQL Query in Apache Sqoop
If you want to import data from a table using SQL query, you can use --query
parameter.
Problem statement: Import all records from emp table whose salary is greater
than 90000 and who live in Mumbai location.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --target-dir /sqoop/output90000Mumbai --query "select * from emp where esal > 90000 and location = 'Mumbai' and \$CONDITIONS" -m 1
Where:
--query will accept my query to be executed and the output of the same will
be emitted in my output location.
Please note that whenever you apply --query parameter, it is mandatory to
specify the where condition with $CONDITIONS as shown in the Sqoop query
above. For the next example, we will use --query again to show how to use it
even when you don’t need to declare the ‘where’ clause in the query.

Changing the Output Delimiter During Import

In every Import operation you have performed so far, the field delimiter of the output is a comma. However, you can change the field delimiter if required. This can be achieved using the --fields-terminated-by parameter.
Let us see how to do it:
Lab 56 Changing the Output Delimiter During Import

Problem Statement: Import first two records in HDFS location /sqoop/final with
delimiter as ‘|’ and ensure that there exists only one mapper output.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --target-dir /sqoop/final --query "select * from emp where \$CONDITIONS LIMIT 2" -m 1 --fields-terminated-by '|'
Where:
-m 1 ensures that there exists only one mapper output.
--fields-terminated-by ensures the delimiter will be '|'.

Dealing with Incremental Import

HDFS by default is WRITE ONCE storage; it only supports append. Keeping this in mind, Sqoop provides an incremental import technique to deal with the recent data that is stored in the database. Let's see the usage of this command:
Lab 57 Performing Incremental Import
Problem Statement: Perform Import of the recent data stored in Database in the
existing HDFS location containing previously imported data. The last record
inserted was eid=5.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook
--username root --password 123456 --table emp -m 1 --
incremental append --check-column eid --last-value 5
Where:
--incremental append is responsible for appending the output to the existing HDFS location.
--check-column is used to specify which column the incremental import depends upon.
--last-value is used to specify the last value of the check-column that is already present in HDFS.

Importing Data from Database to Apache Hive

In this section, we will see how you can transfer the data from Database to Apache Hive. Let's first understand how it works.

Fig 2 – Apache Sqoop Hive Import Operation Flow
As shown above, Import to Hive is a three-step process:

1. Data from the database will be initially transferred to the HDFS


temporary location. A user can define the temporary location using --
target-dir parameter, but it is not mandatory. If --target-dir is not
specified, the default HDFS location will be /user/<username>/<tablename_in_db>.
2. Once the data is committed in the HDFS storage, Sqoop initiates a
Hive transfer by connecting to Hive and ensuring the Hive database
exists. You can manually specify in which Hive Database the table is to
be created. Please note that Sqoop cannot create a new Database in
Hive. It is mandatory to have a database created before initiating the
Hive Import. To manually specify which database and table (in Hive)
the data is to be transferred to, you can use --hive-table parameter. If
the desired table does not exist in Hive, you can make Hive create one
by specifying --create-hive-table parameter. Please note if the table
already exists in Hive and if you pass --create-hive-table parameter,
Sqoop will throw an exception and the transfer will terminate.
3. Once Hive accepts the data, the temporary folder will be deleted and
the operation will be completed with a success flag.

Lab 58 Importing Data from Database to Hive

Problem Statement: Import data from Database to Apache Hive directly ensuring
a new table named ‘employee’ gets created inside the database named hr in
Apache Hive and HDFS temporary location/staging area must
be/sqoop/hivetemp2. hr database in Apache Hive exists.
Solution
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook
--username root --password 123456 --table emp --hive-import --
create-hive-table --hive-table hr.employee --target-dir
/sqoop/hivetemp2
Where:
--hive-import – Ensures the import is a Hive import.
--create-hive-table – Ensures a table called 'employee' inside the hr database is created automatically during the Import process. The name of this table is mentioned in the --hive-table parameter.
--hive-table – Specifies the name of the table to be maintained/used during Hive Import.

Importing Data from Database to Apache HBase

In this section, we
will see how to transfer data from the Database to HBase. In the case of HBase,
the operation is similar to Hive Import, with the only difference being that you
must create a table and a column-family in HBase before performing Import
operation. Also it is the responsibility of the user to determine which field in the
table will be the row key. Usually the row key is the primary key of the table.
However, any field in the database can act as a row key. The only criterion for
the field to be a row key is that it should hold unique values in the column.
Lab 59 Importing Data from Database to HBase

Problem Statement: Perform Database to HBase transfer with the following specifications:
Row Key: eid
Table: hrdept
ColumnFamily: employee

Solution
Step 1: Start HBase shell
hbase shell
Step 2: Create a table named hrdept with columnfamily employee
hbase(main):001:0> create 'hrdept','employee'
0 row(s) in 1.5500 seconds
=> Hbase::Table - hrdept
hbase(main):002:0> exit
Step 3: Initiate the Sqoop transfer using the following command:
sqoop import --connect jdbc:mysql://hadoopvm:3306/hadoopbook --username root --password 123456 --table emp --hbase-table hrdept --column-family employee --hbase-row-key eid -m 1
Where:
--hbase-table is used to define the name of the HBase table.
--column-family is used to specify the name of the column family in which
the data is to be transferred.
--hbase-row-key is used to specify which field in the table should be
considered as the row key for HBase.
Step 4: Check the output in HBase.
hbase(main):001:0> scan 'hrdept'
Note: Sqoop provides a parameter called --hbase-create-table which can create the HBase table and column family. However, this works only when your Sqoop build is compatible with your HBase version. See the link below for more details: https://issues.apache.org/jira/browse/SQOOP-2759

Sqoop Export Operation
exporting the data, it is mandatory to create a Database and a table in RDBMS
before initiating the Sqoop command. Sqoop internally cannot create the
Database and table automatically in the RDBMS engine.
Let’s see how the Export works:
Lab 60 Perform Sqoop Export Operation from HDFS to Database

Problem Statement: Transfer the data from HDFS to Database using the
following specifications: HDFS file location - /data/sample
Database Name – test
Table Name – hobby
Table Fields – eid(int),ehobby(varchar(30))
Solution
Step 1: Open MySQL and create the relevant database and table.
mysql -u root -p123456
mysql> create database test;
Query OK, 1 row affected (0.01 sec)
mysql> use test;
Database changed
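The table creation statement, matching the fields given in the problem statement above, would be:
mysql> create table hobby(eid int, ehobby varchar(30));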

Query OK, 0 rows affected (0.01 sec)
Step 2: Initiate the export operation.
sqoop export --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table hobby --export-dir /data/sample --input-fields-terminated-by ',' -m 1
Step 3: Check the result in MySQL.
mysql> select * from test.hobby;
4 rows in set (0.02 sec)
Working with Sqoop Jobs
Sqoop provides a feature to create a job and execute the same later. The job can
also be scheduled - either using shell script or using Apache Oozie. Sqoop jobs
are ideally used for two operations:

a. Import Operation
b. Export Operation

Let us see an example to understand how a Sqoop Import job works:
Lab 61 Creating and Working with Sqoop Job
Step 1: Let us first check if there are any existing saved Sqoop jobs.
sqoop job --list
Step 2: Let us create a new Sqoop job to transfer data from the Database to HDFS with the following specifications:
Database Name: hadoopbook
Table Name: emp
HDFS location: /sqoop/job1output
sqoop job --create ImportJob1 -- import --connect
jdbc:mysql://hadoopvm:3306/hadoopbook --username root --
password 123456 --table emp --target-dir /sqoop/job1output -m 1
Note: There is a space between -- and import in the command above. Please ensure that you include the space to avoid errors.
Step 3: Running ImportJob1
sqoop job --list
Available jobs:
ImportJob1
sqoop job --exec ImportJob1
The above command may ask you to enter the Database password. Enter the
same and the command will execute the created job.
Let us see some more parameters of Sqoop Jobs.

a. Delete: This parameter is used to delete the job based on JobName.
e.g. sqoop job --delete ImportJob1
b. Show: This parameter is used to show the values of each parameter of
the Job.
e.g. sqoop job --show ImportJob1

Summary

Apache Sqoop is meant to perform data transfer between Java-compliant Databases and Hadoop.
Sqoop Import refers to transferring data between Database and HDFS,
Database and Hive, or Database and HBase.
Sqoop internally stores all staging data in HDFS during Hive import
operation.
Sqoop currently supports --hbase-create-table parameter only if Sqoop is
compatible with HBase. Ideally it is recommended that you create a table
before initiating the transfer or ensure HBase-Sqoop Compatibility.
You can use Sqoop Job feature to create a job. You can also schedule the
jobs using shell and crontab or using Apache Oozie.
Chapter 11
Apache Oozie

In this chapter, you will learn:

Introduction to Apache Oozie.
Apache Oozie Components.
Apache Oozie Architecture.
Building, Installing and Configuring Apache Oozie.
Running a Sample Application.
Creating an MR workflow.
Common Apache Oozie Commands.

Introduction to Apache Oozie
Apache Oozie is a scalable, reliable, and extensible Workflow Scheduler and
orchestration system for Hadoop and all its Ecosystem jobs. Apache Oozie
allows orchestration of complex multi-stage Hadoop and its Ecosystem
component jobs. A workflow in Apache Oozie is a set of Directed Acyclic Graph
of Actions. The basic building blocks of Apache Oozie are Workflows,
Coordinators, and Bundles. Let us understand each building block of Apache
Oozie.
A workflow is a set of actions - the order and the conditions under which those
actions should be performed. When we say set of actions, we may refer them to
a Hive Query, Pig Query, Shell Scripts, Python Code, MapReduce Job, or a
Scala program.
We might want workflows to run at certain times and frequencies, provided
the input to the workflow is available. For this kind of control, a coordinator is
used in Apache Oozie. The coordinator schedules the execution of the workflow
at a specified time. The coordinator also ensures that the workflow execution is delayed as long as the input data is unavailable. Each workflow can have its own coordinator.
A collection of co-ordinator jobs which can be started, stopped, and modified
together is called a Bundle. The output of one co-ordinator job managing the
workflow can be the input to another co-ordinator job. Thus, a bundle can be a
data pipeline. Let us take a hypothetical example to understand each term of
Apache Oozie.
The figure below is an example of a complex multi-stage Hadoop job.

The above example is an Apache Oozie Bundle Job consisting of 3 workflows: WorkFlow1 (WF1), WorkFlow2 (WF2) and WorkFlow3 (WF3). Each workflow consists of a set of directed actions.
For example in WF1, loading data from HDFS is an action. WF1 and WF2 are
independent workflows which can run in parallel, and WF3 is dependent on the output of WF1 and WF2 since they will act as inputs for WF3. Thus, for WF3 the execution will be taken care of by a coordinator, which will keep the execution paused until both WF1 and WF2 have produced their output. This entire operation can be managed by a single Apache
Oozie job and you as an Oozie admin will need to take care of only this Apache
Oozie job. The Apache Oozie job is an xml-coded file that contains information
of workflows and its execution properties. In this chapter, we will explore how
to create one.

What is an Oozie Application?
An Apache Oozie Application is an XML file that describes the application. It
includes configuration files, JARs, scripts, and so on that performs their
individual actions. The application can be a workflow which runs manually, or
by a single co-ordinator, or using a set of co-ordinators forming a bundle. We
will deal with 3 files while working with Oozie orchestrations. They are:

workflow.xml.
coordinator.xml.
bundle.xml.

When it comes to execution of an Apache Oozie application, Oozie expects all
the files to be in HDFS before the execution starts. So technically, the XML files
along with the dependency JARs and scripts that are required for an Apache
Oozie application must be uploaded in the HDFS before the application is
executed.

Typical Apache Oozie Architecture

The Apache Oozie Client is responsible to query the Oozie Server. The
Apache Oozie client is a CLI tool that the admin/user will use to interact with
the Oozie server. A user will submit an Oozie application via the Apache Oozie
client to the Oozie server.
The Apache Oozie server stores the information in the database and retrieves
all information necessary to run an Apache Oozie application. It then runs the
application in the Hadoop cluster. The Apache Oozie server is the core layer of
the processing in Apache Oozie responsible to manage the Apache Oozie job
scheduling and execution of any Oozie applications that the user/admin submits
to Oozie. The Apache Oozie server runs in a web container like Apache Tomcat.
Apache Oozie server is a stateless server. However, it holds the state and job
related information of the Apache Oozie application in the database. The Apache
Oozie server provides a REST API and Java client so that client can be written in
any language.
Technically, the Apache Oozie Server acts like a client application for Hadoop
cluster. It reads their XML files from HDFS and runs the job on top of the
Hadoop Cluster. It can run MapReduce jobs, Hive, Pig, Sqoop, DistCp, Spark,
and many more.
The database in the Apache Oozie Architecture is used to store all job related
information like, state of the job, reference of the job, the binaries or JARs the
job needs to execute, and so on. Apache Oozie supports Derby, MySQL, Oracle,
MSSQL, HSQL, and PostgreSQL databases. The default database is Derby
which is built-in the Apache Oozie setup. However, in production we prefer to
use other databases.

Building and Installing Apache Oozie

Apache Oozie does not provide a binary tarball, since every Hadoop cluster and the configuration of its ecosystem are unique and vary in terms of compatibility, versioning of frameworks, and integration techniques.
Lab 62 Building and Installing Apache Oozie

We will be working on Apache Oozie 4.2.0. Let us now start it step by step:
Part1: Prepare the System for Building Oozie.
Step 1: It requires Apache Maven which can be downloaded from
http://maven.apache.org/download.cgi. Once downloaded, untar the same.
wget http://www-eu.apache.org/dist/maven/maven-3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz
tar -xvzf apache-maven-3.5.0-bin.tar.gz
Step 2: Inform the system about Apache Maven. This can be done by setting up environment variables in the .bashrc file.
.bashrc file.
vi /home/hadoop/.bashrc
export PATH=$PATH:/home/hadoop/apache-maven-3.5.0/bin
Now Save it !!!
Step 3: Apply the changes.
exec bash
Step 4: Download Apache Oozie and extract the same
wget http://archive.apache.org/dist/oozie/4.2.0/oozie-4.2.0.tar.gz
tar -xvzf oozie-4.2.0.tar.gz
Step 5: Open pom.xml and perform the following,
cd oozie-4.2.0
vi pom.xml
Find the below strings and perform the changes:

a. Edit Java Version. Search for <targetJavaVersion> tag in the file and
change the version of Java to 1.7 assuming you have installed JDK7. In
case you are using JDK8, you could change the below line to 1.8.
Ideally, it will be line 48 of your pom.xml file (if you are using the same
version as the one I am using).
<targetJavaVersion>1.7</targetJavaVersion>
b. Search maven-javadoc-plugin. You may see a tag <artifactId>maven-
javadoc-plugin</artifactId>. Ideally, it will be line 1491 of your
pom.xml (if you are using the same version as the one I am using). Add
the following lines inside the <configuration> tag. Please note that the
<additionalparam> tag is applicable only when you work in JDK8. In
case you are using JDK7, please don’t add the <additionalparam> tag
line.
<javadocExecutable>/usr/lib/jvm/java-7-
oracle/bin/javadoc</javadocExecutable>
<additionalparam>-Xdoclint:none</additionalparam>
c. Search hadoop.version and activate the profile of hadoop-2(true) and the
rest false. In my case, it is line 79 of my pom.xml file that I use for
changing the Hadoop version. Ensure you enter the same Hadoop
version that you have installed. While writing this book, I used Hadoop-
2.7.3 to test Apache Oozie. In case you are using Hadoop-2.8.0, make
the changes accordingly.
<hadoop.version>2.7.3</hadoop.version>
<hadoop.majorversion>2</hadoop.majorversion>
To activate hadoop2 profile, search hadoop-2. In my case, it is line 1788
of my pom.xml. Change the following tags as given below:
<activeByDefault>true</activeByDefault>
<hadoop.version>2.7.3</hadoop.version>
d. Search for Codehaus repository url. We will need to replace the link
with the one given below. In my case, it is line 145 of my pom.xml.
<url>https://repository-
master.mulesoft.org/nexus/content/groups/public/</url>
Now save pom.xml after all the 4 changes are done as explained above.

Step 6: Navigate to hadooplibs folder and open pom.xml


cd hadooplibs
vi pom.xml
Search for hadoop-1 profile and change it to false. In my case, it is line 48.
Make changes as shown below in line 50.
<activeByDefault>false</activeByDefault> Search for hadoop-2
profile and change it to true. In my case, it is line 70. Make changes as
shown below in line 72.
<activeByDefault>true</activeByDefault> The above steps will
ensure that Apache Oozie builds for Hadoop2 instead of the Hadoop1
version. Save pom.xml.
Step 7: Navigate to the bin folder of Apache Oozie and start the build command
by invoking mkdistro.sh script.
cd ..
cd bin
./mkdistro.sh -DskipTests
Once done, you will get a message saying "Oozie distro created" along with the path. Ideally it will be present in the /home/hadoop/oozie-4.2.0/distro/target folder, named oozie-4.2.0-distro.tar.gz. Extract the same, rename the folder and copy it to the /home/hadoop location.
cd /home/hadoop/oozie-4.2.0/distro/target
tar -xvzf oozie-4.2.0-distro.tar.gz
mv oozie-4.2.0 oozie
cp -r oozie /home/hadoop/.
Part 2: Copy all Hadoop JARs to Apache Oozie in order to resolve the dependencies of Apache Oozie applications. Assuming your present working directory is /home/hadoop:
cd oozie
mkdir libext
cd libext
Copy the jars from hadooplibs folder of oozie src.
cp /home/hadoop/oozie-4.2.0/hadooplibs/hadoop-utils-
2/target/oozie-hadoop-utils-hadoop-2-4.2.0.jar .
cp /home/hadoop/oozie-4.2.0/hadooplibs/hadoop-auth-
2/target/oozie-hadoop-auth-hadoop-2-4.2.0.jar .
cp /home/hadoop/oozie-4.2.0/hadooplibs/hadoop-distcp-
2/target/oozie-hadoop-distcp-hadoop-2-4.2.0.jar .
Copy jars from Hadoop folder
cp $HADOOP_INSTALL/share/hadoop/common/*.jar .
cp $HADOOP_INSTALL/share/hadoop/common/lib/*.jar .
cp $HADOOP_INSTALL/share/hadoop/hdfs/*.jar .
cp $HADOOP_INSTALL/share/hadoop/mapreduce/*.jar .
cp $HADOOP_INSTALL/share/hadoop/yarn/*.jar .
Part 3 – Configure Hadoop for Apache Oozie interaction.
Ensure all Hadoop services are stopped before performing the following changes.
Step 1: Setup core-site.xml
vi /home/hadoop/hadoop2/etc/hadoop/core-site.xml
Add the following properties within the configuration tag:
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
Step 2: Setup oozie-site.xml
vi /home/hadoop/oozie/conf/oozie-site.xml
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/home/hadoop/hadoop2/etc/hadoop/</value>
</property>
Now you can start Hadoop
Part 4: Prepare and Configure Apache Oozie.
Step 1: Install the zip and unzip packages in the system, since they will be used by oozie-setup.sh while preparing the war file.
sudo apt-get install zip unzip
Step 2: Prepare the war file for the Apache Oozie server web app. Assuming that your present working directory is /home/hadoop, perform the following:
cd oozie
bin/oozie-setup.sh prepare-war
If everything goes positive, you will get a message saying, “Oozie is ready to
be started”.
Step 3: Setup the sharelib that will be used by Apache Oozie applications.
bin/oozie-setup.sh sharelib create -fs hdfs://hadoopvm:8020
You will get a similar path of sharelib like the one shown below. Make a note
of this path since you may require it while triggering the Apache Oozie
application.
Note: the destination path for sharelib is:
/user/hadoop/share/lib/lib_20170524081704
Step 4: Setup the Apache Oozie database. This will maintain all the job-related
information as discussed in Apache Oozie Architecture.
bin/ooziedb.sh create -sqlfile oozie.sql -run
You will get a message as shown below indicating the database is ready.
Oozie DB has been created for Oozie version ‘4.2.0’
Step 5: Copy all JARs present in the libext folder into the Apache Oozie server webapp. Assuming your present working directory is /home/hadoop, perform the following command:
cp -r oozie/libext/*.jar oozie/oozie-server/webapps/ROOT/WEB-INF/lib/.
Part 5: Start Apache Oozie.
Assuming your present working directory is /home/hadoop,
cd oozie
To start Apache Oozie:
bin/oozied.sh start
To check whether Apache Oozie is working normally or not, type:
bin/oozie admin -oozie http://hadoopvm:11000/oozie -status
To stop Apache Oozie, type:
bin/oozied.sh stop
Testing Apache Oozie with Example Workflow

Lab 63 Testing Apache Oozie with an Example Mapreduce Workflow

Let us now test whether Apache Oozie is capable of running a workflow or not. To do so, we will use the examples provided in the Apache Oozie source code folder. You can check all examples in the location given below:
ls oozie-4.2.0/examples/target/oozie-examples-4.2.0-examples/examples/apps/
Let us try executing a MapReduce example (map-reduce).
Step 1: Navigate to the map-reduce folder.
cd oozie-4.2.0/examples/target/oozie-examples-4.2.0-examples/examples/apps/map-reduce
Step 2: Edit the job.properties file
with the relevant configurations.
vi job.properties
Edit the lines as shown below:
nameNode=hdfs://hadoopvm:8020
jobTracker=hadoopvm:8032
#Add the below line. It's not there by default in the properties file
oozie.system.libpath=true
Now Save it !!!
Step 3: Upload the examples folder in the home folder of HDFS
cd ../../../
hdfs dfs -put examples examples
This will upload your examples to the /user/hadoop/examples HDFS location.
Step 4: Running an Apache Oozie application. Assuming your present working directory is /home/hadoop:
cd oozie
bin/oozie job -oozie http://hadoopvm:11000/oozie -config
/home/hadoop/oozie-4.2.0/examples/target/oozie-examples-4.2.0-
examples/examples/apps/map-reduce/job.properties -run job:
0000000-170626143054944-oozie-hado-W
You will get a job id similar to the one shown above. You can now check the
status of the job. Please note that after info it is the job id that I have picked from
the previous command output.
oozie job -oozie http://hadoopvm:11000/oozie -info 0000000-
170626143054944-oozie-hado-W
In case there are any errors, you can check the log of the job using:
oozie job -oozie http://hadoopvm:11000/oozie -log 0000000-170626143054944-oozie-hado-W
Now we can say that our Apache Oozie setup is done successfully.
Creating a MapReduce Workflow
Workflow is the fundamental building block of Apache Oozie that contains a set
of actions. An action does the actual processing in the workflow. An action can
be a Hadoop MR job, Sqoop job, Hive scripts, Pig scripts, Shell scripts, Spark
Jobs, and so on.
Let us now learn how to create a MapReduce workflow. Ideally, as an
administrator, you will be given a JAR file which contains an MR code to be
triggered.
Lab 64 Creating and Running a Mapreduce Workflow

We can now see how to create an Apache Oozie application with a single
workflow.
Step 1: Create a folder that will hold your Apache Oozie application. Assuming your present working directory is /home/hadoop:
mkdir MROozie
mkdir MROozie/input
mkdir MROozie/lib
As mentioned above, the input folder will have the input file which is to be
processed, and the lib folder will have the MR JAR file which is to be executed.
Step 2: Copy MR job JAR file provided in the Apache Oozie example. This is
the same JAR file that we tried in our YARN chapter.
cp -r /home/hadoop/oozie-4.2.0/examples/target/oozie-examples-
4.2.0-examples/examples/apps/map-reduce/lib/oozie-examples-
4.2.0.jar /home/hadoop/MROozie/lib/.
Step 3: Create a sample input file inside input folder of MROozie.
vi /home/hadoop/MROozie/input/sample.txt
Welcome to Oozie
Welcome to Oozie
Step 4: Create a dedicated directory named oozie where we will maintain all our
applications which will be orchestrated by Apache Oozie.
hdfs dfs -mkdir oozie
hdfs dfs -mkdir oozie/mapreduce
Step 5: Let us understand and create job.properties file
Ideally, this file resides in the local file system that will be used while
launching the Apache Oozie application through the Apache Oozie client. You
can consider this file as an entry point of an Apache Oozie application similar to
main () in Java or C++ program. This file contains all necessary information like
input, output, and other necessary arguments and parameters for Apache Oozie
to invoke the workflow.
vi /home/hadoop/MROozie/job.properties
nameNode=hdfs://hadoopvm:8020
jobTracker=hadoopvm:8032
queueName=default
oozieRoot=oozie
oozie.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${oozieRoot}/mapreduce
Following is the explanation for each configuration set in this file:
nameNode – This parameter is used to set the NameNode URL of the Hadoop cluster. It helps Apache Oozie interact with HDFS for storage-related operations and to access applications residing in HDFS. Value: hdfs://hadoopvm:8020
jobTracker – This parameter is used to specify where we submit the MR job/application. In our case we specify the default YARN ResourceManager port (8032). Value: hadoopvm:8032
queueName – Used to specify the queue to be applied. Value: default
oozieRoot – Used to specify the root folder name where the application will reside. In our case we have created the oozie folder in the home folder of HDFS. Value: oozie
oozie.system.libpath – Specifies that Oozie should look for JARs and essential libraries in the sharelib path. Value: true
oozie.wf.application.path – This parameter specifies the location of workflow.xml in HDFS, which is the actual workflow. Value: ${nameNode}/user/${user.name}/${oozieRoot}/mapreduce
Step 6: Let us now write the Apache Oozie application XML code, the workflow.xml file. This file is the actual operational workflow; a minimal sketch is shown below.
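The exact contents depend on the JAR being triggered; the following is a minimal sketch modeled on the bundled map-reduce example, assuming the SampleMapper and SampleReducer classes from the oozie-examples JAR copied in Step 2 and the HDFS layout created in this lab (adjust the class names and paths for your own JAR):
vi /home/hadoop/MROozie/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="mr-sample-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <!-- Mapper and Reducer classes bundled in lib/oozie-examples-4.2.0.jar -->
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.apache.oozie.example.SampleMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.apache.oozie.example.SampleReducer</value>
                </property>
                <!-- Input/output paths assume the application is uploaded to /user/hadoop/oozie/mapreduce -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/hadoop/oozie/mapreduce/input</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/hadoop/oozie/mapreduce/output</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MR job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>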
Step 7: Copy the created application in HDFS.
hdfs dfs -put /home/hadoop/MROozie/* /user/hadoop/oozie/mapreduce
Step 8: Run the Apache Oozie application.
oozie job -oozie http://hadoopvm:11000/oozie -config /home/hadoop/MROozie/job.properties -run

Common Commands in Apache Oozie
Let us explore some of the commonly used Apache Oozie commands

1. Running a Job in Apache Oozie
oozie job -oozie http://hadoopvm:11000/oozie -config
/home/hadoop/MROozie/job.properties -run
2. Submit a Job in Apache Oozie
oozie job -oozie http://hadoopvm:11000/oozie -config
/home/hadoop/MROozie/job.properties -submit
3. Check the Status of Job in Apache Oozie
oozie job -oozie http://hadoopvm:11000/oozie -info
0000006-170626143054944-oozie-hado-W
4. Suspending a Workflow in Apache Oozie
oozie job -oozie http://hadoopvm:11000/oozie -suspend 0000006-170626143054944-oozie-hado-W
5. Resume a Suspended Workflow in Apache Oozie
oozie job -oozie http://hadoopvm:11000/oozie -resume 0000006-170626143054944-oozie-hado-W
6. Re-run a Workflow
oozie job -oozie http://hadoopvm:11000/oozie -config /home/hadoop/MROozie/job.properties -rerun 0000006-170626143054944-oozie-hado-W
7. Kill an Apache Oozie job
oozie job -oozie http://hadoopvm:11000/oozie -kill 0000006-170626143054944-oozie-hado-W
8. Viewing Server Logs
oozie job -oozie http://hadoopvm:11000/oozie -log
0000006-170626143054944-oozie-hado-W
9. Updating Shared Lib if You Add any JAR files
oozie admin -oozie http://hadoopvm:11000/oozie -
sharelibupdate
10. To List Available Sharelib
oozie admin -shareliblist -oozie
http://hadoopvm:11000/oozie

Summary

Apache Oozie is an orchestration engine used to orchestrate jobs in a Hadoop cluster.
Apache Oozie by default supports Hive, Distcp, hcatalog, Sqoop,
MapReduce-streaming, Spark, Hive2, and PIG.
job.properties file holds all the parameters that are passed in
workflow.xml.
A workflow holds all actions that are to be executed in the given Hadoop
cluster.
Chapter 12
Introducing Pig, Spark and Flume

In this chapter, we will see how to install and configure:

Apache Pig.
Apache Spark.
Apache Flume.

Apache Pig Installation Steps
Apache Pig was a project initiated and developed by Yahoo and later open-sourced under the Apache License 2.0. Pig is used by scripters and is a useful tool when it comes to performing ad hoc analysis. Pig internally uses MapReduce and Tez as its processing engines, and HDFS and HDFS-compliant storage as its storage engine. In this section, we will see how to install Pig and run Pig in MapReduce mode. Here I am using Pig-0.17. One of the best features in Pig 0.17 is the introduction of Spark as a supported execution engine.
Pig now supports the following execution modes:

LOCAL
MAPREDUCE
TEZ_LOCAL
TEZ
SPARK_LOCAL (new inclusion in Pig-0.17.0)
SPARK (new inclusion in Pig-0.17.0)

Lab 65 Installing and Configuring Apache Pig

Step 1: Download the Apache Pig binaries from pig.apache.org into the /home/hadoop location.
wget http://redrockdigimark.com/apachemirror/pig/pig-0.17.0/pig-0.17.0.tar.gz
Step 2: Extract the tar file, assuming that your present working directory is /home/hadoop.
tar -xvzf pig-0.17.0.tar.gz
Step 3: Rename the extracted folder.
mv pig-0.17.0 pig
Step 4: Setup bashrc
vi /home/hadoop/.bashrc
export PIG_PREFIX=/home/hadoop/pig
export PIG_HOME=$PIG_PREFIX
export PATH=$PATH:$PIG_PREFIX/bin
Save it !
Step 5: Update the bash
exec bash
Step 6: If you want Pig to work with Spark, you need to export the following variables:
export SPARK_MASTER=local
export SPARK_JAR=/spark_jar
where SPARK_JAR is an HDFS folder location where the spark-assembly-*.jar file is stored.
Step 7: Start the Pig shell.
pig -x spark
Here, Pig will start in Spark local mode.
You can also start Pig in spark_local mode by using the following command, without setting the above parameters:
pig -x spark_local
However, the purpose of the earlier command was to show how to set these parameters when working with a YARN and Spark cluster.
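To run Pig in MapReduce mode, as mentioned at the start of this section, no extra variables are needed beyond the Hadoop setup done earlier in this book; MapReduce is also the default mode when -x is omitted:
pig -x mapreduce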

Apache Spark Installation Steps

Apache Spark is a fast, general-purpose processing framework made up of multiple unified components, which enables robust processing and compatibility for users with different specializations. For DBAs, we have Spark SQL. For programmers, we have Spark Core. For analytics, we have Spark MLlib, and for near-real-time processing, we have Spark Streaming. In this section, we will limit our discussion to the installation of Apache Spark.
Lab 66 Installing and Configuring Apache Spark

Step 1: Download Apache Spark from http://spark.apache.org.
wget https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
Step 2: Extract the tar file.
tar -xvzf spark-1.6.3-bin-hadoop2.6.tgz
Step 3: Rename the extracted folder.
mv spark-1.6.3-bin-hadoop2.6 spark
Step 4: Setup the environment variables in the system.
vi /home/hadoop/.bashrc
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save it!
Step 5: Update the bash
exec bash
Step 6: Create a log directory.
sudo mkdir -p /var/log/spark
Step 7: Change the ownership of the log folder.
sudo chown -R $USER:$GROUP /var/log/spark
Step 8: Create a tmp directory where Spark will maintain the staging data.
mkdir /tmp/spark
Step 9: Configure Spark.
vi /home/hadoop/spark/conf/spark-env.sh
export HADOOP_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
export YARN_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
export SPARK_LOG_DIR=/var/log/spark
export SPARK_WORKER_DIR=/tmp/spark
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_MEMORY=1024m
export SPARK_EXECUTOR_MEMORY=1024m
export SPARK_WORKER_MEMORY=1024m
export SPARK_WORKER_CORES=1
export SPARK_EXECUTOR_CORES=1
Step 10: Configure the Slaves in Spark.
vi /home/hadoop/spark/conf/slaves
hadoopvm
Step 11: Start the Spark Services
start-master.sh
start-slaves.sh
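To verify that the standalone daemons have come up, run jps; the list should include a Master and a Worker process in addition to the Hadoop daemons:
jps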
Step 12: Start the Spark Shell with master as Spark Master.
spark-shell --master spark://hadoopvm:7077
To start the Spark shell with master as local:
spark-shell --master local
To start the Spark shell with master on YARN:
spark-shell --master yarn

Apache Flume Installation Steps
Apache Flume is one of the tools which can be used for acquiring data from a
real-time data source. In this section, we will simply see how to install Apache
Flume. After installation, we will see how to grab data from a netcat source and
store the same in HDFS.
Lab 67 Installing and Configuring Apache Flume and Grabbing Data from a Netcat Source

Step 1: Download Apache Flume from flume.apache.org.
wget http://redrockdigimark.com/apachemirror/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz
Step 2: Extract the tar file
tar -xvzf apache-flume-1.7.0-bin.tar.gz
Step 3: Rename the extracted folder.
mv apache-flume-1.7.0-bin flume
Step 4: Setup the environment variables in the system.
vi /home/hadoop/.bashrc
#Add the following lines at the start of the file
export FLUME_HOME=/home/hadoop/flume
export PATH=$PATH:$FLUME_HOME/bin
Step 5: Update the environment variables.
exec bash
In this way, the installation process is completed successfully. Now, let's try an example to check whether Flume is working or not.
Step 1: Create a configuration file in the /home/hadoop location named myncgrab.conf with the following contents:
ncgrab.sources=getfromnetcat
ncgrab.sinks=writetohdfs
ncgrab.channels=gotoram
ncgrab.sources.getfromnetcat.type=netcat
ncgrab.sources.getfromnetcat.bind=192.168.247.137
ncgrab.sources.getfromnetcat.port=9999
ncgrab.channels.gotoram.type=memory
ncgrab.sinks.writetohdfs.type=hdfs
ncgrab.sinks.writetohdfs.hdfs.path=hdfs://hadoopvm:8020/user/hadoop/flum
ncgrab.sinks.writetohdfs.hdfs.writeFormat=Text
ncgrab.sinks.writetohdfs.hdfs.fileType=DataStream
ncgrab.sources.getfromnetcat.channels=gotoram
ncgrab.sinks.writetohdfs.channel=gotoram
Step 2: Ensure that the Hadoop services are live and active, since here the sink is HDFS.
Step 3: Open two terminals, one for initializing the netcat listener and the other for the Flume agent. For the netcat listener, type the following command:
nc -l 9999
Step 4: Initiate the Flume grab.
flume-ng agent -n ncgrab -f /home/hadoop/myncgrab.conf
Step 5: Now start emitting data from the netcat listener console. Type some words and press enter. You will see that all the inputs are stored in the HDFS location specified in the Flume configuration.

Summary

Pig is used by scripters and is a useful tool when it comes to performing ad hoc analysis.
Apache Spark is a fast, general-purpose processing framework made up of multiple unified components, which enables robust processing and compatibility for users with different specializations.
Apache Flume is one of the tools which can be used for acquiring data from a real-time data source.
About the Author

Prashant Nair, founder of CognitoIT Consulting Pvt Ltd, developed a keen interest towards IT technologies at the age of nineteen, which led him to pursue his passion as a career. His organization provides training and consultancy on
the niche technologies like Bigdata, Cloud, Virtualization and DevOps tools.
Presently, Prashant is an established corporate trainer and Bigdata consultant
having an experience of more than twelve years in the fields of Datacenter and
cluster implementations, cloud computing, Bigdata, DevOps, and Virtualization.
He has also worked in the Bigdata domain as a Solution Architect and Hadoop
consultant. He has trained lakhs of professionals in Bigdata, Cloud and DevOps
tools.
He also enjoys writing technical blogs on his website
https://bigdataclassmumbai.com. You can connect with him on LinkedIn at
https://in.linkedin.com/in/prashant-solution-architect
