
Big Data and Hadoop

By
Ujjwal Kumar Gupta

Contents
Why Big Data & Hadoop
Drawbacks of Traditional Database
Hadoop History
What is Hadoop & How it Works
Hadoop Cluster
Hadoop Ecosystem

Course Topics
Week 1
Understanding Big Data
Hadoop Architecture
Week 2
Introduction to Hadoop 2.x
Data loading Techniques
Hadoop Project Environment
Week 3
Hadoop MapReduce Framework
Programming MapReduce
Hadoop Installation Cluster Setup
Week 4
Analytics using Pig
Understanding Pig Latin
Week 5
Analytics using Hive
Understanding Hive QL
Sqoop Connectivity
Week 6
NoSQL Databases
Understanding HBASE
Zookeeper
Week 7
Apache Spark Framework
Programming Spark
Week 8
Real world Datasets and Analysis
Project Discussion

Why Big Data & Hadoop?

Following are the reasons why Big Data is needed:
90% of the data in the world today has been created in the last two years alone.
80% of the data is unstructured or exists in widely varying structures, which are difficult to analyze.
Structured formats have some limitations with respect to handling large quantities of data.
It is difficult to integrate information distributed across multiple systems.
Most business users do not know what should be analyzed.
Potentially valuable data is dormant or discarded.
It is too expensive to justify the integration of large volumes of unstructured data.
A lot of information has a short, useful lifespan.

Drawbacks of Traditional Database
Expensive: out of reach for small and mid-size companies
Scalability: as data grows, expanding the system is a challenging task
Time consuming: it takes a lot of time to store and process data

What is Hadoop
Open source framework designed for storage and
processing of large scale data on clusters of
commodity hardware
Created by Doug Cutting in 2006.
Cutting named the program after his son's toy elephant.

How Hadoop Works


When data is loaded onto the system, it is divided into blocks, typically 64 MB or 128 MB.
Tasks are divided into two phases:
Map tasks, which are done on small portions of data where the data is stored
Reduce tasks, which combine data to produce the final output
A master program allocates work to individual nodes.
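
As a rough illustration of the block division described above, the sketch below computes how a file would be cut into fixed-size blocks. It is a plain Python model, not Hadoop code; the function name `split_into_blocks` and the 300 MB example file are assumptions made for illustration.

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes of the blocks a file of the given size would occupy."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print(len(blocks), [size // MB for size in blocks])  # 3 [128, 128, 44]
```

Each block can then be handed to a separate map task, which is what lets the work run in parallel.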

3 Vs of Hadoop

Big Data Sources


The sources of Big Data are:
web logs;
sensor networks;
social media;
internet text and documents;
internet pages;
search index data;
atmospheric science, astronomy, biochemical and medical
records;
scientific research;
military surveillance; and
photography archives.

Hadoop Cluster

Who Uses Hadoop

Use Cases of Hadoop

Hadoop Commands
1. Print the Hadoop version
hadoop version
2. List the contents of the root directory in HDFS
hadoop fs -ls /
3. Report the amount of space used and available on currently mounted
filesystem
hadoop fs -df hdfs:/
4. Count the number of directories, files and bytes
hadoop fs -count hdfs:/
5. Run a DFS filesystem checking utility
hadoop fsck /

Hadoop Commands
6. Create a new directory
hadoop fs -mkdir /user/
7. Upload a file to HDFS from a local directory
hadoop fs -put data/sample.txt /user/
8. View the contents of a file
hadoop fs -cat /text.txt
9. Delete a file from HDFS
hadoop fs -rm /usr/text.txt
10. Delete a directory
hadoop fs -rm -r /user/

Hadoop Commands
11. Download File from HDFS to local system
hadoop fs -get /user/test.txt /home/hadoop/
12. Copy a file from one directory to another
hadoop fs -cp /usr/text.txt /input/
13. Move a file from one directory to another
hadoop fs -mv /text.txt /input/
14. Change the replication factor
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
15. Copy files from one cluster to another
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

MapReduce Overview

A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
Consists of two main phases:
Map
Reduce

MapReduce Features

Automatic parallelization and distribution

Fault-Tolerance

Provides a clean abstraction for programmers to use

MapReduce Features

The key reason to perform mapping and reducing is to speed up the execution of a specific process by splitting the process into a number of tasks, thus enabling parallel work.

[Figure: individual work compared with parallel work]

Word Count Example

Count the number of words:
This quick brown fox jumps over the lazy dog. A dog is a man's best friend.

MapReduce execution consists of the following phases:

Map Phase
Reads the assigned input split
Parses input into records (key/value pairs)
Applies the map function to each record
Informs the master node of its completion

Partition Phase
Each mapper must determine which reducer will receive each of the outputs
For any key, the destination partition is the same
Number of partitions = number of reducers

Shuffle Phase
Fetches input data from all map tasks for the portion corresponding to the reduce task's bucket

Sort Phase
Merge-sorts all map outputs into a single run

Reduce Phase
Applies the user-defined reduce function to the merged run
Arguments: key and corresponding list of values
Writes output to a file in HDFS
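
The phases above can be imitated in a few lines of plain Python using the sentences from the word count slide. This is only a single-process sketch of the idea, not the Hadoop API: the function names, the use of `zlib.crc32` as the partitioner, and the choice of two reducers are all assumptions made for illustration.

```python
import re
import zlib
from collections import defaultdict


def map_phase(split):
    """Parse one input split into words and emit a (word, 1) pair per word."""
    return [(word, 1) for word in re.findall(r"[a-z']+", split.lower())]


def partition(key, num_reducers):
    """Deterministic partitioner: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_word_count(splits, num_reducers=2):
    # Map + partition: every map output pair is routed to one reducer bucket.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_phase(split):
            buckets[partition(key, num_reducers)][key].append(value)
    # Shuffle + sort + reduce: each reducer visits its keys in sorted order
    # and sums the list of values collected for each key.
    counts = {}
    for bucket in buckets:
        for key in sorted(bucket):
            counts[key] = sum(bucket[key])
    return counts


splits = [
    "This quick brown fox jumps over the lazy dog.",
    "A dog is a man's best friend.",
]
counts = run_word_count(splits)
print(counts["dog"], counts["a"], counts["fox"])  # 2 2 1
```

Because the partitioner depends only on the key, both occurrences of "dog" land in the same bucket even though they come from different splits, which is what allows a single reducer to produce the final count.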

Running wordcount example


The Hadoop distribution ships with some sample MapReduce programs, so we can run those examples directly to test MapReduce.
1. Copy the jar file of the examples from
/<hadoop install path>/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar
2. Put a text file into HDFS as the input file
hadoop fs -put input.txt
3. Run the WordCount program from the examples jar
hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount input.txt output
4. Check the output in the output folder
hadoop fs -cat output/part*

Introduction to Pig
Pig is one of the components of the Hadoop eco-system.
Pig is a high-level data flow scripting language.
Pig runs on the Hadoop clusters.
Pig is an Apache open-source project.
Pig uses HDFS for storing and retrieving data and Hadoop MapReduce for
processing Big Data.

Data Models
As part of its data model, Pig supports four basic types:

Atom
A simple atomic value
Example: 'Mike'

Tuple
A sequence of fields that can be any of the data types
Example: ('Mike', 43)

Bag
A collection of tuples of potentially varying structures; can contain duplicates
Example: {('Mike'), ('Doug', (43, 45))}

Map
An associative array; the key must be a chararray but the value can be any type
Example: ['name'#'Mike', 'phone'#5551212]
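
One way to picture these four types is through their closest Python counterparts. This is only an analogy, assuming Python built-ins stand in for the Pig types; Pig itself is not involved, and the variable names are made up for illustration.

```python
# Python stand-ins for the four Pig types (an analogy, not Pig code).
atom = "Mike"                                    # atom: a single simple value
record = ("Mike", 43)                            # tuple: an ordered sequence of fields
bag = [("Mike",), ("Doug", (43, 45))]            # bag: tuples of varying structure
mapping = {"name": "Mike", "phone": "5551212"}   # map: string keys, values of any type

# A bag may hold the same tuple more than once.
bag.append(("Mike",))
print(len(bag), record[1], mapping["name"])
```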

Pig Execution Modes


Local mode: Pig depends on the OS file system.
MapReduce mode: Pig relies on HDFS.

Installing Pig
1. Download the Pig tar file from the Apache website
2. Extract the tar file
$ tar xvzf pig-0.15.0.tar.gz
3. Move the extracted directory to the install location
$ sudo mv pig-0.15.0 /usr/local/pig
4. Set the path in bashrc
$ sudo gedit ~/.bashrc
Add the following lines at the end of the file
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
5. Reload bashrc to apply the changes
$ source ~/.bashrc

Pig Commands

Loading and Storing Methods


Loading refers to reading relations from files into the Pig buffer.
Storing refers to writing data from the Pig engine to HDFS.

Filtering and Transforming


Filtering can be defined as filtering data based on conditional clause.
Transforming refers to making data presentable for extracting logical data.

Grouping and Sorting


Grouping refers to generating a group of meaningful data.
Sorting of data is done to arrange the data in an ascending or descending order.

Combining and Splitting


Combining refers to performing union operation of the data stored in the
variable.
Splitting of data refers to separating data with a logical meaning.

Introduction to Hive
Hive is a data warehouse system for Hadoop that facilitates ad-hoc
queries and the analysis of large data sets stored in Hadoop.
It provides a SQL-like language called HiveQL (HQL). Due to its SQL-like interface, Hive is a popular choice for Hadoop analytics.
It provides massive scale-out and fault tolerance capabilities for data storage and processing on commodity hardware.
Relying on MapReduce for execution, Hive is batch-oriented and has high
latency for query execution.

System Architecture and Components of Hive


[Figure: architecture of the Hive system, showing the role of Hive and Hadoop in the development process.]
Hive components shown: Command Line Interface, Web Interface, Thrift Server (JDBC, ODBC), Driver (Compiler, Optimizer, Executor), Metastore
Hadoop components shown: Name Node, Data Node, Resource Manager, Node Manager

Metastore
Metastore is the component that stores the system catalog and metadata about tables,
columns, partitions, and so on. Metadata is stored in a traditional RDBMS format.
Apache Hive uses Derby database by default. Any JDBC compliant database like MySQL
can be used for Metastore.

Metastore Configuration
The key attributes that should be configured for Hive Metastore are given below:

Parameter: javax.jdo.option.ConnectionURL
Description: JDBC connect string for a JDBC metastore
Example: jdbc:derby://localhost:1527/metastore_db;create=true

Parameter: javax.jdo.option.ConnectionDriverName
Description: JDBC driver name
Example: org.apache.derby.jdbc.ClientDriver

Parameter: javax.jdo.option.ConnectionUserName
Description: Username for database
Example: APP

Parameter: javax.jdo.option.ConnectionPassword
Description: Password
Example: mine

Metastore Configuration Template
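
The original template image is not available, so the following is a hedged reconstruction rather than the author's file. It is a small Python sketch that prints the four metastore attributes in the `<configuration>/<property>/<name>/<value>` layout used by Hadoop-style XML configuration files such as hive-site.xml. The helper name `build_hive_site` is an assumption; the property values are the example values listed for the metastore attributes.

```python
import xml.etree.ElementTree as ET

# Example values taken from the metastore attribute list; adjust for a real setup.
METASTORE_PROPERTIES = {
    "javax.jdo.option.ConnectionURL": "jdbc:derby://localhost:1527/metastore_db;create=true",
    "javax.jdo.option.ConnectionDriverName": "org.apache.derby.jdbc.ClientDriver",
    "javax.jdo.option.ConnectionUserName": "APP",
    "javax.jdo.option.ConnectionPassword": "mine",
}


def build_hive_site(properties):
    """Render the properties as a <configuration> XML document string."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


xml_text = build_hive_site(METASTORE_PROPERTIES)
print(xml_text)
```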

Driver
Driver is the component that:
manages the lifecycle of a Hive Query Language (HiveQL) statement as it moves through Hive; and
maintains a session handle and any session statistics.

Query Compiler
Query compiler compiles HiveQL into a Directed Acyclic Graph (DAG) of MapReduce tasks.

Query Optimizer
The query optimizer:
consists of a chain of transformations, so that the operator DAG resulting from one transformation is passed as an input to the next transformation.
performs tasks such as column pruning, partition pruning, and repartitioning of data.

Execution Engine
The execution engine:
executes the tasks produced by the compiler in proper dependency order.
interacts with the underlying Hadoop instance to ensure perfect synchronization with Hadoop services.

Hive Server
Hive Server provides a Thrift interface and a Java Database Connectivity/Open Database Connectivity (JDBC/ODBC) server. It enables the integration of Hive with other applications.

Client Components
A developer uses the client components to perform development in Hive. The client components include the Command Line Interface (CLI), the web user interface (UI), and the JDBC/ODBC driver.

Hive Tables
Tables in Hive are analogous to tables in relational databases. A Hive table logically
comprises the data being stored and the associated meta data. Each table has a
corresponding directory in HDFS.
Two types of tables in Hive:
Managed Tables - tables managed by Hive
External Tables - tables managed by the user
CREATE TABLE t1(ds string, ctry float, li array<map<string, struct<p1:int, p2:int>>>);
CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION '/user/mytables/mydata';

Hive Data Types


Data Types in Hive

Primitive Types
Integers: TINYINT, SMALLINT, INT, and BIGINT
Boolean: BOOLEAN
Floating point numbers: FLOAT and DOUBLE
String: STRING

Complex Types
Structs: {a INT; b INT}
Maps: M['group']
Arrays: ['a', 'b', 'c'], A[1] returns 'b'

User-defined Types
Structures with attributes
Attributes can be of any type

Installing Hive
1. Download the Hive tar file from the Apache website
2. Extract the tar file
$ tar xvzf apache-hive-2.0.0-bin.tar.gz
3. Move the extracted directory to the install location
$ sudo mv apache-hive-2.0.0-bin /usr/local/hive
4. Set the path in bashrc
$ sudo gedit ~/.bashrc
Add the following lines at the end of the file
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
5. Reload bashrc to apply the changes
$ source ~/.bashrc
6. Initialize the Metastore database
$ schematool -initSchema -dbType derby

Hive Query Language


1. Create Database
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
2. List All Databases
SHOW DATABASES;
3. Enabling a Database
USE dbname;
4. Deleting a Database
DROP DATABASE [IF EXISTS] userdb; -- deletes an empty database
DROP DATABASE [IF EXISTS] userdb CASCADE; -- deletes the database along with all of its tables
5. Displaying Current database in hive shell
$ hive --hiveconf hive.cli.print.current.db=true

Hive Query Language


1. Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

Hive Query Language


1. Create Table Sample
CREATE TABLE IF NOT EXISTS employee
(
eid int, name String,
salary String, destination String
)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
2. List All tables
SHOW TABLES;
3. Checking schema of a table
DESCRIBE tablename;
DESCRIBE EXTENDED tablename;
SHOW CREATE TABLE emp;

Hive Query Language


Loading Data into table
LOAD DATA [LOCAL] INPATH '/home/user/sample.txt'
[OVERWRITE] INTO TABLE employee;
INSERT OVERWRITE TABLE t1 SELECT * FROM t2;
INSERT OVERWRITE TABLE sample1 SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

-- Multi-table insert from a single source table


FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)

Hive Query Language


1. Altering a table
ALTER TABLE employee RENAME TO emp;
ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
ALTER TABLE employee CHANGE name ename String;
ALTER TABLE employee CHANGE salary salary Double;
-- Hive has no DROP COLUMN statement; to remove a column, list only the columns to keep in REPLACE COLUMNS
ALTER TABLE employee REPLACE COLUMNS (empid INT, name STRING);
2. Deleting a table
DROP TABLE [IF EXISTS] employee;

Hive Query Language


1. Querying records from a table
SELECT * FROM employee WHERE Id=1205;
SELECT * FROM employee WHERE Salary>=40000;
SELECT * FROM <tablename> LIMIT 10;
SELECT * FROM <tablename> WHERE freq > 100 SORT BY freq ASC LIMIT 10;
SELECT 20+30 ADD FROM temp;
SELECT * FROM employee WHERE Salary>40000 AND Dept='TP';
SELECT round(2.6) FROM temp;
SELECT floor(2.6) FROM temp;
SELECT ceil(2.6) FROM temp;
SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
SELECT Dept, count(*) FROM employee GROUP BY DEPT;

Hive Query Language


1. JOINS in a table
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID =
o.CUSTOMER_ID);
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON
(c.ID = o.CUSTOMER_ID);

Partitions in Hive Table


Partitions are analogous to dense indexes on columns. Following are the features of partitions:
They contain nested sub-directories in HDFS for each combination of partition column values.
They allow users to retrieve rows efficiently.
Partition Types
Static Partition (Default)
Dynamic Partition
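
To make the nested sub-directory idea concrete, here is a small Python sketch of how partition values translate into a `column=value` directory path, and why a query that filters on a partition column only needs to read some directories. The helper names, the `/user/hive/warehouse` base path, and the sample partition values are assumptions for illustration; the actual location depends on the Hive configuration.

```python
def partition_path(table_dir, partition_spec):
    """Build the nested directory for one combination of partition values."""
    parts = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(partition_dirs, column, value):
    """Keep only the directories whose path contains the wanted column=value."""
    wanted = f"{column}={value}"
    return [path for path in partition_dirs if wanted in path.split("/")]


table_dir = "/user/hive/warehouse/test_part"  # assumed warehouse location
dirs = [
    partition_path(table_dir, [("ds", "2009-01-01"), ("hr", 11)]),
    partition_path(table_dir, [("ds", "2009-01-01"), ("hr", 12)]),
    partition_path(table_dir, [("ds", "2009-01-02"), ("hr", 12)]),
]
print(dirs[1])  # /user/hive/warehouse/test_part/ds=2009-01-01/hr=12
print(len(prune(dirs, "ds", "2009-01-01")))  # 2
```

A filter on `ds` leaves two of the three directories to scan, which is the sense in which partitions let rows be retrieved efficiently.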

Static Partition
Static partitioning is the default type of partitioning in Hive. With static partitions, we have to create every partition of the table manually and load the data for each partition separately.
1. Creating Partition Table
CREATE TABLE foo (id INT, msg STRING)
PARTITIONED BY (dt STRING);
2. Loading Data in Partition table
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
INTO TABLE employee
PARTITION (country = 'US', state = 'CA');
3. Listing partitions of a table
Show partitions emp;

Static Partition
1. Altering Partition Table
ALTER TABLE employee
ADD PARTITION (year=2013)
location '/2012/part2012';
ALTER TABLE employee PARTITION (year=1203)
RENAME TO PARTITION (Yoj=1203);
ALTER TABLE employee DROP [IF EXISTS]
PARTITION (year=1203);
INSERT OVERWRITE TABLE test_part PARTITION(ds='2009-01-01',
hr=12)
SELECT * FROM t ;
2. Querying a Partition table

Dynamic Partition
Dynamic partitioning creates partitions automatically, so we don't have to provide a separate file for each partition. It is disabled by default and must be enabled before use.
Enabling Dynamic Partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.max.dynamic.partitions.pernode=256; -- in case of a cluster
SET hive.exec.dynamic.partition.mode=nonstrict;

Loading Data in Dynamic Partition
We can't load data directly into a dynamically partitioned table with the LOAD command. We first load the data into a normal table and then insert it from that table into the partitioned table.
Rules for Dynamic Partition
Values for the partition columns must not be specified.
The partition columns must be specified at the end of the SELECT clause, in the same order as they are specified in the PARTITIONED BY clause.
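
The second rule can be pictured with a short Python model: the trailing columns of each selected row are peeled off, in order, and used as the partition key, while the remaining columns are stored as data. This is a sketch of the idea only, not of Hive's implementation; the function name and the three sample users are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Partition columns in the order they would appear in PARTITIONED BY.
PARTITION_COLUMNS = ("year", "month", "day")


def dynamic_partition_insert(rows):
    """Group rows by their trailing partition values, keeping the leading columns as data."""
    partitions = defaultdict(list)
    n = len(PARTITION_COLUMNS)
    for row in rows:
        data, keys = row[:-n], row[-n:]
        partitions[tuple(zip(PARTITION_COLUMNS, keys))].append(data)
    return dict(partitions)


# Hypothetical source rows: (id, name, dt)
users = [
    (1, "asha", date(2016, 3, 14)),
    (2, "ravi", date(2016, 3, 14)),
    (3, "mei", date(2016, 4, 1)),
]
# Equivalent of selecting id, name, year(dt), month(dt), day(dt)
selected = [(uid, name, dt.year, dt.month, dt.day) for uid, name, dt in users]
result = dynamic_partition_insert(selected)
print(len(result))  # 2
```

If the trailing columns were selected in a different order, the values would silently end up under the wrong partition columns, which is why the ordering rule matters.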

Dynamic Partition
Loading Data in Dynamic Partition
CREATE TABLE part_u (
id int, name string)
PARTITIONED BY (
year INT, month INT, day INT);
CREATE TABLE users (
id int, name string , dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
INTO TABLE users;
INSERT INTO TABLE part_u PARTITION(year,month,day)
SELECT id,name,year(dt),month(dt),day(dt)
from users;

Bucketing in Hive Tables


Similar to partitioning, bucketing divides table data into parts. The data is split into a fixed number of buckets, which is specified when the table is created.
Bucketing can be used in place of, or together with, partitioning, because partitions are often unevenly sized, which wastes resources.
Creating Bucketed Table
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;
Listing Records from a particular bucket
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
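
The sketch below models how rows might be assigned to buckets and what the sampling query would then select. It assumes that, for an INT column, the bucket is the non-negative column value modulo the number of buckets, and that bucket numbers in TABLESAMPLE are 1-based; treat both as assumptions to verify against the Hive documentation. The function names and the eight sample rows are made up for illustration.

```python
def bucket_for(user_id, num_buckets):
    """Assumed rule for an INT column: non-negative value modulo the bucket count."""
    return (user_id & 0x7FFFFFFF) % num_buckets


def tablesample(rows, bucket_number, out_of):
    """Rows a BUCKET bucket_number OUT OF out_of sample on id would be expected to return."""
    return [row for row in rows if bucket_for(row[0], out_of) == bucket_number - 1]


rows = [(i, f"user{i}") for i in range(1, 9)]
sample = tablesample(rows, 1, 4)
print(sample)  # [(4, 'user4'), (8, 'user8')]
```

Because the bucket depends only on the id, each of the four buckets receives two of the eight rows here, which is the even distribution that bucketing aims for.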