TEAM - POC
Deepak Gattala
Hadoop Administrator
DW Architect ABI/EBI
Spike White
Linux System Administrator
Kerberos Specialist.
Will OBrian
Active Directory and Identity.
Security Analyst.
Deepak Gattala
- Architect
What is Hadoop?
Hadoop is an open-source software framework.
It is an Apache top-level project, but the underlying technology came from a Google white paper about indexing rich textual and structural information.
It is architected to run on a large number of machines that don't share any memory or disks.
Hadoop enables distributed parallel processing of huge amounts of
data across inexpensive, industry-standard servers that both store
and process the data, and can scale without limits.
Prerequisites
The Hadoop framework mainly consists of two important components:
HDFS (Hadoop Distributed File System): a file system written in Java, used for storage, similar to ext3 or ext4 in Linux.
MapReduce: a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce is the paradigm used to process data on HDFS; the processing is moved to where the data lives.
Basic Linux commands, e.g. ls, cat, etc. (a minimal Java sketch of the HDFS equivalent follows this list).
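As a minimal sketch of how HDFS can be browsed from Java, assuming the client's classpath carries the cluster configuration and using /user/hue only as an example path, the equivalent of ls looks roughly like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough equivalent of "ls" on HDFS; /user/hue is a placeholder path.
public class ListHdfsDir {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(); // loads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);     // connects to the file system named by fs.defaultFS
    for (FileStatus status : fs.listStatus(new Path("/user/hue"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
  }
}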
HDFS
[Figure: HDFS architecture. A client sends an input file to the NameNode over TCP/IP networking; the NameNode keeps the metadata, and the DataNodes store the file blocks. Example placement of five blocks at replication 3: DataNode1 holds blocks 1, 4, 5; DataNode2 holds 2, 3, 4; DataNode3 holds 1, 2, 5; DataNode4 holds 3, 4, 5; DataNode5 holds 1, 2, 3.]
Every piece of data is split into blocks and distributed across the cluster. Blocks are typically 64 MB or 128 MB, and the default replication factor is 3.
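As a rough sketch of how this block layout can be inspected from Java, assuming the placeholder file path below, the client API reports which hosts hold each block's replicas:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints, for each block of a file, the DataNodes holding a replica.
public class ShowBlockLocations {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hue/word_count_text.txt"); // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i + " on hosts: " + String.join(", ", blocks[i].getHosts()));
    }
  }
}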
MapReduce - Example
MapReduce has five stages, commonly described as input splitting, mapping, shuffling, sorting, and reducing.
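For the word count used throughout this deck, for example: the map stage turns each line into (word, 1) pairs, the shuffle and sort stages bring all pairs for the same word to a single reducer, and the reduce stage sums those counts into a final (word, total) pair. The Java, Pig, and Hive versions of this example appear later.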
Hadoop Distributions
Even though Hadoop is an open-source project, some vendors package compatible versions together, add operations tooling, and provide great flexibility.
Below are the top three vendors:
Cloudera
Hortonworks
MapR
The underlying code remains bare-bones Apache open source; however, some vendors attach commercial products and services to specific distributions.
Ecosystem
Hadoop Configuration
Daemons of the Hadoop ecosystem: the NameNode (master), which holds the block information (metadata) for the file system. A minimal configuration sketch follows.
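As a sketch of how these settings are handled in code, assuming placeholder values, a Hadoop Configuration object loads core-site.xml and hdfs-site.xml from the classpath, and individual properties can be read or overridden:

import org.apache.hadoop.conf.Configuration;

// Reads and overrides two common Hadoop properties; the NameNode
// address below is a placeholder, not this POC cluster's real value.
public class ShowConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // where the NameNode listens
    conf.setInt("dfs.replication", 3);                            // default block replication
    System.out.println("NameNode: " + conf.get("fs.defaultFS"));
    System.out.println("Replication: " + conf.get("dfs.replication"));
  }
}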
Spike White
- Sr. System Engineer
Hardware Configuration
There are three different node configurations that are important in the architecture for optimal performance, typically master, worker (data), and edge (gateway) nodes.
Kerberos
The Kerberos protocol is a standard designed to provide strong
authentication within a client/server network environment.
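As a minimal sketch of what this means for a Hadoop client, assuming a placeholder principal and keytab path, the client authenticates from a keytab before touching the cluster:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Kerberos login for a Hadoop client; principal and keytab are placeholders.
public class KerberosLogin {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos"); // enable Kerberos auth
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "svc-hadoop@EXAMPLE.COM", "/etc/security/keytabs/svc-hadoop.keytab");
    System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
  }
}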
Will OBrian
- Active Directory Services
Active Directory
CM uses the service account servicegtminf.
servicegtminf was given rights to create/delete accounts within the us-poclab.dellpoc.com/Hadoop OU, as well as Full Control rights to any descendant objects (accounts).
Service accounts are created by CM by changing the user principal name.
The account serviceARFSqfwFob is configured to be used for the Sentry service running on ausgtmhadoop07.
Deepak Gattala
- Architect
package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  // Mapper: for each line of input, emit (word, 1) for every token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word; also used as the combiner.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // combiner pre-aggregates on the map side
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output HDFS paths come from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
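The job is typically packaged into a JAR and submitted from an edge node, for example hadoop jar wordcount.jar org.myorg.WordCount <input dir> <output dir>, where the JAR name and the HDFS paths are placeholders.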
Using Pig
-- Load the text file; each record is one line of text.
a = load '/user/hue/word_count_text.txt';
-- Split each line into words, one word per record.
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
-- Group identical words, then count each group.
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
Using Hive
Hive came out of Facebook, which uses Hadoop extensively and was looking for a way to give non-Java programmers (data analysts, statisticians, data scientists, etc.) access to the data in its Hadoop clusters.
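A minimal sketch of the same word count in HiveQL, assuming the input file from the Pig example and a placeholder table name docs:

-- One line of text per row.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/user/hue/word_count_text.txt' INTO TABLE docs;

-- Split each line into words, then count occurrences of each word.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) t AS word
GROUP BY word;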
Hive - Authorization
CREATE ROLE role_name;
DROP ROLE role_name;
GRANT ROLE role_name [, role_name] TO GROUP <groupName> [, GROUP <groupName>];
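As a usage sketch with placeholder role, group, and database names, and a Sentry-style privilege grant:

CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;
-- With Sentry, privileges are then granted to the role:
GRANT SELECT ON DATABASE poc_db TO ROLE analyst;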
Hive - Authorization
The object hierarchy where security can be applied can be as granular as the following, which with Sentry is typically: server, database, table or view, and URI (HDFS path).
Cloudera Manager
Cloudera provides a web interface, Cloudera Manager, for managing the cluster.
Questions?