
Hadoop @ ABI

Insight Into The Ecosystem

Tuesday, October 07, 2014

Dell - Internal Use - Confidential

TEAM - POC
Deepak Gattala
Hadoop Administrator
DW Architect ABI/EBI

Spike White
Linux System Administrator
Kerberos Specialist.

Will OBrian
Active Directory and Identity.
Security Analyst.

Special thanks to Bart Crider, Attila Finta, Mike Porreca, Feargal Tobin, and Alisha Worsham for supporting this effort.

Analytics and BI (ABI) | Dell IT


Dell - Internal Use - Confidential

Enterprise Business Intelligence (EBI)

Agenda [ Next 1 Hour ]


Deepak Gattala [15 minutes]
Get Familiar with Hadoop (Cloudera). [5 minutes]
HDFS & MapReduce tour. [5 minutes]
Hadoop Family and Ecosystem. [5 minutes]

Spike White and Will OBrian [15 minutes]


Integration and Security. [5 minutes]
Kerberos [5 minutes]
AD Forest and OU [5 minutes]

Deepak Gattala [15 minutes]


Understanding Hive and Impala. [5 minutes]
Cloudera Manager. [5 minutes]
DELL IT/Services Use case and Interest. [5 minutes]

Deepak Gattala, Spike White and Will OBrian [15 minutes]


Product Demo. [5 minutes]
Questions & Answers. [10 minutes]


Deepak Gattala
- Architect


What is Hadoop?
Hadoop is an open-source software framework.
It is an Apache top-level project, but the underlying technology came from Google's white papers on indexing all the rich textual and structural information on the web.
It is architected to run on a large number of machines that don't share any memory or disks.
Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits.
It is designed to solve problems with large data while running analytics that are deep and computationally extensive.


Prerequisites
The Hadoop framework mainly consists of two important components:
HDFS (Hadoop Distributed File System)
The MapReduce paradigm

HDFS is a file system written in Java, used for storage, similar to ext3 or ext4 in Linux.
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
MapReduce is the paradigm used to process data on HDFS; the processing is moved to where the data resides.
Basic Linux commands are also assumed, e.g. ls, cat, etc.


HDFS
[Diagram: a client sends an input file to the NameNode; the NameNode keeps the block metadata while the file blocks travel over TCP/IP networking to the DataNodes.

Example block placement across five DataNodes:
DataNode1: blocks 1, 4, 5
DataNode2: blocks 2, 3, 4
DataNode3: blocks 1, 2, 5
DataNode4: blocks 3, 4, 5
DataNode5: blocks 1, 2, 3]

Every file is split into blocks and distributed across the cluster.
Blocks are typically 64 MB or 128 MB; the default replication factor is 3.
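The block and replica arithmetic above can be sketched in a few lines of Python. This is a toy illustration, not part of the POC; the function name and the 600 MB example file are invented for the sketch.

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Return (num_blocks, total_replicas, raw_storage_mb) for one file on HDFS."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be partial
    total_replicas = num_blocks * replication              # copies spread over DataNodes
    raw_storage_mb = file_size_mb * replication            # blocks are not padded to full size
    return num_blocks, total_replicas, raw_storage_mb

# A 600 MB file with 128 MB blocks and replication 3:
blocks, replicas, raw = hdfs_storage(600)
print(blocks, replicas, raw)  # 5 15 1800
```

Five blocks replicated three times gives fifteen block copies, which matches the fifteen placements in the diagram above.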


MapReduce - Example
MapReduce has five stages: input splitting, mapping, shuffling/sorting, reducing, and final output.
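The stages can be simulated in plain Python. This is a toy sketch of the data flow only, not how Hadoop actually executes a job; the function names and the two sample input splits are invented.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every token in the input split
    return [(word, 1) for word in line.split()]

def shuffle_sort(pairs):
    # Shuffle/sort: sort intermediate pairs and group them by key
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(word, counts):
    # Reduce: sum the counts for each word
    return word, sum(counts)

splits = ["hello hadoop", "hello world"]                     # input splits
mapped = [p for line in splits for p in map_phase(line)]     # map
grouped = shuffle_sort(mapped)                               # shuffle/sort
result = dict(reduce_phase(w, c) for w, c in grouped.items())  # reduce
print(result)  # {'hadoop': 1, 'hello': 2, 'world': 1}
```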


Hadoop Distributions
Even though Hadoop is an open source project, several vendors package compatible versions together, add operations tooling, and provide great flexibility.
The top three vendors are:
Cloudera
Hortonworks
MapR

The underlying code remains bare-bones Apache open source; however, some vendors attach commercial products and services to specific distributions.


Eco-System

[Diagram: the Hadoop ecosystem components.]

Hadoop Configuration
Daemons of the Hadoop ecosystem:
NameNode (Master): holds the block information.
Secondary NameNode (Master): checkpoints the NameNode.
DataNode (Slave): where the data resides.
TaskTracker (Slave): the workers.
JobTracker (Master): checks and keeps job status.

Hadoop by default replicates each block of data three times for redundancy and failover.


Spike White
- Sr. System Engineer


Hardware Configuration
There are three different types of node configuration that are very important in the architecture for optimal performance.

For a small to medium size cluster (fewer than 1,000 nodes):
Master Nodes (generally 2 or 3 in a cluster)
Slave Nodes (can scale from 1 to N nodes)
Edge Nodes (normally 2, for load balancing)

Each category of node has a specific configuration with respect to the hardware and also the Hadoop software.
The Dell reference architecture can be found at: http://files.cloudera.com/pdf/Dell_Cloudera_Solution_for_Apache_Hadoop_Reference_Architecture.pdf


Solution Center Rack Diagram


Location of the rack: RR8, EBI Lab.
CM 5.1 and CDH 5.1.2.
2 Name Nodes (R720s)
6 Data Nodes (R720 XDs)
2 Edge Nodes (R720s)
1G network cards (due for upgrade)
Force10 S60 switch, 1G (due for upgrade)

If a system crashes, bring it offline and fix it; no impact.
If a hard drive crashes, replace it and recreate the mount.


QUEST VINTELA Authentication


Vintela Authentication Services (VAS) implements Kerberos and LDAP functionality on UNIX and Linux systems, and fully integrates with AD.
The benefits of using VAS include the following:
UNIX and Linux users and computers are managed through the Active Directory Users and Computers Microsoft Management Console (MMC) snap-in.
Kerberos is the protocol used to secure LDAP traffic.
Performance is tuned to work effectively with Active Directory.


Kerberos
The Kerberos protocol is a standard designed to provide strong authentication within a client/server network environment.
Kerberos network messages are encrypted and decrypted using algorithms that make them very difficult to decode back into their original form.
Kerberos uses a number of terms:
Principal: all entities within Kerberos, including users, computers, and services, are known as principals. Principal names are unique.
Realm: a principal is a member of a realm.
Ticket: a ticket is the fundamental unit of Kerberos authentication. It is a carefully constructed message containing authentication information, which is passed between computers.
Key Distribution Center: the Key Distribution Center (KDC) is made up of three components:
a database of principals containing users, computers, and services;
an Authentication Server that issues Ticket Granting Tickets (TGTs);
a Ticket Granting Service (TGS) that issues service tickets granting clients access to specific services.
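The three KDC components and the TGT-to-service-ticket flow can be sketched as a toy simulation. This models only the message flow, with no cryptography at all, and every name in it (the class, the principals, the methods) is invented for the sketch; real Kerberos is nothing this simple.

```python
from secrets import token_hex

class ToyKDC:
    """Toy Key Distribution Center: a principal database, an Authentication
    Server that issues TGTs, and a Ticket Granting Service that issues
    service tickets. No crypto -- only the message flow is modeled."""

    def __init__(self, realm):
        self.realm = realm
        self.principals = set()   # database of principals
        self.tgts = {}            # TGT -> principal
        self.tickets = {}         # service ticket -> (principal, service)

    def add_principal(self, name):
        self.principals.add(name)

    def authenticate(self, principal):
        # Authentication Server: verify the principal, then issue a TGT
        if principal not in self.principals:
            raise PermissionError(f"unknown principal {principal}@{self.realm}")
        tgt = token_hex(8)
        self.tgts[tgt] = principal
        return tgt

    def get_service_ticket(self, tgt, service):
        # Ticket Granting Service: exchange a valid TGT for a service ticket
        if tgt not in self.tgts:
            raise PermissionError("invalid TGT")
        ticket = token_hex(8)
        self.tickets[ticket] = (self.tgts[tgt], service)
        return ticket

kdc = ToyKDC("US-POCLAB.DELLPOC.COM")
kdc.add_principal("deepak")
kdc.add_principal("hdfs/node1")
tgt = kdc.authenticate("deepak")                # analogous to kinit
st = kdc.get_service_ticket(tgt, "hdfs/node1")  # request access to a service
print(kdc.tickets[st])  # ('deepak', 'hdfs/node1')
```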

Kerberos

[Diagram: the Kerberos authentication flow.]

Will OBrian
- Active Directory Services


Active Directory

The US-POCLAB.DELLPOC.COM Active Directory (AD) domain was utilized for the Cloudera setup.
A Hadoop Organizational Unit (OU) was manually created under us-poclab.dellpoc.com/Unix/Servers.
A parent service account (Servicegtminf) was manually created under the us-poclab.dellpoc.com/Service Accounts OU.

Active Directory
CM uses the service account servicegtminf.
Servicegtminf was given rights to create/delete accounts within the us-poclab.dellpoc.com/Hadoop OU, as well as Full Control rights over any descendant objects (accounts).
Service accounts are created by CM by changing the user principal names.
The account serviceARFSqfwFob is configured to be used for the Sentry service running on ausgtmhadoop07.

Active Directory

[Screenshot: the AD configuration for the POC.]

Deepak Gattala
- Architect


Word Count Quiz: Which would you choose?

Using MapReduce

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}


Using PIG
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';

Using Hive

select word, count(*) from file_table group by word;


Hive
Facebook uses Hadoop extensively and was looking for a way to give non-Java programmers access to the data in its Hadoop clusters:
data analysts, statisticians, data scientists, etc.

Hive is a SQL SELECT statement to MapReduce translator. It:
takes Hive queries and turns them into Java MapReduce code, then
submits the code to the cluster, and
displays the results back to the user. Note: not all SQL works!

Hive is much easier to learn than Java-based MapReduce:
Writing HiveQL queries is much faster than writing the equivalent Java code.
Many people already know SQL and can rapidly start using Hive to query and manipulate data in the cluster.


Hive - Authorization
CREATE ROLE role_name;
DROP ROLE role_name;
GRANT ROLE role_name [, role_name] TO GROUP <groupName> [, GROUP <groupName>];
REVOKE ROLE role_name [, role_name] FROM GROUP <groupName> [, GROUP <groupName>];
GRANT <PRIVILEGE> [, <PRIVILEGE>] ON <OBJECT> <object_name> TO ROLE <roleName> [, ROLE <roleName>];
REVOKE <PRIVILEGE> [, <PRIVILEGE>] ON <OBJECT> <object_name> FROM ROLE <roleName> [, ROLE <roleName>];
The POC uses the following groups:
gtm_hdp_inf_dev: Hive group used for the POC
gtm_hdp_inf_adm: Cloudera Manager admin group

Hive - Authorization
The object hierarchy where you can apply security can be as granular as shown below:


DELL ABI Use Cases


SAIE (Support Assist Intelligence Engine) [Design & Architecture]
Teradata Appliance (PROD) & home-grown (Hortonworks 2.1) for DEV/SIT

DCCMT/NGMT Hadoop Reporting [POC]
Cloudera CDH 5.1.2 [due for upgrade to CDH 5.2 soon]

Server log analysis POC on Hadoop [POC]
Cloudera CDH 5.1.2 [due for upgrade to CDH 5.2 soon]

Big Data Edition ETL use case [POC]
Informatica 9.6.1 & Cloudera CDH 5.1.2 [due for upgrade to CDH 5.2 soon]

MAW (Marketing Analytics Workbench) [Beta Production]
Teradata Appliance HDP 1.3.2 [due for upgrade to HDP 2.1]

Rainstor Archival Strategy [In Production]
Cloudera CDH 4.2


Cloudera Manager
Cloudera Manager provides a web interface for cluster management.


Questions???


Cloudera Hadoop Demo

